igraph / rigraph

igraph R package
https://r.igraph.org

R: graph.data.frame converts factors to character #34

Open gaborcsardi opened 9 years ago

gaborcsardi commented 9 years ago

From @gaborcsardi on July 26, 2014 3:42

Add an option to keep factors as factors. See http://stackoverflow.com/questions/24965840/igraph-graph-data-frame-silently-converts-factors-to-character-vectors

Copied from original issue: igraph/igraph#665

gaborcsardi commented 9 years ago

From @elbamos on January 6, 2015 5:25

I'm writing to join in this request... In the first place, as a matter of R, it shouldn't be altering a variable type to or from factor silently, because the factor data definition contains information that's important in, e.g., regression. Similarly, factors are the natural data type for some graph-relevant data, like community membership.

Setting vertex colors should also be through factors in vertex attributes; if the graph is going to be visualized with ggplot2 or ggvis or the like, there's a whole framework for factor aesthetics.

This seems like a super-easy thing to fix/add/change. If I just do this, will you take the pull request? And if so, how would you prefer it implemented -- I'm thinking it's a graph-level "stringsAsFactors" preference set at graph creation.

gaborcsardi commented 9 years ago

There are several problems with factors. One is that you cannot write them to standard file formats. I mean, you can, but the fact that they are factors is lost. (There are no factors in GraphML, GML, etc.)

Another one is that you cannot even easily create a factor attribute in igraph currently:

g <- make_ring(10)
V(g)$foo <- factor(letters[1:10])
V(g)$foo
#>  [1]  1  2  3  4  5  6  7  8  9 10

g <- set_vertex_attr(g, "bar", value = factor(letters[1:10]))
g
#> IGRAPH U--- 10 10 -- Ring graph
#> + attr: name (g/c), mutual (g/l), circular (g/l), foo (v/n), bar
#> | (v/n)
V(g)$bar
#>  [1]  1  2  3  4  5  6  7  8  9 10

So at least this needs to be changed, but there are a lot of potential hiccups. In general, vertex/edge attributes that are not atomic builtin classes are not handled well in igraph.
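The integer output above is what you would expect if the factor's class and levels are dropped and only its integer codes are stored. A minimal base R sketch of that effect follows; it does not call igraph, and whether igraph performs exactly this conversion internally is an assumption.

```r
# Base R only: this illustrates factor internals, not igraph's actual code path.
f <- factor(letters[1:10])

# A factor is an integer vector of codes plus a "levels" attribute.
typeof(f)
levels(f)

# Dropping the class and levels leaves only the codes,
# which matches the V(g)$foo output above.
codes <- as.integer(f)
codes

# Converting to character first keeps the labels instead of the codes.
labels <- as.character(f)
labels
```

Storing `as.character(f)` rather than `f` is therefore one way to keep the labels in an attribute, at the cost of losing the level ordering.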

igraph does not use ggplot for graph drawing, so I don't really see how factors would help with graph drawing. Also, why are factors natural for community membership? Maybe if you name your communities. Otherwise simple consecutive integer numbers are just as natural, and making them factors is just an unnecessary complication imho.

gaborcsardi commented 9 years ago

From @elbamos on January 6, 2015 5:54

Well, one function of the igraph package is plotting. Another is generation of certain statistics. A third, though, is that it's a data structure with a very convenient, well-thought-out syntax for creating, editing, manipulating, etc. graphical data.

igraph doesn't use ggplot for plotting. igraph objects, though, can be fed into plotting systems other than igraph's built-in plotting. This is what GGally::ggnet does and I've tried to do with ggnetwork.

Why are factors natural for community membership? Well, because community membership is categorical data. More practically, consider this workflow:

vinfo <- data.frame(...)  # a bunch of data about nodes, including dat1 and factor2
graph <- graph.data.frame(edges, vertices = vinfo)
V(graph)$astat <- igraph::a_stat_function(graph)  # placeholder name
V(graph)$comm <- igraph::a_community_membership_function(graph)  # placeholder name
graph %>% get.data.frame("vertices") %>% glm(dat1 ~ astat + comm + factor2, data = .)
or even
graph %>% get.data.frame("vertices") %>% glm(dat1 ~ astat + comm, data = .)

Without factors, that obviously will produce gobbledygook. This is a simple contrived example. Doing a lot of analysis to see how network structure relates to some other variables, being able to store factors in igraph would really simplify the workflow.
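To make the modelling point concrete, here is a small base R sketch with made-up data (the `verts` data frame and its values are purely illustrative, not taken from any graph). It shows how the same membership column leads to a different model depending on whether it is numeric or a factor.

```r
# Illustrative, made-up data: 'comm' stands in for a community membership
# column that came back from the graph as plain integers.
set.seed(1)
verts <- data.frame(
  dat1 = rnorm(12),
  comm = rep(1:3, each = 4)
)

# As a number, comm is treated as a quantity and gets a single slope.
fit_numeric <- glm(dat1 ~ comm, data = verts)

# As a factor, each community gets its own effect relative to the first.
fit_factor <- glm(dat1 ~ factor(comm), data = verts)

n_numeric <- length(coef(fit_numeric))  # intercept + 1 slope
n_factor <- length(coef(fit_factor))    # intercept + 2 contrasts
```

With attributes stored as factors, the explicit `factor()` call in the formula would not be needed after exporting the vertex data.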

gaborcsardi commented 9 years ago

From @elbamos on January 6, 2015 5:56

I'm not sure I caught exactly what you meant about the implementation issues. I see where file formats are an issue, but that's not really a solvable one, and doesn't seem like a show-stopper to me. As for the other issues, I understood from the Stack Overflow discussion that igraph was simply checking variables and converting all the factors to characters. So the project seemed to be going through the code, picking all that out, and then flyspecking whatever broke.

Is it a lot more than I was thinking?

gaborcsardi commented 9 years ago

These are some good points.

What I meant by the code above is that if factors are first class data types in igraph, then there should be ways to create them. Other than graph.data.frame, which is just a special case. set.*.attribute should support factors.

Another potential error that comes to mind immediately is the name vertex attribute, that is treated specially, and I am not sure if everything works if it is a factor. Probably not.

As for representing community membership as factors, that is probably OK, because it is represented by 1:k anyway, and factor levels would match their internal representation.

In general I am a bit ambivalent about factors. They are definitely a good idea, but the way they are implemented in R, you can get some surprising behaviour out of them. E.g. the way data.frame converts strings to factors is just wrong.
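One well-known example of that surprising behaviour, sketched in base R: converting a factor of digit strings to numbers returns the internal codes rather than the values. (Since R 4.0.0, `data.frame()` no longer converts strings to factors by default, so this particular trap is less common than it was when this thread started.)

```r
# A factor built from strings that look like numbers.
f <- factor(c("10", "5", "20"))

# Levels are sorted as strings, not as numbers.
levels(f)

# as.numeric() returns the integer codes, not the original values.
as_codes <- as.numeric(f)

# Going through character recovers the values.
as_values <- as.numeric(as.character(f))
as_values
```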

In summary, I don't mind trying to

gaborcsardi commented 9 years ago

From @elbamos on January 6, 2015 6:24

I agree with you on all counts. It's easiest to just not let names be factors, I think. That is a special case, as you say. I also agree that R can sometimes be surprising about them. But once one gets used to them and their purpose, that funny variable type is really invaluable.

Thank you for your attention to this.

elbamos commented 9 years ago

I saw that you closed this... does that mean you're dropping it? Is there any way I can help?

gaborcsardi commented 9 years ago

As you can see, it is open. I just moved the R package to a separate repo.

thomasp85 commented 8 years ago

Is there any work on this? It is especially pertinent for ggraph, in terms of allowing people to order scales as they would normally do in ggplot2...

maelle commented 6 months ago

reprex from the original Stack Overflow example.

library("igraph")
#> 
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:stats':
#> 
#>     decompose, spectrum
#> The following object is masked from 'package:base':
#> 
#>     union
actors <- data.frame(
  name = c("Alice", "Bob", "Cecil", "David", "Esmeralda"),
  age = c(48, 33, 45, 34, 21),
  gender = factor(c("F", "M", "F", "M", "F"))
)
relations <- data.frame(
  from = c(
    "Bob", "Cecil", "Cecil", "David",
    "David", "Esmeralda"
  ),
  to = c("Alice", "Bob", "Alice", "Alice", "Bob", "Alice"),
  same.dept = c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE),
  friendship = c(4, 5, 5, 2, 1, 1), advice = c(4, 5, 5, 4, 2, 3)
)
g <- graph_from_data_frame(relations, directed = TRUE, vertices = actors)
g_actors <- as_data_frame(g, what = "vertices")

# Compare type of gender (before and after)
is.factor(actors$gender)
#> [1] TRUE
is.factor(g_actors$gender)
#> [1] FALSE

Created on 2024-02-26 with reprex v2.1.0
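Until factors survive the round trip, one possible workaround is to re-apply the factor columns from the original data frame after exporting. The sketch below is an assumption-laden illustration: `restore_factors()` is a hypothetical helper, not an igraph function, and it assumes the exported column holds the level labels as character values and that the rows can be matched by column name alone. The stand-in data frames mimic `actors` and `g_actors` so the code runs without igraph.

```r
# Hypothetical helper, not part of igraph: re-apply factor columns from a
# template data frame (e.g. `actors`) to a round-tripped one (e.g. `g_actors`).
restore_factors <- function(df, template) {
  for (col in intersect(names(df), names(template))) {
    if (is.factor(template[[col]])) {
      df[[col]] <- factor(
        as.character(df[[col]]),
        levels = levels(template[[col]]),
        ordered = is.ordered(template[[col]])
      )
    }
  }
  df
}

# Stand-ins for the reprex objects, so the sketch runs without igraph.
actors <- data.frame(
  name = c("Alice", "Bob", "Cecil"),
  gender = factor(c("F", "M", "F")),
  stringsAsFactors = FALSE
)
g_actors <- data.frame(
  name = actors$name,
  gender = as.character(actors$gender),  # factor lost, labels kept
  stringsAsFactors = FALSE
)

restored <- restore_factors(g_actors, actors)
is.factor(restored$gender)
levels(restored$gender)
```

If the attribute came back as integer codes rather than labels, this mapping would produce `NA` values, so the column type is worth checking first.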

krlmlr commented 5 months ago

Old implementation by @thomasp85: #193.