quanteda / quanteda.textplots

Plotting and visualisation for quanteda
GNU General Public License v3.0

Feature co-occurrence as a graph -> fcg or fcm_graph? #3

Open aourednik opened 6 years ago

aourednik commented 6 years ago

Many thanks for developing quanteda! The fcm feature is fast and very useful. This is an enhancement proposal.

Trying to apply textplot_network() or as.network() to a large fcm triggers an error: "fcm is too large for a network plot". This makes sense, since a visualization would take too many resources. But it would be nice to be able to convert the whole sparse matrix to a graph for graph-oriented processing. Since the fcm seems to be of class dgTMatrix, it should be possible to convert it to a graph with T2graph(), as documented here: https://stat.ethz.ch/R-manual/R-devel/library/Matrix/html/graph2T.html This triggers an error, though. igraph's graph_from_adjacency_matrix() does work, but this solution makes us leave the context of quanteda.

Maybe an fcg or fcm_graph object would be a useful new feature, where word co-occurrence would be a graph object instead of a sparse matrix. Features and their links could then be annotated with extra metadata. fcm_graph could also offer export functions to GML or JSON, for direct interoperability with Cytoscape, Gephi or D3. The following solution does this. Would a tighter integration of this functionality in quanteda be possible?

library("Matrix")
library("igraph")
library("data.table") # provides %chin%
# txtfcm is an fcm object produced with fcm(tokens(corpus(readtext(set_of_txt_files))))
# txtfcm.graph <- T2graph(txtfcm) # triggers an error: no slot of name "j" for this object of class "fcm"
txtfcm.graph <- graph_from_adjacency_matrix(txtfcm, weighted = TRUE) # works
V(txtfcm.graph)$freq <- rowSums(txtfcm) # gives word frequencies if the fcm is not weighted
# Examples of the possibilities opened up by a graph approach:
txtfcm.graph <- simplify(txtfcm.graph) # remove loops and duplicate edges
txtfcm.graph <- delete_edges(txtfcm.graph, which(E(txtfcm.graph)$weight < 0.5)) # delete weak co-occurrences
txtfcm.graph <- delete_vertices(txtfcm.graph, which(V(txtfcm.graph)$freq < 3)) # delete low-frequency words
# associate attributes with each vertex (words.fr.positive and words.fr.negative are character vectors of words)
V(txtfcm.graph)$negative <- V(txtfcm.graph)$name %chin% words.fr.negative
V(txtfcm.graph)$positive <- V(txtfcm.graph)$name %chin% words.fr.positive
write_graph(txtfcm.graph, "my_graph.gml", format = "gml") # export the graph to GML for use in Cytoscape or Gephi

aourednik commented 6 years ago

Here is a co-occurrence visualisation example produced with this approach. The fcm is converted to a graph, exported to GML and post-processed in Cytoscape; Gephi also imports GML. A Fruchterman-Reingold force-directed layout has been applied. Sizes show feature frequencies. (The source is a set of diplomatic reports of the Swiss embassy in Stockholm around 1930.)

[Image: my_graph_stockholm1928_1932 gml_3]

koheiw commented 6 years ago

Hi @aourednik, thank you for the suggestion and the beautiful plot. The initial version of textplot_network() was actually based on igraph, but a bug in it (which might have been fixed by now) prevented us from using it in our package (we also preferred to base our visualization functions on ggplot2). We were sure that network analysis experts like you would find out how to convert an FCM into a set of edges.

I ran your code and noticed that T2graph() works with an FCM if it is coerced to triplets by as(txtfcm, 'dgTMatrix'), but it depends on the graph package, which is not on CRAN. So we could make a thin wrapper function:

as.igraph.fcm <- function(x) {
    igraph::graph_from_adjacency_matrix(x)
}

if you think this is useful. What kind of metadata do you want to pass to an igraph object? I was thinking of adding information about overall word frequency to the FCM constructor, but I don't know what more I can do than that. We might have a lot of metadata for documents, but not for features.
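A slightly fuller version of such a wrapper might look like the sketch below. It is only an illustration: fcm_to_igraph() and its freq argument are hypothetical names, not part of quanteda, and the code assumes the fcm was built with tri = TRUE (the fcm() default), so that co-occurrence counts sit in the upper triangle only. Frequencies are passed in by the caller, here taken from a dfm of the same tokens.

```r
library("quanteda")
library("igraph")

# Hypothetical helper (not part of quanteda): convert an fcm to a weighted,
# undirected igraph graph and optionally attach feature frequencies.
fcm_to_igraph <- function(x, freq = NULL) {
  stopifnot(inherits(x, "fcm"))
  # Drop the fcm-specific slots and keep the plain sparse matrix.
  adj <- as(x, "dgCMatrix")
  # Assumption: only the upper triangle holds counts (fcm built with tri = TRUE).
  g <- graph_from_adjacency_matrix(adj, mode = "upper", weighted = TRUE, diag = FALSE)
  if (!is.null(freq)) {
    # freq is a named numeric vector; align it with the vertex order.
    V(g)$frequency <- unname(freq[V(g)$name])
  }
  g
}

# Toy example
toks <- tokens(c("a b c", "a b d", "a c d"))
g <- fcm_to_igraph(fcm(toks, context = "window", window = 2),
                   freq = colSums(dfm(toks)))
```

Co-occurrence counts end up in E(g)$weight and the supplied frequencies in V(g)$frequency, so both survive later subsetting of the graph.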

aourednik commented 6 years ago

Hi @koheiw, thank you for your answer. For feature metadata, I was thinking of frequency, and topic or sentiment based on a dictionary, as is possible for a dfm with dfm_lookup(). This would make it possible to color and size the word nodes in a visualization. Original frequencies stored as word-level metadata would also be useful when generating a distance-weighted fcm, since rowSums() of the fcm matrix then yields floating-point numbers and the original word-occurrence counts from the overall tokens object are "lost".

The idea would be to be able to do something like the code below but with less code, by having things wrapped in a fcg or fcm_graph object.

library("data.table") # as.data.table(), setkey(), %chin%
# In this example I limit the graph to 200 nodes, but it would be great if the
# graph object could contain as many nodes as the fcm can contain features.
feat <- names(topfeatures(txtfcm, 200))
topfeat <- fcm_select(txtfcm, feat, verbose = FALSE)
samplefreqs <- as.data.table(textstat_frequency(dfm(topfeat)))
setkey(samplefreqs, "feature")
# Alternative: vsize <- sqrt(rowSums(topfeat)), but when the frequencies are
# weighted by distance, this result is weird.
vsize <- sapply(rownames(topfeat), function(x) sqrt(samplefreqs[x]$frequency))
vcolor <- sapply(rownames(topfeat), function(x) if (x %chin% words.fr.joy) "red" else "black")
textplot_network(topfeat, min_freq = 0.5, vertex_color = vcolor, vertex_size = vsize / max(vsize) * 7)

In the last line, when I call textplot_network(), I basically rely on vsize, vcolor, and topfeat having the same number and order of rows. If I create a new subset with fcm_select(), I need to rerun all lines of code. Among the abilities of an fcm_graph object, I was imagining a function like fcm_select(fcm_graph, feat) that would be able to create a subset while retaining the associated word-level metadata.

The fcm_graph object could also have a write_to_gml(), write_to_d3json() and/or as.igraph() method.
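As a rough idea of what a write_to_d3json() could do once the co-occurrences are in an igraph object, here is a minimal sketch. The function body is an assumption, not an existing quanteda or igraph export: it relies on igraph::as_data_frame() and the jsonlite package, and it writes the nodes/links layout (with id, source and target fields) that D3 force-layout examples commonly use.

```r
library("igraph")
library("jsonlite")

# Hypothetical export helper: write an igraph graph as D3-style JSON
# with a "nodes" array and a "links" array.
write_to_d3json <- function(g, path) {
  parts <- igraph::as_data_frame(g, what = "both")
  nodes <- parts$vertices
  rownames(nodes) <- NULL                          # avoid an extra "_row" field
  names(nodes)[names(nodes) == "name"] <- "id"
  links <- parts$edges
  names(links)[names(links) == "from"] <- "source"
  names(links)[names(links) == "to"] <- "target"
  write_json(list(nodes = nodes, links = links), path,
             dataframe = "rows", auto_unbox = TRUE)
  invisible(path)
}

# Toy graph with vertex names and edge weights
g <- make_ring(3)
V(g)$name <- c("a", "b", "c")
E(g)$weight <- c(1, 2, 3)
json_path <- write_to_d3json(g, file.path(tempdir(), "my_graph.json"))
```

Any vertex or edge attribute present on the graph (frequency, polarity, weight) is carried along as an extra field. For GML, igraph's write_graph(g, file, format = "gml") already covers the export.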

aourednik commented 6 years ago

Another use I would see for a graph approach would be to select the co-occurrence neighborhood of specific words. For example, to select all negative words and their immediate neighbors, one needs to do something like this after conversion to igraph:

v_of_interest <- which(V(txtfcm.graph)$negative)
txtfcm.graph <- subgraph.edges(txtfcm.graph,E(txtfcm.graph)[inc(v_of_interest)])

Actually, the result is unsatisfying, the problem being in igraph not providing a function for making a coherent subgraph based on a set of nodes. make_ego_graph() could be expected to do this but it does not, as described here. Currently, only Cytoscape does this as needed.
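One combination that may come close, offered as a sketch that has not been compared against the Cytoscape result: collect the first-order neighborhoods of the words of interest with ego(), then take the subgraph induced by that set of nodes with induced_subgraph(). The toy graph and the words in it are made up for illustration.

```r
library("igraph")

# Toy graph standing in for txtfcm.graph; vertex names are made up.
g <- graph_from_literal(war - crisis, crisis - bank, bank - loan,
                        loan - trade, peace - trade)
V(g)$negative <- V(g)$name %in% c("war", "crisis")

# Vertices of interest and their first-order neighborhoods
v_of_interest <- which(V(g)$negative)
hood <- unique(unlist(ego(g, order = 1, nodes = v_of_interest)))

# Subgraph induced by that node set; vertex attributes are retained
sub <- induced_subgraph(g, hood)
V(sub)$name
```

Unlike subgraph.edges(), this keeps every edge between the selected nodes, including edges that connect two neighbors to each other.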

Something like this would be great (with imaginary functions):

# Imaginary API: fcm_graph(), neighbors() as used here, and fcm_graph_select() do not exist in this form
myfcgraph <- fcm_graph(mytokens) # pre-assigns "name" and the marginal frequency "frequency" as node-level attributes
# assuming that "polarity" is a data.table with two columns, "word" and "pol"
setkey(polarity, word)
V(myfcgraph)$polarity <- sapply(V(myfcgraph)$name, function(x) polarity[x]$pol)
v_of_interest <- myfcgraph[polarity != 0]
feat <- neighbors(v_of_interest, 1) # gets first-order neighbors, like the ego() function in igraph
myfcgraph <- fcm_graph_select(myfcgraph, feat)
textplot_network(myfcgraph, min_freq = 10, vertex_color = polarity, vertex_size = sqrt(frequency))

The code would yield something like this:

[Image: my_graph_stockholm_1928_1932_limited gml_1 1]