visualizing word context graphs

jacksonllee commented 10 years ago

When we speak of word contexts, we've been talking about Python strings such as "of very", "reasons for " as contexts for the word "the", for example. Simon has produced a very helpful tool which displays word contexts for selected words alongside a word neighbors graph.

Issue: Each word has its own very specific word contexts, and there are a lot of them. In practice, each word is associated with a very long list of context strings, and these lists are hard to compare. But the word contexts clearly overlap in part (e.g., "for very" and "of very" share the right context "very"). It might therefore be helpful to make use of this property for visualization and comparison.

What's done: I'm thinking of putting the contexts together as a directed graph for a given word. findManifold.py has been updated to output word context graphs as .gexf files using the networkx package. A sample file is 0078_may_1748.gexf (the 1748-node word context graph for the word "may", which is the 78th word in the word list); the node with the label "_" is where the word "may" goes. Another sample file is also available: 0074_could_1567.gexf
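
For anyone following along, here is a rough sketch of the idea, not the actual code in findManifold.py. It assumes each context is a string with exactly one left word and one right word (like "of very"), and it merges them into weighted edges that run through the "_" node. The function names are made up for illustration, and a word that occurs both as a left and as a right context would end up as a single node in this simplified version.

```python
from collections import Counter

WORD_SLOT = "_"  # placeholder node standing in for the target word


def build_context_edges(contexts):
    """Count directed edges left -> "_" -> right over a list of context strings."""
    edges = Counter()
    for context in contexts:
        parts = context.split()
        if len(parts) != 2:
            # Assumption: this sketch only handles one-left, one-right contexts.
            continue
        left, right = parts
        edges[(left, WORD_SLOT)] += 1
        edges[(WORD_SLOT, right)] += 1
    return edges


def write_context_gexf(edges, path):
    """Write the weighted edges to a .gexf file (requires networkx)."""
    import networkx as nx  # imported here so the counting step has no dependency

    graph = nx.DiGraph()
    for (source, target), weight in edges.items():
        graph.add_edge(source, target, weight=weight)
    nx.write_gexf(graph, path)


# Made-up contexts, only to show how shared left/right words collapse into one node.
edges = build_context_edges(["of very", "for very", "of most"])
```

Calling `write_context_gexf(edges, "example.gexf")` (a hypothetical file name) would then produce a file that Gephi can open.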

To do: Visualize word context graphs like the .gexf files mentioned above. The layouts provided by default in Gephi don't look particularly helpful. I'm currently looking through Jason's resource page for ideas. Something similar to the word tree visualization we checked out earlier might be a possibility.

sdjacobs commented 10 years ago

I'd recommend taking a look at using an adapted Sankey diagram. It's a directed graph where the width of edges corresponds to the weight of that edge, intended to show flow in a graph. If we lay out the elements sensibly, it could be an informative way to present the word contexts.

Here is a very simple D3 example, based on context data that I made up. The diagram is intended to show the (fabricated) contexts of "house" and "cat" (data).

To make this workable for large datasets, we would need to rework the layout algorithm. Ideally, words would be aligned in the center, with their left contexts on the left and their right contexts on the right. The D3 layout algorithm also times out for large data, but this shouldn't be an issue for us: the existing algorithm computes every node position from scratch, whereas we already know where we want things to go.
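
To make the "we know where we want things" point concrete, here is a minimal sketch of a fixed three-column layout, written in Python only for illustration (it is not the D3 code from the demo). It assumes we already have a weight for each left and right context of a word; the function name, the column indices 0/1/2, and the `height` parameter are all made up. Each column is sorted by weight and spaced evenly, so nothing has to be solved iteratively.

```python
def fixed_column_layout(left_weights, right_weights, word, height=400.0):
    """Place left contexts at x=0, the word at x=1, and right contexts at x=2."""

    def column(weights, x):
        # Heaviest contexts first; ties broken alphabetically for a stable order.
        ordered = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
        step = height / (len(ordered) + 1)
        return {name: (x, step * (i + 1)) for i, (name, _) in enumerate(ordered)}

    nodes = {
        "left": column(left_weights, 0),
        "word": {word: (1, height / 2)},
        "right": column(right_weights, 2),
    }
    # Link "value" plays the role of the edge width in a Sankey-style diagram.
    links = [
        {"source": ("left", name), "target": ("word", word), "value": weight}
        for name, weight in left_weights.items()
    ]
    links += [
        {"source": ("word", word), "target": ("right", name), "value": weight}
        for name, weight in right_weights.items()
    ]
    return nodes, links


# Made-up weights, in the same spirit as the fabricated demo data.
nodes, links = fixed_column_layout(
    {"of": 2, "for": 1}, {"very": 2, "most": 1}, "the", height=300.0
)
```

Since every position comes from one sort per column, the cost grows with the number of contexts rather than with repeated relaxation passes, which is why the timeout should not carry over.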

sdjacobs commented 10 years ago

Take a look at this new demo. It uses the english-brown.json file based on data you sent me a few weeks ago. By default, it displays the contexts of the word "oh", but you can use request parameters in the URL to look at different words -- for example, this.

What do you think? I found that the visualization gets too cluttered for words that have lots of contexts, but if this seems like the right direction to go in, we can brainstorm some ways to clean it up.

rcc-uchicago / ling-viz

visualizing word context graphs #6