sfu-natlang / lensingwikipedia

Lensing Wikipedia is an interface to visually browse through human history as represented in Wikipedia. This is the source code that runs the website:
http://lensingwikipedia.cs.sfu.ca

Xkcd-style plot flow diagrams #111

Closed theq629 closed 9 years ago

theq629 commented 10 years ago

(Paraphrased from emails by @anoopsarkar:) As a new view on the frontend, use the entities from the current search to produce a visualization like the ones at http://csclub.uwaterloo.ca/~n2iskand/?page_id=13, which is based on xkcd #657. The x-axis is like in the current timeline plot but may need more scrolling to see detail. The y-axis groups entities (especially persons, but we could also use organizations, etc.) into clusters based on co-occurrence in the same events at any given year, and is sorted by frequency. There is a d3 plugin for Sankey diagrams that should work. We can optionally have the thickness of the flow lines represent frequency, but the initial goal should just be to replicate the Waterloo visualization.

theq629 commented 10 years ago

On Sankey diagrams, in my understanding the main purpose is to represent proportional flow, so I suspect that it isn't really the best term here but I'm not sure what the better term would be. Regardless, it looks to me like the d3 plugin will work fine. At some point we might need to work out how to draw nicer line start and end points.

On how to do the processing, the plan is for the backend to produce per-entity timelines indicating which reference points or other event clusters the entity is in at each year. The frontend then further processes this (if needed; eg to produce more specific clusters for the plot) and draws the plot. We don't want to send a lot of event data to the frontend for bandwidth reasons, but it may be worth sending more than the minimal amount for the diagram so that we have some flexibility to adjust diagrams on the frontend without having to change backend code.
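
A minimal sketch of what such per-entity timelines could look like. It is written in Python for brevity, and the event keys (`year`, `cluster`, `entities`) and the cluster ids are hypothetical, not the project's actual schema:

```python
from collections import defaultdict

def build_entity_timelines(events, entities):
    """Map each requested entity to {year: sorted list of cluster ids}.

    `events` is an iterable of dicts with the hypothetical keys
    'year', 'cluster' and 'entities'; no other event data is passed on.
    """
    wanted = set(entities)
    timelines = {entity: defaultdict(set) for entity in entities}
    for event in events:
        for entity in wanted.intersection(event["entities"]):
            timelines[entity][event["year"]].add(event["cluster"])
    # Convert to plain, sorted structures that are easy to serialize.
    return {
        entity: {year: sorted(clusters) for year, clusters in sorted(years.items())}
        for entity, years in timelines.items()
    }

events = [
    {"year": -218, "cluster": "A", "entities": ["person:Hannibal"]},
    {"year": -218, "cluster": "B", "entities": ["person:Hannibal", "person:Scipio Africanus"]},
    {"year": -202, "cluster": "C", "entities": ["person:Hannibal", "person:Scipio Africanus"]},
]
timelines = build_entity_timelines(events, ["person:Hannibal", "person:Scipio Africanus"])
```

Sending only (entity, year, cluster ids) keeps the payload small while still leaving the frontend free to regroup clusters.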

theq629 commented 10 years ago

I've put up early work in the xkcdplottimelines branch and live at http://champ.cs.sfu.ca/WikiHistory/latest/whoosh/wikipediahistoryxkcdplottimelines/. Currently it has a very basic interface and I haven't tried to optimize it at all, so it's pretty slow. I'm also not trying to fix clusters' positions on the y-axis yet, just letting the d3 sankey plugin put them where it wants.

The three input areas at the top are starting year, ending year, and entities list. The entities list is comma separated and each item has the format field:value. Mouse over things or otherwise look at the SVG titles to find out what they actually are.
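
A sketch of parsing that entities list, in Python for brevity. The function name is hypothetical, and trimming whitespace around items is an assumption rather than confirmed behaviour:

```python
def parse_entities(text):
    """Split 'field:value, field:value' into (field, value) pairs."""
    pairs = []
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty items such as a trailing comma
        # Split on the first colon only, so values may contain colons.
        field, sep, value = item.partition(":")
        if not sep:
            raise ValueError(f"expected field:value, got {item!r}")
        pairs.append((field.strip(), value.strip()))
    return pairs

pairs = parse_entities(
    "person:Hannibal, person:Scipio Africanus, person: Antiochus III the Great"
)
```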

The plot definitely needs some improvements. Unless you like the current look with entity lines spread out so much on the cluster nodes, I'll probably need to edit the plugin or just do the diagram manually. I already have to ignore the plugin's x-axis placement choices and we'll probably want to do the same on the y-axis, likely messing up its layout.

Additionally, assigning links with branching entity lines needs work. Currently it always looks one time step back when making links, always making a direct link when the entity stays in the same cluster and otherwise adding a link from an arbitrary cluster. This can produce jumpy results (eg for person:Hannibal around 205-210BC), but I'm not sure that a better algorithm is totally straightforward.
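
A rough Python sketch of the one-step-back linking described above. The data layout is hypothetical, and "arbitrary" is modelled here as "first cluster in sort order", which is an assumption:

```python
def choose_links(timeline):
    """timeline: {year: iterable of cluster ids} for one entity.

    Returns a list of ((prev_year, prev_cluster), (year, cluster)) links.
    """
    links = []
    years = sorted(timeline)
    for prev_year, year in zip(years, years[1:]):
        prev_clusters = sorted(timeline[prev_year])
        if not prev_clusters:
            continue  # nothing to link from at the previous time step
        for cluster in sorted(timeline[year]):
            if cluster in prev_clusters:
                source = cluster           # entity stayed put: direct link
            else:
                source = prev_clusters[0]  # arbitrary choice of source
            links.append(((prev_year, source), (year, cluster)))
    return links

links = choose_links({10: {"A", "B"}, 20: {"B", "C"}})
```

Because the source is picked without regard to similarity, a line can jump between unrelated clusters, and a cluster at the earlier time step can be left without an outgoing link.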

anoopsarkar commented 10 years ago

Looks good for a first step. Is it possible to make the timeline longer than the window size and scroll left and right to see more of the timeline?

theq629 commented 10 years ago

Yes, I can add a scroll bar. And eventually I'd like to do something with a brush select for zooming and limiting the range (like for the regular timeline zoom), but I'd like to wait until the basic plot is more settled first.

theq629 commented 10 years ago

Here are some things I would like to clarify:

  1. How to distinguish clusters. If I understand what you said previously in email, you were wanting to fix the position for each cluster on the y-axis, so all the nodes for that cluster are at the same vertical position. Is that right? If we want to have the distinct clusters clear on the diagram, then the only alternative I can think of would be to colour code them like in the sankey plugin example. But I think using colours to distinguish entities as well would be too confusing then, so we'd need to use distinct line styles on the links or just distinguish entities with graph structure.
  2. I think it would be clearer and tidier if all the links for a single entity going into the same node went to the same spot on the node rectangle, not spread out like they are now.
  3. How to choose links. If, say, we have an entity that is in cluster A at 10 CE and cluster B at 20 CE, then we just need an A-B link for these years. But say it is in clusters A and B at 10 CE and then clusters B and C at 20 CE. I think it's clear that there should be a B-B link. I think it's best to keep all parts of the graph for the entity connected, so we also need to connect 10 CE A to something at 20 CE, and there is no need to connect it to more than one 20 CE node. Similarly 20 CE C needs to connect to something at 10 CE. Right now I just choose these links arbitrarily, and that could be improved with a similarity metric for the clusters, presumably geographic distance for the current clusters.

anoopsarkar commented 10 years ago

I don't see the different colors for the different person entities. All the mouseovers say "Hannibal" for me.

About each point above:

  1. I think the vertical placement of cluster nodes is less of an issue as long as they are spaced out. The d3 demo of Sankey allows the user to drag these nodes around on the y-axis to get a better view. It's ok if the lines get a bit cluttered since we can use a mouseover on the line to see the entity.
  2. The uwaterloo link up above has a css style that might solve this problem perhaps?
  3. I think the best thing to do here is to merge the clusters until there is only one cluster per entity for each year. So in your example, -----(A,B)@10CE-----(B,C)@20CE------. If there are two entities, say entity X and entity Y. Entity X is in cluster (A,B) in 10CE and Y was in cluster (B,C) in 10CE. And then X is in (D) in 20CE and Y is in (D,E) in 20CE then we would merge them all to get =====(A,B,C)@10CE======(D,E)@20CE====== where === represents the two lines for X and Y. Does that work?
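
A Python sketch of the merging proposed in point 3, under the assumption that clusters are merged whenever entities' cluster sets overlap within a year, directly or transitively. Function and variable names are hypothetical:

```python
def merge_clusters_for_year(entity_clusters):
    """entity_clusters: {entity: set of cluster ids} for a single year.

    Returns {entity: frozenset of merged cluster ids}. Entities whose
    cluster sets overlap end up sharing one merged node.
    """
    groups = []  # (cluster set, entity set) pairs with disjoint cluster sets
    for entity, clusters in entity_clusters.items():
        merged_clusters, merged_entities = set(clusters), {entity}
        remaining = []
        for group_clusters, group_entities in groups:
            if group_clusters & merged_clusters:
                # Overlap: absorb the existing group into the new one.
                merged_clusters |= group_clusters
                merged_entities |= group_entities
            else:
                remaining.append((group_clusters, group_entities))
        groups = remaining + [(merged_clusters, merged_entities)]
    return {
        entity: frozenset(group_clusters)
        for group_clusters, group_entities in groups
        for entity in group_entities
    }

year_10 = merge_clusters_for_year({"X": {"A", "B"}, "Y": {"B", "C"}})
year_20 = merge_clusters_for_year({"X": {"D"}, "Y": {"D", "E"}})
```

With the X and Y example from point 3, both entities land in (A,B,C) at 10 CE and in (D,E) at 20 CE, so each entity has exactly one node per year.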

anoopsarkar commented 10 years ago

By the way, if we do want to cluster points differently we should consider DBSCAN:

theq629 commented 10 years ago

This is how the diagram looks for me: the green (?) lines are person:Hannibal and the blue (?) lines are person:Scipio Africanus.

(Screenshot: xkcdplottimelinesscreenshot1)

anoopsarkar commented 10 years ago

Ah, I see. I didn't expect Scipio to be only at the end and I was confused by two different lines for the same person initially. I think I said something differently earlier on, but I think one line per entity seems the least confusing. We would have to merge clusters to make this happen.

anoopsarkar commented 10 years ago

It would also help to highlight the entire line for each entity on mouseover (as in other Sankey demos).

theq629 commented 10 years ago

Yes, I'll change it to highlight the whole line for an entity. The confusingness of branching entity lines is part of why I'd like it to put the endpoints for each entity together on the node. But as you say, this will be much less confusing if there is one node per entity. I think your description for (3) will work, although we may have to see how much the initial clusters overlap in practice.

theq629 commented 10 years ago

For the rest, to explain how the sankey plugin works: the plugin does layout to produce positions and sizes for the nodes and end points and sizes for the links. The associated example code then draws rectangles and curves accordingly, and I modified it a bit to get the current plot. The layout part is allocating space for much wider flow-proportional link lines, since it seems to scale them to the available screen space. As far as I can tell there isn't any way to disable that (and it's not just a CSS issue). Similarly I'm having to bump the nodes to the correct x-axis positions after layout since I don't see any way to constrain the layout positions, and that's probably part of why the layout is poor.

So I think probably we should either give up on using the sankey plugin or heavily modify it. If we do want to fix cluster y-positions across time then I think implementing custom layout won't be hard. Otherwise we can probably use one of the more general d3 graph layout implementations.

anoopsarkar commented 10 years ago

Have a look at what they say at the bottom of http://csclub.uwaterloo.ca/~n2iskand/?page_id=13

Within the y-range of a cluster, each character that appears in a scene whose median cluster is that cluster gets a unique y-position. Now, there are several ways to go about determining the exact position of the node within the cluster’s y-range. We could average the cluster-positions of the characters appearing in it (i.e. the positions of the characters within that cluster), or apply the heuristic we initially used to determine the y-range on a smaller scale. Both of these ideas resulted in ugly cluttering, which in retrospect should’ve been expected– what’s common to all the scenes placed in cluster x’s range is that they’re all dominated by characters from cluster x. Therefore, when we take into account the within-cluster positions of every character in the scene, they all end up getting placed at approximately the same y-position, causing ugliness. Currently, we’re averaging the positions of all the non-x-cluster characters in the scene if any exist, and of the x-cluster characters otherwise.

anoopsarkar commented 10 years ago

They say "heuristic" in the previous link, but I wonder if they fiddled with the placement by hand.

anoopsarkar commented 10 years ago

The layout looks a lot nicer when the year constraint -210 to -180 is added. Makes me think that scrolling left or right might do the trick when it comes to the crowded feel of the zoomed-out view. Although having one line per entity will also help.

theq629 commented 10 years ago

If I understand the uwaterloo system correctly, they are first doing a character clustering based on global scene co-occurrences and then placing scene nodes within y-axis bands for each character cluster. So when they say the "y-range of a cluster", that's referring to a grouping of characters that we don't currently have any analog to. Should we be doing a similar DBSCAN step? An alternative in our case (as long as we continue using the geographic clusters as the underlying data) is to try to make the y-axis roughly represent geographic distance, perhaps projecting the clusters' geo-points onto a most-separating axis and then respacing to look better. I'm not totally sure that we should expect their procedure to work for us, since the case of an entity being in multiple clusters at once must occur a lot more in our data; however, that may not matter if we are merging clusters.

Everything in section 3 of the uwaterloo page is done by their javascript (http://csclub.uwaterloo.ca/~n2iskand/comics/narrative/narrative.js), but unfortunately it does not appear to be licensed.

theq629 commented 10 years ago

Oh, and additionally I think we probably don't want to be doing much processing for this on the backend unless it can be done globally for the whole data set, so if we do DBSCAN or anything we need to do it on the frontend in javascript. The uwaterloo implementation seems to do that in the python (the input to the javascript is eg http://csclub.uwaterloo.ca/~n2iskand/comics/narrative/luckyluke6_narrative/narrative.json and http://csclub.uwaterloo.ca/~n2iskand/comics/narrative/luckyluke6_narrative/characters.xml).

anoopsarkar commented 10 years ago

I don't understand why they have a character cluster if they also have a spatial localization in a frame of the comic. Perhaps they are clustering frames of the comic which in our case would be to group together our spatial clusters. So our plan seems to be fairly reasonable.

I agree we should do the clustering before we launch the backend (if needed, looks like we can reuse our existing clusters).

theq629 commented 10 years ago

If I'm understanding correctly, then they are making global character clusters based on co-occurrence across all scenes, and then when they draw the plot they first assign these character clusters positions on the y-axis (trying to put large clusters far apart). Then they produce a Sankey node for each scene (corresponding to our geographic-cluster@year nodes) and position each scene by the y-position for the character cluster that's best represented in the scene. Then finally they add lines for each character, ordering the line positions within each scene node according to some procedure I'm not understanding.

So if that's right then they aren't directly clustering frames or scenes (I assume scenes are sequences of frames determined somehow), but scenes are assigned to the best matching character cluster, producing a sort of clustering of scenes.

In our case, we could definitely do some comparable fixed co-occurrence based clustering before starting the backend. However, it's possible that doing it on the frontend based on what entities are selected for a particular plot would give more relevant results.

Also note that while they do have localization by frames or scenes, it's not exactly spatial in the same way that our geographic clusters are in that it doesn't give any distance metric unless they've added that by hand.

anoopsarkar commented 10 years ago

Greedily grouping our existing geographical clusters might produce the same effect for us. So perhaps we can try that first. We would still need to do placement, but we can use their heuristic for this?

anoopsarkar commented 10 years ago

The geographical clustering could give us the character groups we need? Can you work out an example to see if this works?

theq629 commented 10 years ago

How does the geographical clustering give us character groups? (I mean, I think what you already said about merging clusters will probably work fine to make a working plot, I'm just not sure if it's the same as having global character/entity groups.)

anoopsarkar commented 10 years ago

A global character group for them seems to be simply to select where they start on the left hand side. They enter and leave different groups on the y-axis anyway (e.g. Ma Dalton is in group 0 with 2/3 others, but the line curves around to enter and leave different groups/clusters in the plot).

I think we will get big clumps as we keep merging our geoclusters to form groups at each timeline point. Allowing the user to rearrange the nodes might be sufficient to make things less cluttered (or at least take us off the hook).

theq629 commented 10 years ago

Ok, in that case I'll try merging geoclusters first.

anoopsarkar commented 10 years ago

Are we stuck on this issue?

theq629 commented 10 years ago

Yes, I haven't been feeling very well so I still haven't got the new clusters working yet.

anoopsarkar commented 10 years ago

Oh, OK. Feel better. I wanted to know if it was a conceptual issue.

theq629 commented 10 years ago

Ok, cluster merging is in. I also made the backend handling for it a fair bit faster, and made the frontend highlight the whole entity line on mouseover, along with changing the node style slightly to make that more visible.

I haven't played too much with it yet, but you can try eg "person:Hannibal, person:Scipio Africanus, person: Antiochus III the Great, person:Philip V of Macedon" for a more complex plot than before and "person:Wang Mang, person:Julius Caesar" to see what happens with totally non-overlapping geographic clusters.

With cluster merging I think the main issue is that entity lines enter and leave the cluster nodes at totally different places. Again I don't think that is changeable with the Sankey plugin, so if you think the merging looks ok then I think the next step could be to work on a new drawing method.

theq629 commented 10 years ago

I didn't do the scrollbar yet. After thinking about it again, I was thinking it might be better to use a double-view brush interface like for the timeline (which includes dragging like a scrollbar). How does that sound?

anoopsarkar commented 10 years ago

I like it, except for the entry and exit being at different places as you mentioned. Is that what you wanted to fix with the new drawing method?

I think a double-view interface like the timeline would be good.

A few other things:

  1. call the view something other than Plot. Perhaps Flow?
  2. can we click on the cluster group to select it?
  3. can we initialize the view to the people in the current Person constraint? In this way, the search term, e.g. "person:Wang Mang, person:Julius Caesar", should go into the Textsearch box instead of the Plot window.

theq629 commented 10 years ago

Yes, that was the main thing I wanted to fix with a different drawing method. I also don't really like the wide node boxes right now. I think I'll look at other d3 graph layout options and if nothing is suitable then try to adapt the uwaterloo method.

  1. Sure, that seems better.
  2. Yes, but what should happen when it's selected?
  3. Yes, but in the long run do we also want to support other types of entities (eg organizations or countries)?

anoopsarkar commented 10 years ago

  1. OK
  2. It would be like a geographic cluster selection. It should show up as "3 markers" or whatever in the list of constraints and the map view should be updated with the selection as well. Does that make sense?
  3. Yes, we could provide a dropdown list of choices based on the primary list facet views (maybe all of them to keep things general).

One more thing, when I do a textsearch for "person:Wang Mang, person:Julius Caesar" it does not accept this as a valid search.

anoopsarkar commented 10 years ago

For point 3 above, first let us get things working with just the Person facet.

anoopsarkar commented 10 years ago

This works in textsearch: person:(Scipio Africanus) OR person:(Hannibal)

Is this the semantics of: person:Scipio Africanus, person:Hannibal?

theq629 commented 10 years ago

For 2, do you mean clicking on the node boxes, or the lines? Although I think either is possible.

At some point we should also settle on better internal terms, as we currently tend to say 'cluster' for several things.

The plot text box doesn't use the real search interface, it just splits on commas and then colons to get field-value pairs. So those two examples should make the backend look at the same index items. I think the full text search format doesn't work for this case since we have to identify field-value pairs to use as entities. Now that I think about it that also makes it difficult to use the query from the text search tab, but the query from the person facet is easy to use.

I haven't been thinking of the plot text box as the real interface, by the way, it was just the fastest thing to get working.

anoopsarkar commented 10 years ago

For 2, I mean clicking on the node boxes.

Yes, we should settle on some terms for each part of the system/interface.

I like that when we click on Hannibal in the Person facet we get a good set of other people to show in the Sankey diagram. We can truncate to the first view as we do in the facet listing.

theq629 commented 10 years ago

I came up with an initial custom layout method. It basically does an unprincipled iterative relaxation to position nodes, with the constraint that all x-positions are fixed, and then greedily assigns line positions (in each node) from the left. I've temporarily added the number of relaxation iterations as a text field in the interface.

I also renamed the tab to "Flow" as suggested above.

The specific algorithm is:

- make a node for each (merged) cluster, and a line for each entity
- for each year, set node y positions to be spread out evenly
- for some number of relaxation iterations:
    - move each node to the weighted average of the y positions of its neighbours
    - bump the nodes around to not overlap (plus a spacing gap)
    - bound the nodes to the visible area
- for each node by increasing year:
    - give the entity lines going through the node y positions based on the y positions of the previous node on each line

The weighted y position average uses weights based on the number of shared entity lines and the time distance, specifically (num shared entities) / (time distance).
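
A compact Python sketch of the node-relaxation part of the steps above (line position assignment is left out). Several details are assumptions: a neighbour is taken to be any node in a different year that shares at least one entity, overlapping nodes are only ever bumped downward, and the parameter names and defaults are hypothetical:

```python
from collections import defaultdict

def layout(nodes, height=100.0, gap=10.0, iterations=20):
    """nodes: list of {'year': int, 'entities': set}; sets node['y'] in place."""
    by_year = defaultdict(list)
    for node in nodes:
        by_year[node["year"]].append(node)
    # Initial placement: spread each year's nodes evenly over the height.
    for column in by_year.values():
        for i, node in enumerate(column):
            node["y"] = height * (i + 1) / (len(column) + 1)
    for _ in range(iterations):
        # Move each node to the weighted average of its neighbours' y.
        targets = []
        for node in nodes:
            total = weight_sum = 0.0
            for other in nodes:
                shared = len(node["entities"] & other["entities"])
                dt = abs(node["year"] - other["year"])
                if shared and dt:
                    weight = shared / dt  # (num shared entities) / (time distance)
                    total += weight * other["y"]
                    weight_sum += weight
            targets.append(total / weight_sum if weight_sum else node["y"])
        for node, y in zip(nodes, targets):
            node["y"] = y
        for column in by_year.values():
            column.sort(key=lambda n: n["y"])
            # Bump nodes down so each keeps a gap from the one above it.
            for upper, lower in zip(column, column[1:]):
                lower["y"] = max(lower["y"], upper["y"] + gap)
            # Bound to the visible area by shifting the whole column up.
            shift = max(0.0, column[-1]["y"] - height)
            for node in column:
                node["y"] = max(node["y"] - shift, 0.0)
    return nodes

nodes = layout([
    {"year": -220, "entities": {"Hannibal"}},
    {"year": -220, "entities": {"Philip V"}},
    {"year": -215, "entities": {"Hannibal", "Philip V", "Scipio"}},
    {"year": -210, "entities": {"Scipio"}},
])
```

Since x-positions are fixed by year, only `y` is relaxed. Bumping in one direction makes the layout drift downward over iterations, which is one reason a real implementation would likely want a more symmetric overlap resolution.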

theq629 commented 10 years ago

So far this looks nicer to me, but it definitely has some places that could be improved. With the current default query (person:Hannibal, person:Scipio Africanus, person: Antiochus III the Great, person:Philip V of Macedon, person:Qin Shi Huang) the lines around the 215BCE three-entity node could be straighter, and the short line on the right for person:Scipio Africanus is obscured.

I suspect it would help to move the line y position assignment step inside the relaxation loop and do the node position updates based on individual connecting lines. Eg the nodes on either side of the 215BCE node are higher than they should be since they are aligned to the position of the whole 215BCE node rather than the two shared lines. But I'm not sure yet if redoing the line positions each iteration is too much work inside the loop.

Another interesting point is the way the upper 187BCE node gets pushed up. That probably indicates that I have the node gap distance set too high, but I think also suggests that maybe other factors could be considered in weighting neighbors for the position average.

Screenshot for future reference: xkcdplottimelinesscreenshot2

anoopsarkar commented 10 years ago

Looks good. I was a bit confused about the colours: two different lines seemed to me to be the same colour, but probably they are not? We need some method to scale to a larger number of entities. For now (and possibly this is a good idea anyway, as seen in the xkcd plot) we should add the names for each line explicitly instead of relying on mouseover. Also, I wonder if there is a way to label the geoclusters? Perhaps with the most frequent location name within that cluster?

(Sent by email in reply to Max Whitney's comment of Thursday, July 10, 2014.)

theq629 commented 10 years ago

I don't really understand the colours either, I use d3.scale.category10() keyed on the index of the entity so I thought they would come out more distinct. I can just switch that for any other colour set, though.

We can definitely add the names for lines. I'm not sure if it will work visually to put all the cluster labels visible without mouse-over, but we could try coming up with better names. We might also be able to group nodes horizontally into larger boxes, add visible labels on those, and let the user mouse-over for specifics.

Note that we can already cluster on the location field instead of the reference points (geo clusters in the database). I just tried that (http://champ.cs.sfu.ca/WikiHistory/latest/whoosh/wikipediahistoryxkcdplottimelinesonlocation/), and it seems to produce similar but not identical results (in particular, it lacks the three-way overlap at 215BCE on the default query).

theq629 commented 10 years ago

One more note on the layout: the 185BCE node for the default query, and the Hannibal - Philip V of Macedon overlap at 220BCE on the location clusters version are confusing since a line goes through a node it isn't actually part of.

anoopsarkar commented 10 years ago

Hmm. How do we make sure that doesn't happen? Maybe a change of color or lines that loop over like in a circuit diagram.

(Sent by email in reply to Max Whitney's comment of Jul 11, 2014, 4:26 PM.)

anoopsarkar commented 10 years ago

Clustering on the location field looks good to me. Let us see how it scales and keep it as the default if it works well. Certainly looks more appealing at this point.

(Sent by email in reply to Max Whitney's comment of Jul 11, 2014, 4:23 PM.)

anoopsarkar commented 10 years ago

any progress on this issue?

anoopsarkar commented 10 years ago

Noticed a small bug just now. If you expand the browser window to the full width of a 24 inch display, the canvas becomes squished down vertically (i.e. it scales only in the x-axis after a certain point instead of scaling in both axes).

theq629 commented 10 years ago

Sorry, I've just been busy with school and not feeling well again. I'm going to try to catch up later this week.

On lines overlapping node boxes: a change of colour or loop would be good, but I'm not sure how to detect the overlap in the first place right now. However, I think I can extend the layout algorithm to place all lines to avoid node boxes.

On the plan to use the values of the constraints for the person facet as entities: I was thinking about this again, and it is a bit strange since in every other view those constraints are taken as a conjunction, but here they will be taken as a disjunction (I mean, if we used them like we do in other views then we'd only have nodes that contain all the entities). But it still seems like the best starting point for the real interface.
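
To make the conjunction/disjunction difference concrete, here is a small Python sketch with hypothetical helper names and made-up events:

```python
def matches_all(event_entities, selected):
    """Conjunction: the event mentions every selected entity."""
    return set(selected) <= set(event_entities)

def matches_any(event_entities, selected):
    """Disjunction: the event mentions at least one selected entity."""
    return bool(set(selected) & set(event_entities))

selected = ["Hannibal", "Scipio Africanus"]
events = [
    {"Hannibal"},
    {"Hannibal", "Scipio Africanus"},
    {"Philip V of Macedon"},
]
conjunction = [e for e in events if matches_all(e, selected)]
disjunction = [e for e in events if matches_any(e, selected)]
```

Under the conjunction only the shared event survives, so every node would contain all the entities. The flow view needs the disjunction to show where lines run separately.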

anoopsarkar commented 9 years ago

This flow view can be different in that it does not have a default view. It can say: select an entity.

Once an entity is selected the current facet browser shows entities that can be added as a conjunction, so we can show all of them in the flow view.

However, when we select a location entity or a map marker then what happens? I'm not sure. Perhaps it just remains at the default empty view.

(Sent by email in reply to Max Whitney's comment of Jul 27, 2014, 6:02 PM.)

anoopsarkar commented 9 years ago

Here's a way to have an "overview" default view and how to add constraints for the Flow tab:

  1. Have a default facet, say "Person" -- as will become clear below, we do not need to include a facet selector.
  2. Pick the most frequent entity in that facet, e.g. "Augustus" is the most frequent person in the data in the current state.
  3. Pick all other entities in the first page of entities that are currently reported by the backend that are in the same facet, e.g. for "Augustus" it will be the list:
     Mark Antony [49]
     Julius Caesar [34]
     Cleopatra [24]
     Tiberius [24]
     and so on ...
     
  4. Provide a dialog in the Flow tab to ask the user to add "Augustus" as a constraint in the default view. If the user selects a different constraint then the Flow tab silently switches to that constraint.
  5. Any selection from the interface is added as a new line in the Flow, along with all the entities (in the first page view) that are shown in that facet. e.g. if the user selects "Augustus" and then selects "Sparta", then "Sparta" and all the entities that appear in that facet under the current constraints are also added, which would add, in addition to "Sparta", the following locations:
    Adriatic Sea [15]
    Philippi [15]
    Thasos [15]
    Attica [4]
    Argyroupoli [4]
    Kydonia [4]
    Crete [4]
    

I think this will work in full generality, but I may have missed a corner case (or many).
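
A Python sketch of steps 2 and 3 above. It assumes that the "first page" of a facet is simply the most frequently co-occurring entities under the current constraint; the function name and `page_size` are hypothetical:

```python
from collections import Counter

def default_flow_entities(events, page_size=5):
    """events: list of sets of entity names from one facet (e.g. Person).

    Returns the most frequent entity followed by the entities that most
    often co-occur with it, up to `page_size` of them.
    """
    frequency = Counter(entity for event in events for entity in event)
    if not frequency:
        return []
    top, _ = frequency.most_common(1)[0]
    co_occurring = Counter(
        entity
        for event in events
        if top in event
        for entity in event
        if entity != top
    )
    return [top] + [entity for entity, _ in co_occurring.most_common(page_size)]

events = [
    {"Augustus", "Mark Antony"},
    {"Augustus", "Mark Antony"},
    {"Augustus", "Julius Caesar"},
    {"Hannibal"},
]
default_entities = default_flow_entities(events)
```

In the real system the counts would presumably come from the backend's facet queries rather than from raw events on the frontend.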

theq629 commented 9 years ago

After a lot of fiddling I have a method that should prevent lines overlapping nodes they don't participate in. It may need some more work when we find out how it scales, but to me it looks pretty good for now.

There are a couple of bugs (eg put "person:Qin Shi Huang" back in):

The first and third bugs are probably just missing constraints at certain places in the relaxation.

I've also added line labels and the brush zooming interface.

(Screenshot: xkcdplottimelinesscreenshot5)

theq629 commented 9 years ago

That interface for the default view sounds like it will work, although I haven't thought through the corner cases yet either. I think we may want to consider keeping something like the current interface as an alternate option for quick searches.

I wonder if we should consider changing the timeline view so that adding constraints there is more explicit (rather than part of narrowing the view). If we did that then the facet view would become the special case where all view changes involve adding constraints, which I think might be more clear.