sfu-natlang / lensingwikipedia

Lensing Wikipedia is an interface to visually browse through human history as represented in Wikipedia. This the source code that runs the website:
http://lensingwikipedia.cs.sfu.ca
Other
11 stars 4 forks source link

Xkcd-style plot flow diagrams #111

Closed theq629 closed 9 years ago

theq629 commented 10 years ago

(Paraphrased from emails by @anoopsarkar:) As a new view on the frontend, use the entities from the current search to produce a visualization like the ones at http://csclub.uwaterloo.ca/~n2iskand/?page_id=13, which is based on xkcd #657. The x-axis is like in the current timeline plot but may need more scrolling see detail. The y-axis groups entities (especially persons, but we could also use organizations, etc.) into clusters based on co-occurrence in the same events at any given year, and is sorted by frequency. There is a d3 plugin for Sankey diagrams that should work. We can optionally have the thickness of the flow lines represent frequency, but the initial goal should just be to replicate the Waterloo visualization.

anoopsarkar commented 10 years ago

I was at Canvas 2014 last week and I was informed that this visualization is called a Storygraph nowadays and there are two papers on the topic:

The Tanahashi paper seems to provide a detailed layout algorithm that we can use instead of the d3 Sankey diagram.

theq629 commented 10 years ago

It looks to me like the papers are these two:

At a quick look both of these seem to produce much nicer results, using different optimization methods. As far as I can tell these isn't code available for either method, although they are hard to search for since both "storygraphs" and "storyflow" turn up other things. Do you want me to look into implementing them (which may take a while), or continue with what we have for now?

anoopsarkar commented 10 years ago

Instead of reinventing stuff, why not use the Ogawa algorithm for creating SVGs that is available as open source in this project called evolines

theq629 commented 10 years ago

The algorithm would also work, but the code seems to be in Processing and relies on graph layout code from Java Jung.

anoopsarkar commented 10 years ago

Ah, I should've checked that first. But perhaps we can use Ogawa's algorithm which seems to be greedy. Hopefully having the source code might help. One thing I am still not clear about is the placement of entities on the y-axis.

theq629 commented 10 years ago

Yes, at a first look it seems like the algorithm would be easier. I'll go over it in more detail tomorrow.

theq629 commented 10 years ago

The paper for the Ogawa algorithm is this one:

After looking at this more it seems like the algorithm in the paper is simple because they don't specify any details (since the paper focus on the visual aspects instead). The algorithm is greedy going through the timesteps and each cluster at each timestep, but with no optimization step. The cluster y-positions are "Based on already-placed developer[ie entity] tube" y-positions per cluster, which I think (after looking at the code) means the position from previous timesteps that have already been covered. There are few other specifics on how to place clusters.

I don't understand the code too well yet, but it doesn't look too complicated (I think the important parts are in Cluster.pde and TubeBus.pde, about 500 lines total). Note that the code is licenced under the GPL, and since many parts are not covered in the paper, anything we implement based on the code will presumably need to be GPL too.

I also looked at the other papers more. Tanahashi and Ma use genetic algorithms. The performance doesn't sound to scale very well, but I'm not to familiar with GA approaches so I'm not sure how it compares for practical purposes. Liu et al. use linear programing (with Mosek, which appears to be a Python LP library). We'd need a Javascript LP solver for this algorithm. Both approaches have a lot of extra details, and it's hard to tell which are important visually.

theq629 commented 10 years ago

We could also just start from Ogawa paper algorithm and make up our own heuristics for the unspecified steps.

A possible intermediate possibility for now is to adapt what I have already to the Ogawa paper, either using some of the structure of the algorithm or just trying to use their conclusions about visual appearance.

I've added more entities to the default visualization on the live demos, and I've added http://champ.cs.sfu.ca/WikiHistory/latest/whoosh/wikipediahistoryxkcdplottimelinescloselines/ with spacing more like Ogawa; it keeps lines in the same cluster close together, although right now it doesn't always properly keep other lines far enough away.

anoopsarkar commented 10 years ago

why don't you try adapting what you already have to the conclusions of the Ogawa paper.

theq629 commented 10 years ago

Ok, I'll start working on that approach.

theq629 commented 9 years ago

I've committed and put up live sites of a new version. It still doesn't look too much like Ogawa, but I think it's largely that our data has different properties in terms of how often event lines are actually on the same node.

Known issues:

I think if that looks ok for now then I should probably stop playing with the layout, maybe work on the interface and then come back to the layout after we try using it more.

anoopsarkar commented 9 years ago

Yes, please. Let us integrate into the rest of the interface, and see how it looks before playing with the layout. For the integration, do we want to:

  1. choose the type of entity to be listed on the y-axis. e.g. choose Person or Location as the facet to be used for the y-axis. Selection on the facet view would constrain the storyline view.
  2. use text search as it does now, but integrated to the facet, map and timeline view. this would allow a mixture of entity types on the y-axis of the storyline

I like the former idea, but I'm open to alternatives.

No matter what we should do the following:

  1. I liked the names being explicitly provided but the names are missing now.
  2. Change name from Flow to Storyline
  3. Allow selection of the cluster (the event cluster) to narrow down the constraints. This will be similar to the maps view in which the constraints will be 1 or more markers.

Optionally:

  1. Hovering over each cluster could perhaps show a list of the events in that cluster. We could do the same in the map view and timeline view. This will give an immediate view of the text without selection.

Briefly about layout:

I feel that we can use a generic machine learning methodology to minimize divergence from straight lines instead of the complex genetic programming approach from the Tanahashi et al paper, or the even more complex search in the Liu et al paper.

theq629 commented 9 years ago

For integration, I wonder if maybe it's best to allow both 1. and 2., at least until we can play with it more. I'm imagining that where the text box goes right now there can be a drop down menu that allows a selection of either a single facet or 'enter query'. If you select enter query, then the text box appears next to the drop down menu and is initialized to the same thing the facet view would have been searching for.

However, what kind of of integration with the map and timeline view are you picturing for 2.?

The explicit names missing is just me forgetting to uncomment that after testing, so I'll put them back and do the other changes. For 3., should the constraint show up as a map constraint, or a new type?

anoopsarkar commented 9 years ago

I like the combination of 1. and 2. Let's do that.

The integration with other views is tied into selections in other views as well as in the storyline view. Basically the same way it works now between, let's say, facets and the map or vice versa.

For 3. I think it should be a new type: 1 story-event or 2 story-events, etc. (or a better name than that)

theq629 commented 9 years ago

For the layout, as I understand it the other (besides keeping straight lines) goal from Ogawa is to minimize line crossings. What do you think about algorithms to do both together?

Another design consideration here is whether to use fixed horizontal slices of the screen for lines (as Ogawa does) or to allow free placement (as I do right now). My current system has a lot of problems with bumpy lines, although I think that's mostly the overlap fixing part.

theq629 commented 9 years ago

Ok, I'll work on the 1. and 2. together option, then.

I think at some point we should consider more generally what kind of integration we want between views. The initial constraint managing code was designed around keeping views independent, but it seems like at some point we will want more overlap.

theq629 commented 9 years ago

All non-optional points above pushed and on the live demos.

On node selection: If you click a node it now adds a constraint. Click again to remove. Here the constraint works like the map view, in that it's treated as a single constraint internally and visually. So far this only supports the geo cluster version and not the location field version (the demo of which is still live), but it's easy enough to add to the later.

On the facet mode: When a facet is selected for the storyline view, the visualization is produced based on all global constraints except for those for that facet. I think this matches what we said previously. That is, selections in the facet constrain which entities appear on the storyline view, but do not constrain which events are used to produce the storyline. However, selections in other facets or other views do constrain the events the storyline uses.

On the defaults for the text query: Above I had mentioned initializing the text query box to the same query as the last used facet mode. However, it turns out this is pretty confusing since the query mode uses all global constraints and therefore doesn't actually produce the same results in this case. So we should probably not initialize the text box, but I haven't taken out the feature yet.

The optional node tooltips point: The tooltips already showed the clusters for each node, so maybe I'm not understanding this one. I don't think we have any way to show the actual database events briefly enough for a tooltip. Or do you mean to have the list of events on the left change when hovering on a node (or map view marker, etc.)?

anoopsarkar commented 9 years ago

When I select one person in the Facet view, the Storyline view should include all the other people that share events with the selected person (i.e. the people listed along with the selection in the Facet view). Right now, it only shows the people selected from the People facet which limits the number of events in all views including the Storyline view. Also there seems to be some bug when I select two or three people and the Storyline view shows no events at all and the description list is also empty. Some selections in the Storyline view also create a view with null events.

I'm ok with taking out the text query feature. We do not need so much functionality, I think.

I was thinking that the tooltips should show the description so that the events themselves can be examined without selection of a node in the Storyline view.

theq629 commented 9 years ago

I've changed it to using co-occurring entities. I assume that when we find other entities that share events with the entities selected in the facet, that's all the selected entities (ie a conjunction check rather than a disjunction).

I think I've fixed the errors that were making the storyline show no events. It turned out that the cluster merging wasn't merging all the way, so fixing the errors also means more clusters are merged now.

I'm leaving the text query for now since it's useful for testing, and will remove it later if it's still confusing. I did remove the automatic initialization for it, though.

Also:

Known bugs:

anoopsarkar commented 9 years ago

I was thinking about the similarity of the Storyline view and a graphical view of all git branches in a repository. The only difference is that a line in git when it branches actually splits into two lines, while in our case the single line has to traverse along the y-axis. I wonder if we can exploit the git branch view (which looks quite clean) to our Storyline y-axis problem (that the lines jump around so much due to the avoidance heuristic).

anoopsarkar commented 9 years ago

This is what I am thinking of apropos my last comment:

https://github.com/sfu-natlang/lensingwikipedia/network

anoopsarkar commented 9 years ago

If I make one selection in the Person facet and go to Storyline, then come back to Person and make another selection (an additional Person constraint) and then go back to Storyline, then everything works as it should.

However, if I go to the Person facet and make two selections one after the other and then go to the Storyline view I get a "Loading .." prompt until I go back to the Person facet and make a third selection.

This is related to the second of you Known Bugs above, but a somewhat different way to getting it.

theq629 commented 9 years ago

Do you know if the github branch visualization is available to use? If there's anything like that which is open source then we could definitely use it.

theq629 commented 9 years ago

Also, I haven't been able to reproduce that way of getting the loading bug. Is it definitely reproducible for you?

anoopsarkar commented 9 years ago

I found these links to be the most useful, but perhaps we can simply use the nice layout and scrolling view of the github branch vis and use d3 to implement it?

anoopsarkar commented 9 years ago

I cannot reproduce the multiple selection bug either. Probably a by product of the "Loading .." error you have in the known bugs above.

theq629 commented 9 years ago

I went through the three Javascript projects I found through those links (historyview.js, commits-graph, gitgraph.js), and while the code isn't set up to make it too clear, I don't think any of them are doing any non-greedy graph layout. Also the git braches case doesn't have to deal with grouping entity lines into clusters (for git branch lines just merge into each other). So I don't think the git visualizations will help us much.

theq629 commented 9 years ago

I think we probably should switch to fixed horizontal slices as in Ogawa, though. The overlap avoidance issues with my current version can probably be improved, but although the fixed slices limit the layout a bit they should mean there is no need to explicitly avoid overlaps (because all y-position changes can easily be made to happen between nodes).

So my thought is to try:

For the hill climbing step we can either minimize the total amount of entity line movement, or we could try to put similar vis clusters (based on shared data clusters) close to each other. Probably the later is worth trying since it adds a little more information, but if it's too chaotic then we minimize line movement instead.

anoopsarkar commented 9 years ago

Can we create a small instructive example which has a 4-6 clusters at different times and about 5-8 entities. I am not entirely clear about how the y axis ordering for each cluster can be solved by hill climbing. Each storyline has the following representation, I assume:

And the objective of the hill climbing is to place each cluster at the right y axis location to avoid crossing between entity lines?

theq629 commented 9 years ago

Let's make sure we say either data clusters or vis clusters (merged data clusters) to avoid confusion. I think if we did global hill climbing we can't directly minimize crossings (or rather we could, but I'm not sure it's practical to compute the metric). But we could try minimizing the total change in y-distance for entity lines between vis clusters, that is if an entity line goes from vis cluster A to vis cluster B then it incurs the penalty of the y-distance between the two vis clusters. Or we could alternately minimize the distance between similar vis clusters where similarity is determined by shared data clusters.

So I don't think global hill climbing will do a great job, but it might be ok as a start. And if we do want to have fixed vis cluster y-positions then it's easy to switch that part out later.

I think this largely does depend on how we want to handle y-positions, though. The above is based on the idea of fixing y-positions for each vis cluster, something along the lines of the Waterloo algorithm (but not exactly since they do the more complex process). Ogawa doesn't do fixed y-positions but works them out greedily in increasing time order and does use discrete y-slices. My current algorithm does less-greedy relaxation and free positioning.

theq629 commented 9 years ago

For what I was saying a couple posts ago about wanting to avoid explicit overlap detection, the overlap detection right now is because we get cases like this:

If we draw entity line L with a straight line or spline from A to E, it will probably cross one of the boxes for vis clusters B, C, or D.

On the other hand we could have L go to y=30 directly after A, or extend it at y=10 to just before E and then change go down to y=30. But various issues with the current free-placement relaxation make that hard to do. So in retrospect I think that's one reason why Ogawa (and the other two papers IIRC) first divide the y-space into discrete slices and then assign lines to the slices (the other reason presumably being to allow discrete optimization algorithms).

So it's the slicing that I think we really need to add, and the question about fixed y-positions is whether anything (data clusters, vis clusters, events) is given a global y-position or if we compute them entirely per time slice.

(Sorry for the temporary issue close, I clicked the wrong button.)

theq629 commented 9 years ago

I've pushed the hillclimbing version and the demos are up: reference points as clusters and location as clusters.

The hillclimbing puts all visualization clusters with the same data clusters on a single row and goes through the rows moves each row to the position that maximizes this metric against the two adjacent rows:

(number of shared entities) + (number of entity line connections / max entity line connections)

I've also made some improvements to how lines and nodes are shown, especially dynamic sizing and a better way to draw lines (d3.svg.diagonal instead of d3.svg.line).

Known bugs:

anoopsarkar commented 9 years ago

Looks ok for Hannibal but completely breaks down for Julius Caesar. In the latter the lines go up and down even in very small time increments. Perhaps there should be a constraint that we move a person on the y-axis rather than move the line radically on the y-axis. However, lines can move up and down over longer time periods. It seems that writing it down in English, our objective is something like:

  1. Two people should be close on the y-axis if they share many event nodes (reference points or locations) especially in a sequence in small time periods. Their y-axis locations can and should drift over longer time periods.
  2. Two event nodes (referencepoints or locations) should be close on the y-axis if they contain many people in common. Their y-axis locations can drift if they are farther apart in time even if they have many people in common.
theq629 commented 9 years ago

Given that there are relatively few visualization clusters per year, I wonder if a simple possibility to approximate that is to follow up the global hill climbing with per-year hill climbing. We could use the same objective but compute it only for the entity line connections in and out of that year. That way the global hilclimbing hopefully provides the desired behaviour over long time periods while the local hillclimbing hopefully fixes the the behaviour over short time periods.

theq629 commented 9 years ago

Actually now that I think about it more, I wonder if instead of doing regular optimization we could do a greedy optimization something like this:

for each year y in order:
    pretend any entity line that doesn't have a node at y has a node with only that line
    for each node n in order:
        place n at row that maximizes \sum_{m -> n} metric(m, n)    

where m->n means there is some line from m to n, and metric(m, n) is a function of:

anoopsarkar commented 9 years ago

I don't understand the second line in your pseudo-code. By "row" I assume you mean the y-axis coordinate.

theq629 commented 9 years ago

I mean discretized y-axis coordinate. The global hillclimbing version has a row for each set of data clusters. In the greedy algorithm we'd probably add rows as needed and then assign specific y coordinates afterwards.

theq629 commented 9 years ago

I just had another idea which I think I should try first since it's a relatively small modification.

We define rows as in the current hillclimbing (one row per unique data cluster set) and do the hillclimbing to order them. This gives an initial assignment of vis clusters to rows, but to get the final assignments we:

anoopsarkar commented 9 years ago

One thing that is really annoying is occlusion of names with other names. This should be fixed and is probably independent of other issues.

Another issue is that there is no reason for a node to be placed at a row just because a few time steps later the entity is going to be participating in a node in a nearby row. See the Eisenhower line below. Notice how it plunges up and down before meeting Kennedy's line a little later. Those two nodes before meeting Kennedy could have been placed along a straight line for Eisenhower and the line should only curve up to meet the Kennedy line at the shared node.

image

theq629 commented 9 years ago

What I was calling rows is similar to what Ogawa et al. call slots, so I am going to switch to calling them slots. (The difference being that above I was assigning unique date cluster sets to slots, whereas Ogawa et al. assign individual entity lines.)

I'm working on a new version.

theq629 commented 9 years ago

There is a new version up now. If it seems promising then I'll describe it in more detail, but basically it's using a simplified greedy version of my initial mean position relaxation along with slots. Slots are now per-line as in Ogawa et al..

Lines jumping around to avoid multi-entity vis clusters are definitely a problem, but in many cases I'm not sure it's avoidable unless we want to try some way of showing a line going through a vis cluster box without being part of the vis cluster. (Ogawa et al. draw lines going through each other sometimes, although I'm not sure I find that aspect of their demos very comprehensible.)

anoopsarkar commented 9 years ago

Looks good and better than before.

  1. There are some sine waves being produced: image
  2. In each slot, there should be one name, else we get occlusion of names quite a lot.
theq629 commented 9 years ago

The current algorithm works as follows.

Let nodes be

  1. each vis cluster is a node; and
  2. for each year y with at least one vis cluster, add a filler node for each entity that is not in a vis cluster at y. That is, we pad out entity lines with filler nodes so that the layout can route entity lines around vis clusters.

Let entity lines be sequences of line points representing the order in which entities occur at nodes. There is one line point for each place an entity line is at a node.

We define priority as an ordering for placing things, so we can be greedy. The priority of an entity e is

\sum_{node n with e} (number of entities at e - 1)

The idea is that if an entity line interacts with other entity lines a lot then we want to give it a higher priority. Additionally, some entities can be designated important, and their priorities are doubled. For current purposes important entities are just those selected in the facet.

The priority of a node n is

\sum_{entity e at n} priority(e)

Note that since we don't normalize against the number of entities at n, nodes with more entities get higher priority; this is again because they will affect more entity lines.

I haven't tried changing the definition of priority yet, so my intuition could be totally wrong.

Slots are positions in a complete order for the y-axis, which we maintain as a list of slots. We will assign each line point to a slot, and each node to as many slots as needed for the line points on it. The drawing code will assign actual screen positions for each slot.

The algorithm is then:

def desired_slot_index(node n):
    centre := mean(slot index of previous line point on l for each line point l at n)
    round(centre - (num entities at n) / 2)

def update_year(year y):
    for each node n at y in decreasing order of priority:
        i* := desired_slot_index(n)
        if i* is undefined (ie all entity lines at n start at n):
            insert enough new slots at bottom to fit n
            i := first free slot index at the bottom
        else
            i := slot index with enough free slots at y for n and min |i - i*|
            if i is undefined (ie no space anywhere for n):
                add enough slots for n at which ever of the top or bottom is closest
                i := first index for new free slots
        assign n to slots at indices i, i+1, ..., i+(num entities at n)
        sort the line points at n by slot index of previous points on their entity lines
        assign each line point at n to its respective slot

for each year y in increasing order:
    update_year(y)
for each year y in decreasing order:
    update_year(y)

Basically we go through each year in order, at each year placing nodes and their line points in order of priority. Each node goes at the closest available position to the ideal position based on incoming entity lines. First we do that forward, and then backwards to straighten out any lines that wiggled around when going forward but in retrospect actually had space to go straight.

For example, with "Ptolemy" selected on a small data set, this is the result after the forward pass: greedy-layout-forwardonly

And the final result after both passes: greedy-layout-forwardbackward

Later note: after factoring out the general layout code into storylinelayout.js, what is called a vis cluster here is there called a node or input node. What is called a node here is there called a vis node. The term 'vis cluster' is still used in the calling code that generates the specific plot.

theq629 commented 9 years ago

An issue with the greedy algorithm is that low-priority lines sometimes jump around a lot. For example here is a case with "Hannibal" selected on the full data where a line for Lucius Aemilius Paullus would be better if its top part was moved down three slots. This isn't too easy to deal with since it means some heuristic method for readjusting greedily placed nodes. However, I'm not sure yet that it's really a significant problem. greedy-layout-jumping

theq629 commented 9 years ago

I'll look into the sine wave case. That doesn't seem to fall into the greediness issue that I mention above, so it might be something to do with how filler nodes interact.

As for name labels, right now they are fixed to the position of the first line point on their corresponding entity line. The case where they overlap horizontally is when one line takes over a slot from another, but I think vertical overlap from small slots is potentially also an issue. So I think maybe there should be a separate layout system (hopefully d3 has something) to position the labels, and then either we colour them to match the lines or we draw smaller lines from each label to its entity line.

anoopsarkar commented 9 years ago

Can we fix the name overlap issue asap? I want to show a demo of the Storyline view on Friday morning if possible. With the name overlap issue fixed I can easily show a demo.

One more thing: what happens when I click on a node? Should this not select the events in that node?Instead the Storyline view inexplicably changes to something else and the events selected are not what I expected.

theq629 commented 9 years ago

I think I can fix (or at least improve) the overlap issue in time for your Friday, I just need to finish an essay for school first.

For the selections, can you check if the behavior in manual query mode is what you expect? I think probably the storyline facet mode isn't currently excluding its own constraints from its data query.

theq629 commented 9 years ago

I've added label layout. It detects horizontal overlap and moves the later label to a different row in a greedy manner similar to the vis cluster layout. Very long labels that overlap with a lot of previous labels may be left overlapping (the alternative would be to shift them forward greedily and hope it doesn't make things worse). I also made labels colour coded and slightly smaller.

If I messed anything up and you want to get at the previous version, add '.old' to the end of the URL.

I think the node selection bug is what I said above, but since I'm out of time to properly test any change to it before your Friday morning I'll leave it for after the demo. Other known outstanding bugs are the sine wave bug and the not updating on facet changes bug.

anoopsarkar commented 9 years ago

Did you push the changes to the testing server? I don't see any difference in horizontal overlap. The names still occlude each other as I scroll to the right. Perhaps you can give me an example I can use for the demo that works for now.

theq629 commented 9 years ago

Yes, the testing server has it. For me it's working for any of the top selections of the person facet (for selections that produce a lot of lines the labels are still pretty close, but better than before). Are you sure your browser refreshed properly?