sfu-natlang / lensingwikipedia

Lensing Wikipedia is an interface to visually browse through human history as represented in Wikipedia. This is the source code that runs the website:
http://lensingwikipedia.cs.sfu.ca

serious bug: backend crashes on large queries #133

Closed anoopsarkar closed 9 years ago

anoopsarkar commented 9 years ago

any query that covers about 2000 events or more is causing a backend crash right now.

e.g. select any instance in the Category facet that is larger than 2000 events. the backend will crash.

this is almost certainly due to the new code pushed to the live site. as of now the cause is unknown.

theq629 commented 9 years ago

Where does the crash show up? I haven't tried my own server running with the merged code yet, but on the live site frontend I don't see any sign of a backend crash (there seems to be an error from the frontend).

anoopsarkar commented 9 years ago

Here is how to reproduce the backend crash in the smallest number of steps:

  1. Select Popes [1808] from Category facet in Facet view
  2. Go to Storyline view and click on "Category facet"
  3. Click on Clear all constraints.

The backend crashes.

We should probably remove the Category facet from the Storyline view, as its elements seem to appear across large stretches of time and are tightly intertwined.

You can also crash the backend by selecting a large number of points in the Cluster view.

(I should say, please do run a local server for testing, so that the live site does not crash -- we should add instructions on how to do this in website.md.)

theq629 commented 9 years ago

Ok, I can reproduce it now. It seems likely it's a case where the backend has to touch too many database entries, so I'll see if I can improve the whoosh query.
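
For context, a minimal sketch of the kind of Whoosh query that gets expensive here; the index path, field name, and category values are assumptions for illustration, not the actual backend code:

```python
# Illustrative only: an OR query over every co-occurring category value.
# With thousands of subqueries (as in the pope case below) this touches a huge
# number of index entries and becomes very slow.
from whoosh import index
from whoosh.query import Or, Term

ix = index.open_dir("whoosh/wikipediahistory")  # index path assumed from the URLs in this thread

cooccurring_categories = ["Popes", "Cardinals", "Rome"]  # in practice ~6000 values

with ix.searcher() as searcher:
    query = Or([Term("category", value) for value in cooccurring_categories])
    results = searcher.search(query, limit=None)  # no limit: fetch every matching cluster
    print(len(results))
```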

theq629 commented 9 years ago

The problem turns out to be that the categories tend to strongly overlap with each other. The first step in handling the query at the backend is to identify co-occurring entities (since each should be a line in the visualization), and the second step runs another whoosh query to find all the relevant data clusters. In the pope case, there are about 6000 co-occurring categories, which makes the second query extremely slow.

Since 6000 lines is also too many to visualize, I think probably the best solution is to have the backend only return the n most frequent entities, which also places a hard limit on the number of lines in the visualization. Does that sound ok?
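
A minimal sketch of that cutoff, assuming the co-occurring entities can be counted before the second query is built (the function and variable names are illustrative, not the real backend code):

```python
from collections import Counter

def top_cooccurring_entities(entity_lists, n=10):
    """Keep only the n most frequent co-occurring entities.

    entity_lists: one list of entity values per event matching the selection.
    """
    counts = Counter()
    for entities in entity_lists:
        counts.update(set(entities))  # count each entity once per event
    return [entity for entity, _ in counts.most_common(n)]

# Example: heavily overlapping category sets, as in the pope case.
events = [
    ["Popes", "Cardinals", "Rome"],
    ["Popes", "Rome"],
    ["Popes", "Cardinals"],
]
print(top_cooccurring_entities(events, n=2))  # ['Popes', ...] -- ties can come out in either order
```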

anoopsarkar commented 9 years ago

yes, that does sound reasonable. I guess we can put it in the settings file with other parameters.

theq629 commented 9 years ago

The xkcdplottimelinescleaned branch and http://champ.cs.sfu.ca/WikiHistory/latest/whoosh/wikipediahistory have the limit now.

I'm finding that the category facet is possibly useless for the storyline, though. The queries are still slow and results are too tangled to see much. So I think maybe we should add a domain settings option to exclude facets from the storyline, and then exclude category.

anoopsarkar commented 9 years ago

yes, I also suggested removing Categories. should we put the limit as an extra selector in the interface?

anoopsarkar commented 9 years ago

what is the parameter value set to now? it might be useful to include as a selector.

we should push the limit as a hotfix to master immediately, as I notice more people are visiting the site and it has been crashing of late. the traffic might have to do with a talk announcement: I am planning to talk about lensingwikipedia at UC Boulder in February, so we may have a spike of visits going on right now and soon after the talk.

theq629 commented 9 years ago

Oh right, that's what you said in the first place. In that case, I've pushed the facet exclusion as well. The default is for wikipediahistory to use person and organization, while avherald only uses organization.

The two new parameters in the backend settings are plottimeline_max_cooccurring_entities and storylineUseFacets.

The co-occurring entity limit defaults to 10 right now, but it can be changed per server. The reason that I didn't include an interface selector as well is that we would also want the backend hard limit, to avoid the current backend freezing situation. Right now we don't have any way to communicate backend settings to the frontend, so it wouldn't be clear how large the value could be before the backend starts ignoring it.
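
A hedged sketch of what these might look like in a per-domain backend settings file; only the parameter names and defaults come from this thread, the file layout itself is an assumption:

```python
# Storyline limits (values as described above; layout is illustrative).
plottimeline_max_cooccurring_entities = 10  # backend hard cap on co-occurring entities

# Facets the storyline is allowed to query; Category is deliberately excluded.
# This is the wikipediahistory default; avherald would list only "organization".
storylineUseFacets = ["person", "organization"]
```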

theq629 commented 9 years ago

I've made the backend report both the total number of co-occurring entities and the number it actually returns (after the cutoff). On my demos I've added a frontend status text that indicates how many co-occurring entities are included and whether they are clipped or not ("top" or "all").

So if you do want a frontend interface selector, the frontend can now indicate when the selected number is getting ignored due to the backend cutoff. We just can't enforce the cutoff while the user is changing the selector, without a bunch more work.
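
A minimal sketch of how the backend might report both counts so the frontend can render the "top"/"all" status text; the field names and response shape are illustrative assumptions, not the actual API:

```python
def storyline_entity_counts(cooccurring_entities, limit):
    """Apply the cutoff and report both counts, so the frontend can show
    'top N of M co-occurring entities' versus 'all N'."""
    top = cooccurring_entities[:limit]
    return {
        "entities": top,
        "numincluded": len(top),                       # what the second query actually uses
        "numtotal": len(cooccurring_entities),         # before the cutoff
        "clipped": len(cooccurring_entities) > limit,  # drives the "top" / "all" status text
    }
```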

(The status text may show more co-occurring entities than are visible; that's probably the single-cluster line issue already reported in #111.)

theq629 commented 9 years ago

The changes for single-cluster lines discussed in #111 do seem to fix the miscounting. I also had to make some other fixes to the co-occurrence cutoff, which don't affect the main purpose.

anoopsarkar commented 9 years ago

I've asked @KonceptGeek (Jasneet) to limit the size of queries in the Cluster view as well. Users should not be able to select large chunks of the data. Once he puts in a pull request for that, we can close this issue.
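
Purely as an illustration of the kind of limit being asked for (the real change is a frontend one, and the threshold here is an assumption based on the crash reports above):

```python
MAX_SELECTED_POINTS = 2000  # assumed threshold; queries around this size were crashing the backend

def check_cluster_selection(selected_point_ids):
    """Reject Cluster-view selections that would trigger an oversized backend query."""
    if len(selected_point_ids) > MAX_SELECTED_POINTS:
        raise ValueError(
            "selection of %d points exceeds the limit of %d"
            % (len(selected_point_ids), MAX_SELECTED_POINTS)
        )
```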

anoopsarkar commented 9 years ago

Also, Max's hotfix for large Storyline requests to the backend is on the live site now.