nlplab / brat

brat rapid annotation tool (brat) - for all your textual annotation needs
http://brat.nlplab.org

Slow saving of annotation for large files #995

Open ned2 opened 11 years ago

ned2 commented 11 years ago

I'm finding that saving annotations for a file with about 600 lines is taking around 10 seconds to complete (on the latest version of Google Chrome and with a trivial annotation configuration). This seems like an excessively long period of time. Breaking the file up into 200 line chunks reduces the saving time drastically but I wouldn't have thought that 600 lines is really all that large and it would be preferable not to have to break the file up. Is it just that this is considered "large" for brat?

ghost commented 11 years ago

First of all, I am assuming that by "saving" you mean the loading of the page. If not, the rest of my reply will look a bit silly. '^^

Yes, I think you are correct that a 600-line document runs slowly under the current version of the brat client. I am mainly responsible for the server back-end and haven't been poking around all that much in the client. There have been efforts, and there are ideas, on how to speed up the client rendering. I am sure that @amadanmath would be happy to fill you in on this, and we could really use some more hands on deck.

Also, @amadanmath, do we have a main issue for the client being slow? If so, reference it here and close this one and we can move the discussion there.

ned2 commented 11 years ago

What I mean is hitting 'OK' on the New Annotation dialogue. Then brat takes about 10 seconds or so before it becomes responsive and the new annotation is recorded on the span. If this involves a complete page refresh, then yep, this is the problem.

I'm presuming there's a reason why a complete page refresh is done rather than just updating the relevant bit of the DOM?

Happy to help out where I can, however I don't have a lot of experience in developing web apps, so may not be of much use.

amadanmath commented 11 years ago

You are correct: updating only the relevant bit of the DOM was complex and we wanted to have it functional fast, so the whole SVG is refreshed. This is the primary cause of the slowness. The problem is that a small change in the annotation (for example, drawing a new arc) can have rather large consequences for the location of many objects. We did start rewriting the relevant bit of code, but the work stalled because of other responsibilities, and in the meantime the branch was made obsolete by some other, rather pervasive changes, so it would probably have to be redone. It requires significant refactoring beforehand.

spyysalo commented 11 years ago

You might also try switching validation off by setting

Validation      validate:none

in tools.conf. I suspect there may be some O(N^2) part in the validation algorithm (N = annotation count).

kottmann commented 11 years ago

Thanks for the hint to turn off validation. We have tested this for a while now and it improves things a bit, but the UI can still be very unresponsive with a 350-line document.

ghost commented 11 years ago

@spyysalo: Not to gloat or anything, but didn't I fight you (and lose) over having validation turned off by default? '^^ I am adding this "hack" to our troubleshooting page.

reckart commented 11 years ago

We also experienced slowness as the document size grows and eventually decided to implement a pagination mechanism. We did some minor changes to the brat JavaScript. Since the pagination is mainly controlled from the server side (the state is maintained there), the changes may boil down to allowing the row number to start at an arbitrary index. Anyway, if you are interested, this is all available in the WebAnno project. The pagination has helped us to visualize and edit quite large documents with thousands of sentences using the brat visualization.

ghost commented 11 years ago

@reckart: I think that sounds like an excellent solution. Since we lack server state we could do the same thing but maintain the state in the URL as a sentence (perhaps even character) offset or similar.
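A minimal sketch of the state-in-the-URL idea, written in Python for brevity (the brat client itself is JavaScript). The `offset` parameter name and the fragment layout used here are assumptions for illustration only, not existing brat behaviour.

```python
from urllib.parse import parse_qs, urlencode


def read_offset(fragment: str, default: int = 0) -> int:
    """Return the sentence offset stored in a URL fragment, or `default`."""
    _, _, query = fragment.partition("?")
    values = parse_qs(query).get("offset")  # hypothetical parameter name
    if not values:
        return default
    try:
        # Negative offsets make no sense, so clamp them to zero.
        return max(0, int(values[0]))
    except ValueError:
        return default


def write_offset(fragment: str, offset: int) -> str:
    """Return `fragment` with its `offset` parameter set to `offset`."""
    path, _, query = fragment.partition("?")
    params = parse_qs(query)
    params["offset"] = [str(offset)]
    # doseq=True keeps any other parameters that were already present.
    return path + "?" + urlencode(params, doseq=True)
```

Because the offset lives in the URL, a reload or a shared link would reopen the document at the same position without any server-side state.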

ned2 commented 10 years ago

So I've been digging around a bit, and it seems like a significant part of the problem might be the extremely large number of elements within the SVG that renders the document. Apparently inline rendering of SVG starts to get rather slow with a large number of elements. I've read reports that "large" could be as few as 500 elements. For a 477 line document in brat, I'm counting 22033 SVG elements. So this sounds like a plausible candidate. I also noticed that this includes a large number of group elements (9968) for positioning the annotation spans (eg <g transform=....>) which are created even if there is no annotation for that line. Would it be possible to only create these elements when there actually are annotations for that line? This would drastically reduce the number of elements in the SVG.

I tried to see if I could make this change to test it out but struggled to work out how I would do this. Presumably somewhere in visualizer.js.
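For anyone who wants to reproduce this kind of count, here is a rough sketch using Python's standard library on an SVG saved from the browser. It assumes the SVG declares the usual SVG namespace; the `sample` string is made up for illustration and is not real brat output.

```python
import xml.etree.ElementTree as ET
from collections import Counter

SVG_NS = "{http://www.w3.org/2000/svg}"


def svg_stats(svg_text: str) -> dict:
    """Count all elements, <g> elements, and <g> elements with no children."""
    root = ET.fromstring(svg_text)
    # root.iter() walks every element, including the root <svg> itself.
    tags = Counter(el.tag.replace(SVG_NS, "") for el in root.iter())
    empty_groups = sum(1 for el in root.iter(SVG_NS + "g") if len(el) == 0)
    return {
        "total": sum(tags.values()),
        "groups": tags["g"],
        "empty_groups": empty_groups,
    }


# Illustrative input only: one populated row group and two empty ones.
sample = """<svg xmlns="http://www.w3.org/2000/svg">
  <g transform="translate(0, 10)"><text>row 1</text></g>
  <g transform="translate(0, 20)"></g>
  <g transform="translate(0, 30)"/>
</svg>"""

print(svg_stats(sample))
```

The `empty_groups` figure gives an upper bound on how many elements could be saved by creating groups only for lines that carry annotations.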

spyysalo commented 10 years ago

@ned2 : interesting observation. Looking for svg.group() calls in visualizer.js should identify all the points where groups are created. A potential challenge in revising this is that at the point where the groups are created, it might not be straightforward to determine whether they will end up empty or not. @amadanmath : suggestions?

xtsimpouris commented 10 years ago

Is it safe to assume that there has been no progress on this one? I will start working on brat, but I will have large texts, and this will be a serious issue.

So, while I am an experienced web and Python programmer, I find it difficult to understand where and how to play with the code. Any ideas on where to start?

spyysalo commented 10 years ago

@xtsimpouris: great, we'd be very happy to welcome contributions on this!

Just to get started quick, I'd suggest to split your texts into smaller chunks. The brat tools include support for splitting and re-joining documents and annotations.

There are two broad alternatives for addressing this issue: optimize the implementation within the current architecture (several smaller issues), or switch to a more efficient architecture (a few bigger pieces of work). For the former, I could help set up profiling; for the latter, I could suggest some ideas for larger-scale revisions. If you can find the time to work on this, let me know which way you'd prefer to go!

xtsimpouris commented 10 years ago

@spyysalo: not exactly. Splitting beforehand may create an issue in case annotation is needed on that exact point.

So far, I have succeeded in propagating an offset parameter from JavaScript to the _document_json_dict function, and have run into the following problems:

  1. Chopping the text within the _document_json_dict function while still giving the whole ann_obj object back to the client means that the JavaScript cannot handle annotation data on the missing text. Even if it could (I tried a little, but got lost in the many errors that showed up), it would somehow have to know, from the offset parameter, that text is missing.
  2. Making TextAnnotations offset-aware means that this has to work only when loading the annotation file for viewing, and not for editing or deleting. While I have overcome this problem for text annotations, I still have to check all other kinds of annotations (events, attributes, etc.) for conflicts in case a specific T## cannot be found; for now it shows errors such as "Non-trigger '##' used by '##' as trigger". The JavaScript still has to be offset-aware in order to recalculate positions for drawing.
  3. The offset should be user-defined. The configuration should support activation/deactivation, plus step and paging attributes. Step and paging should always be different, to avoid splitting the text at a place the annotator wants to annotate. I haven't checked this part yet and don't know how to do it.
  4. As far as I can understand, it would be better to avoid working on both the client and the server side. Maybe we should make changes only on the client side: the server works as it does now, and the client just skips the things it shouldn't work with, to keep the browser UI fast.
  5. Somehow I have to add buttons so that the annotator can "step" within the UI. I haven't checked that yet and don't know how to do it.

Am I on the right track? :) Any ideas appreciated

spyysalo commented 10 years ago

Yes, I think 4) is a very good idea. It should be possible to speed things up considerably by implementing purely client-side support for viewing / editing just a part of a document at a time. Server-side support can come later if necessary.

UI controls can be added either to index.xhtml or created on the fly in visualizer_ui.js or annotator_ui.js.

@amadanmath : any suggestions on how to approach this?

amadanmath commented 10 years ago

I agree that it would be easier to do it in JavaScript-only. data would have to be split between the "canon" data (what we got from the server) and the rendering data (what we're going to paint on the screen); then, on render, you'd filter the former into the latter, moving all standoffs in the process (taking care to add the offset back on editing operations).

I would also recommend creating virtual "bookend" spans at equally virtual rows -1 and N + 1. They would not be rendered, but they could serve to display the arcs that connect things beyond the slice. That would take care of the missing span errors.
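To make the canon/render split and the bookend idea concrete, here is a rough sketch in Python (the real change would live in visualizer.js). The `Span` type and the function names are hypothetical, and for simplicity a span that straddles the slice boundary is treated as virtual rather than clipped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    id: str
    start: int  # character offset into the full document text
    end: int
    virtual: bool = False  # True for spans outside the rendered slice


def to_render(canon: list[Span], lo: int, hi: int) -> list[Span]:
    """Project canonical spans onto the slice [lo, hi) of the document.

    Spans fully inside the slice are shifted so the slice starts at 0.
    All others are kept but marked virtual and pinned to a zero-width
    "bookend" before (offset 0) or after (offset hi - lo) the slice, so
    arcs that reference them still have an endpoint.
    """
    rendered = []
    for s in canon:
        if lo <= s.start and s.end <= hi:
            rendered.append(Span(s.id, s.start - lo, s.end - lo))
        else:
            edge = 0 if s.start < lo else hi - lo
            rendered.append(Span(s.id, edge, edge, virtual=True))
    return rendered


def to_canon_offset(render_offset: int, lo: int) -> int:
    """Translate an offset chosen in the rendered slice back to the document."""
    return render_offset + lo
```

Every span ID from the canonical data survives in the render data, which is what avoids the missing-span errors; only `to_canon_offset` needs to run when an edit is sent back to the server.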

With this,

  1. No serverside changes needed. Paging is a clientside procedure, does not incur document loading penalty.
  2. The rendering data would contain all the spans, but some would be marked as virtual. Thus, no "missing T##".
  3. The step size and slice size would probably be best in the option dialog.
  4. With this idea, it's clientside-only.
  5. "Steps" could be implemented on the Up and Down keys, very similarly to how document change is implemented on Left and Right (and one can, if desired, add the appropriate buttons in pretty much the same way).

Bad news: Separating render data and received data in visualizer.js is a fair bit of work, since data is used in a million places.

Good news: The animation branch has this mostly done, with receiving sourceData and transforming into data.

Bad news: That branch was abandoned a fairly long time ago and is now rather unmergeable (it also includes a non-functional, half-baked animation-based rendering idea, which should be weeded out), so only the main idea is actually worth taking.

xtsimpouris commented 10 years ago

OK then. Any ideas where to start? Also, where should I start reading for client-side options set by the user? (global parameter for slicing and offset)

amadanmath commented 10 years ago

@xtsimpouris: Working on it. I thought I could get it done today, but couldn't.

xtsimpouris commented 10 years ago

As a further optimization, not necessarily to be implemented now but worth keeping in mind: render only from the offset to the end of the viewable window. Then there is no need for an extra parameter to specify the "page length". Going "down" or "up" can reuse the data already received to re-render the window, without requesting the same data from the server again.
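A rough sketch of that idea, in Python rather than the client's JavaScript. It assumes row heights are already known or can be estimated before drawing, which may not hold in brat, where a row's height depends on the annotations stacked above its text. `visible_rows` is a hypothetical helper, not an existing function.

```python
def visible_rows(row_heights: list[int], first_row: int, viewport_height: int) -> range:
    """Return the range of row indices that fit in the viewport.

    Rows are taken from `first_row` onwards until the next row would
    overflow `viewport_height`; at least one row is always returned
    when `first_row` is a valid index.
    """
    used = 0
    last = first_row
    while last < len(row_heights):
        # Always accept the first row, even if it alone is too tall.
        if last > first_row and used + row_heights[last] > viewport_height:
            break
        used += row_heights[last]
        last += 1
    return range(first_row, last)
```

Stepping down would then mean calling this again with a later `first_row`, using the data already held by the client.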

@amadanmath, is there a way I can help? I started reading the JavaScript a little and got **seriously** lost.

amadanmath commented 10 years ago

Hmm... I have to go home now, but it's not quite yet debugged and working. If you want to hack on it during my night, feel free. Note the line "XXX DEBUG PURPOSES" - use it to easily set the page extent to check results. The numbers indicate sentences. Current bugs that I know of, that I couldn't find yet or didn't have time to plug so far:

jnieuviarts commented 10 years ago

Hi, how do you skip to next page/step ?

amadanmath commented 10 years ago

Shift-Up and Shift-Down, for the moment.

jnieuviarts commented 10 years ago

Thanks for the very quick answer. It works but when i use it, the page is refreshed and pagination is set back to 1st page. Is it a known bug ?

amadanmath commented 10 years ago

Can you please describe in more detail? I am not sure what you are referring to. If it refreshes the page and resets the pagination when you try to use it, I can't really imagine how "it works".

Some clarifications, just in case. There are unfortunately two related concepts that can be called "paging": the in-document paging that I just implemented in the paging branch, and the document autopaging that is kind of like a slide-show of different documents. An additional source of confusion is the fact that the Shift-Up and Shift-Down used to be assigned to autopaging, but this patch reassigns them to Shift-Left and Shift-Right to free up Shift-Up and Shift-Down for in-document paging. (Someone please think up some better names for these features...)

Also, at the time of this comment, this patch is not yet merged into the master branch; be sure you are on the paging branch (git checkout --track origin/paging), or none of this will work.

In case you are on the correct branch, could I please have you answer: What are the paging parameters in your Configuration in your case? What did you press to trigger the unwanted behaviour? Did the document change, or are you still on the same document? What was the first displayed line number before it happened, and after?

jnieuviarts commented 10 years ago

Yes, I do understand both types of paging. I pulled the paging branch to be sure, but I still have the problem. My paging parameters are: size=30, step=20. When I want to change page, I press Shift+Up / Shift+Down and it works: the document changes. When I first load the page, the first line number is 1, and when I press Shift+Down, the first line number is 21. The only problem is that when the tagging of a token is saved, I get kicked back to line 1. Thanks for the help.
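For reference, the line numbers reported here follow from simple size/step arithmetic, sketched below in Python. `page_bounds` is a hypothetical helper, not the code in the paging branch, and it assumes the document has at least one sentence.

```python
def page_bounds(page: int, size: int, step: int, total: int) -> tuple[int, int]:
    """Return the 1-based (first, last) sentence numbers shown on `page`.

    `page` counts from 0. Consecutive pages overlap by `size - step`
    sentences when `step < size`. Assumes `total >= 1`.
    """
    if size < 1 or step < 1:
        raise ValueError("size and step must be positive")
    first = min(1 + page * step, total)
    last = min(first + size - 1, total)
    return first, last
```

With size=30 and step=20, page 0 starts at line 1 and page 1 at line 21, and neighbouring pages share 10 lines. Keeping the current `page` value across a re-render after an edit is what the reported behaviour is missing.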

amadanmath commented 10 years ago

Ah, I see, it resets on edit? Yes, it probably does - I did not try to edit with the new code in place. I'll have to think a bit about how to solve it.

(Sorry for the barrage of questions and explanations, I wanted to cover all the bases since I didn't know where the issue was.)

alexbrandsen commented 5 years ago

Hi, I'm running into the same issue, is the paging feature integrated in the master branch? I would expect it is, but can't find anything in the config file about this. Thanks!

amadanmath commented 5 years ago

It is in master, yes. There is no config; I believe it's in the Data menu.

alexbrandsen commented 5 years ago

Thanks for the quick reply!

Oddly though, I'm only getting the following options in my Data menu:

[image: brat]

amadanmath commented 5 years ago

Could be me misremembering. Maybe options menu then? (I can't check now)

alexbrandsen commented 5 years ago

That was the first place I looked, but no joy:

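[image: brat 1] https://user-images.githubusercontent.com/29043839/49887350-645b5480-fe3c-11e8-858b-46335d32dace.png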

amadanmath commented 5 years ago

Okay, I checked, now that I have slept and am sane :) It is in master, and you should definitely find it in Options. By your screenshot, I can only conclude you are not running the latest master branch. As you can see here, the option for paging is not conditional, and should show up in the Options menu:

https://github.com/nlplab/brat/blob/master/index.xhtml#L303

Goran

alexbrandsen commented 5 years ago

Hi Goran,

apologies for the very late reply, I have been on holiday. Thank you for checking this, you're right, I must not be running the latest master branch (can't remember exactly how/when I installed it as it was a while ago). Will try updating and that should solve it. Thanks!

Cheers,

Alex.

amadanmath commented 5 years ago

Just note that there were several updates lately that might have broken things. master is now for Python 3; if you are still on Python 2, switch to the python2 branch instead. And yell if you notice bugs! :)

Goran

alexbrandsen commented 5 years ago

I'm still on python 2 on this server, so grabbed the python2 branch and it works now! Thanks a lot for your help!