nlplab / brat

brat rapid annotation tool (brat) - for all your textual annotation needs
http://brat.nlplab.org

Discontinuous spans #362

Closed spyysalo closed 12 years ago

spyysalo commented 12 years ago

Implement support for discontinuous spans, for marking e.g. "alpha[...]actin" in "alpha and beta actin" as a single entity mention.

This will require some modifications to most parts of the system, starting from the standoff representation.

No milestone or assignment yet, opening this for discussion.

ghost commented 12 years ago

@spyysalo: Do we have a user requesting this?

spyysalo commented 12 years ago

Yes, me.

spyysalo commented 12 years ago

Suggestion for format extension:

Current is like

T5      Protein 305 319 interleukin-10

this could be extended into

    T5      Protein 305 316 350 352 interleukin 10

or alternatively something like

    T5      Protein 305 316 350 352 interleukin[...]10

where [...] or a comparable "special" string marks the gap in the reference text.

spyysalo commented 12 years ago

No way this will make the demo, too much risk of breakage and too little time to make sure everything's OK. Moving to later milestone.

spyysalo commented 12 years ago

On reflection, the format

T5      Protein 305 316 350 352 interleukin 10

is quite confusing. Perhaps something like

T5      Protein 305 316;350 352 interleukin 10

instead?

... or maybe

T5      Protein 305 316; 350 352 interleukin 10

ghost commented 12 years ago

I am in favour of:

T5      Protein 305 316;350 352 interleukin 10

In my opinion it is more sensible, since the space and `;` have distinct semantic meanings.

Still, wish we had a tab after the type. Would be awesome:

>>> line = 'T5\tProtein\t305 316;350 352\tinterleukin 10'  # with the wished-for tab after the type
>>> [(int(s), int(e)) for s, e in
...  (p.split(' ') for p in line.split('\t')[2].split(';'))]
[(305, 316), (350, 352)]

spyysalo commented 12 years ago

Work on this has stalled a bit, but as of the last commit (fe6dba05cc3fb316e94c1f351e4b75b7cd6e29bc) the server side for reading and sending discontinuous annotations should basically work.

@amadanmath: would you perhaps have some time to look into the client-side (primarily visualizations) for these at some point?

amadanmath commented 12 years ago

Sure, if you'll give me some protocol details, and an annotation file that exhibits it.

spyysalo commented 12 years ago

Great! Here are the minimal test files I've been using:

test1.txt:

Alpha- and beta-actin were studied.

test1.ann:

T1  Protein 0 6;16 21   Alpha-actin
T2  Protein 11 21   beta-actin

The protocol is a simple extension of the previous one (see fe6dba05cc3fb316e94c1f351e4b75b7cd6e29bc): instead of passing [start, end], the server provides [[start1, end1], [start2, end2], ...].
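
As a rough illustration of the format and protocol described here, the following Python sketch parses a text-bound line with semicolon-separated fragments into the nested [[start1, end1], ...] form. The function name `parse_textbound` and the assumed tab-separated layout (ID, then type and offsets, then reference text) are hypothetical and chosen for illustration; this is not the code from the commit.

```python
def parse_textbound(line):
    """Parse one text-bound annotation line into (id, type, offsets, text).

    Assumed layout (hypothetical, for illustration only):
    ID <TAB> TYPE START END[;START END...] <TAB> TEXT
    """
    ann_id, type_and_offsets, text = line.rstrip("\n").split("\t")
    ann_type, offset_str = type_and_offsets.split(" ", 1)
    offsets = []
    for fragment in offset_str.split(";"):
        start, end = fragment.split()
        offsets.append([int(start), int(end)])
    return ann_id, ann_type, offsets, text


doc = "Alpha- and beta-actin were studied."
ann_id, ann_type, offsets, text = parse_textbound(
    "T1\tProtein 0 6;16 21\tAlpha-actin"
)
# Slice the document with each offset pair to recover the fragments.
fragments = [doc[s:e] for s, e in offsets]
print(offsets)    # [[0, 6], [16, 21]]
print(fragments)  # ['Alpha-', 'actin']
```

A continuous annotation such as T2 simply yields a one-element list, so a client could treat both cases uniformly.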

The current hack for visualization is to just use the first start and last end for the start and end points of a normal span (from client/src/visualizer.js):

    var start = spans[0][0];
    var end   = spans[spans.length-1][1];
    var span =
              //      (id,        type,      from,  to,   generalType)
              new Span(entity[0], entity[1], start, end , 'entity');

As we briefly discussed before, discontinuous span annotations should preferably be visualized as a set of separate span boxes, connected by some type of visually "special" connector (e.g. dashed black line connecting the lower corners). I'm guessing there will be some trickery necessary to get outgoing connectors to look nice, but to get started, it would be great if the basic visualization could be tweaked to show them as discontinuous.

amadanmath commented 12 years ago

Okay, I've been mulling over this for the past couple of days, and I am unsure about several things.

and given arcs:

verb-1 noun
verb-2 noun
verb-3 noun

where do the arcs go? For verb-1 and verb-3, one idea is to go from the closest noun-part; but then it is not obvious that they connect to the same thing.

Awesome ASCII drawing:

        _______________      _______________
       /        _______(noun)_)_____        \
      /        /             /      \        \
(verb)       (#)       (verb)       (#)       (verb)
verb-1   noun-part-1   verb-2   noun-part-2   verb-3

i.e. we insert each noun-part as a separate span, connect them to a dummy noun span (which is zero-width, and set at center point of all parts), then connect verb spans to that dummy noun span. We do this only for spans that have more than one offset pair. Also, try to make it so that all of the noun highlights together. How's that?

I don't want to connect the lower corners, since it would either look ugly, or would necessitate that those arcs go under the boxes (which would be a hell to do). But under this example, we can connect the tops of the subspans to middle of the superspan, which should look okay, and we get it almost for free (since that's exactly what we do with other arcs as well).

spyysalo commented 12 years ago

Point by point:

> I see your hack code only for entities, not for any other types.

It's my understanding that the part of the protocol passing "entities" is the only one containing references to text spans (events refer to spans only through the "entity" that is their trigger). If there are other references, these should be modified also.

> server crash

Thanks for noting this. I think I may have squashed this bug now, please pull.

> Where would arcs go from

Good question. To me, the obvious choices would be either the nearest span (e.g. tokenwise, with spans to the right "winning" in ties) or always a specific one (e.g. the rightmost, which would nicely resolve to the head for base NPs in English, a common case). I think either of these should be OK, although there may be some cases I can't think of right now that would clearly resolve this in favor of one or the other.
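
To make the two candidate rules concrete, here is a minimal Python sketch. The function names are hypothetical, distance is measured in characters rather than tokens for simplicity, and fragments are assumed to be sorted left to right; it is only meant to show how the rules differ, not how the visualizer would implement them.

```python
def nearest_fragment(fragments, other_start, other_end):
    """Index of the fragment closest to another span; ties go to the right."""
    def gap(frag):
        start, end = frag
        if end <= other_start:   # fragment lies to the left of the other span
            return other_start - end
        if start >= other_end:   # fragment lies to the right of the other span
            return start - other_end
        return 0                 # fragment overlaps the other span

    best = 0
    for i, frag in enumerate(fragments):
        # "<=" lets a later (further right) fragment win a tie.
        if gap(frag) <= gap(fragments[best]):
            best = i
    return best


def rightmost_fragment(fragments):
    """Index of the rightmost fragment, regardless of the other span."""
    return len(fragments) - 1


parts = [[0, 6], [16, 21]]              # "Alpha-" and "actin"
print(nearest_fragment(parts, 22, 26))  # arc partner "were": 1
print(nearest_fragment(parts, 7, 10))   # arc partner "and": 0
print(rightmost_fragment(parts))        # always 1
```

Under the nearest rule, the attachment point depends on where the other end of the arc is, whereas the rightmost rule always picks the same fragment.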

Your suggestion (separate extra span) hadn't occurred to me, and I'm not convinced this will look good (or easily understandable) in many cases, but let's consider this as one alternative also. My worry is that this introduces a lot of extra elements (one span + as many arcs as there are partial spans) and would look complicated even for the basic cases, potentially becoming a mess in complex ones.

> I don't want to connect the lower corners, since it would either look ugly, or would necessitate that those arcs go under the boxes

I don't quite see the issue. My idea was that the arcs would go something like this in the simple case (my mockup is admittedly a bit ugly, but I don't think this layout needs to be):

and would go around the boxes like any other arc in cases where there's stuff in between. What's the ugly-or-under case?

amadanmath commented 12 years ago

> It's my understanding that the part of the protocol passing "entities" is the only one containing references to text spans (events refer to spans only through the "entity" that is their trigger). If there are other references, these should be modified also.

Triggers (.triggers) are separate from entities (.entities); and then there's equivs (.equivs). All of these define offsets in the protocol.

I will have to completely rework the way this is done on the client side anyway, so don't worry about that; but you'll either have to tell me that equivs and triggers can't be discontinuous (which seems like dangerous ground), or change the server side to comply.

> I think I may have squashed this bug now, please pull.

I can't see a commit. Have you pushed?

> the nearest span

As I said, that's potentially very confusing, since an arc to the left and an arc to the right would attach to different elements. I'm fine with the "always the rightmost" rule (possibly making it configurable to allow leftmost).

> potentially becoming a mess in complex ones.

Yeah, I hadn't considered the tower case (distributional meaning, with a tower of annotations of the same span stacking on top of each other). In this case the subspans should only be shown once. Still, the "rightmost" solution is probably simpler to implement, and I can't see a downside. I was basically just asking for clarification/specification, and throwing out ideas.

>> I don't want to connect the lower corners, since it would either look ugly, or would necessitate that those arcs go under the boxes

> I don't quite see the issue.

The problem comes up when you consider annotating "and" as well. But I think I actually may have a solution. Right now we're doing registrations for the span space, which allows us to nicely "pack" span boxes. If we reserve all the space between them, the "and" annotation in this example would be pushed to the next row up. It would also help legibility (since there would be no intervening elements in the same row).
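
A minimal Python sketch of the reservation idea, under the assumption that row packing is a simple greedy placement (the actual layout code is more involved, and `assign_rows` is a hypothetical name): each span reserves everything from its first start to its last end, so anything falling in the gap of a discontinuous span is pushed up a row. The outcome depends on placement order; here the discontinuous span is placed first.

```python
def assign_rows(spans):
    """Greedy packing: place each span in the lowest row where its reserved
    extent (first start to last end) overlaps nothing already placed there."""
    rows = []        # rows[i] holds the (start, end) extents reserved in row i
    placement = {}
    for span_id, offsets in spans:
        start, end = offsets[0][0], offsets[-1][1]
        for row_index, reserved in enumerate(rows):
            if all(end <= s or start >= e for s, e in reserved):
                break            # this row has room for the whole extent
        else:
            rows.append([])      # no existing row fits; open a new one
            row_index = len(rows) - 1
        rows[row_index].append((start, end))
        placement[span_id] = row_index
    return placement


# Text: "Alpha- and beta-actin were studied."
spans = [
    ("T1", [[0, 6], [16, 21]]),  # discontinuous "Alpha-" ... "actin"
    ("T3", [[7, 10]]),           # "and", inside the gap of T1
    ("T4", [[22, 26]]),          # "were", outside T1's extent
]
print(assign_rows(spans))  # {'T1': 0, 'T3': 1, 'T4': 0}
```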

spyysalo commented 12 years ago

Equivs define offsets? What madness is this? (... I'll have a look.)

> Still, the "rightmost" solution is probably simpler to implement, and I can't see a downside.

OK, this is fine by me, I don't expect we'll be getting any complaints about this.

I like the suggestion of reserving the intervening space; this should work nicely in common cases.

spyysalo commented 12 years ago

I'm not seeing offsets in either events or equivs in the JSON, which is how I remembered the protocol. Do you intend some other part of the system? In any case, both "normal" entity and trigger spans should be allowed to be discontinuous (i.e. multiple spans), tokens and sentences can be assumed to be continuous, and I don't think any other part of the representation should refer to text offsets.

amadanmath commented 12 years ago

> I'm not seeing offsets in either events or equivs

I didn't say events, I said triggers.

"triggers": [["T11", "Methylation", 540, 541]]

Can you do the serverside magic?

You're right about equivs, I misremembered - they do just list the IDs. Sorry about that.

spyysalo commented 12 years ago

Ah, sorry, I forgot those were in separate dicts. It's already done, though; just try the following:

.txt

Alpha- and beta-actin were studied.

.ann

T1  Binding 27 30;31 33 stuie
E1  Binding:T1

The server already sends them as intended:

triggers: [[T1, Binding, [[27, 30], [31, 33]]]]

The client seems to crash, though.

amadanmath commented 12 years ago

I can't check; the server is still crashing where I am, and you didn't respond when I asked whether you had pushed. The last commit on the https://github.com/nlplab/brat/commits/discont branch is from Apr 14. Maybe I'm looking in the wrong place?

The client would be crashing because, as I said before, the hack is only implemented for .entities, not for .triggers - so they still expect two numbers, not an array of pairs. Don't worry about the client now, just help me get the server to work :p

spyysalo commented 12 years ago

No, that's where I at least thought I was pushing to, just a sec ... OK, I think I'm still connected to the old repo. Fixing.

spyysalo commented 12 years ago

I think I managed to push to the new repo now. Sorry about that, my bad.

amadanmath commented 12 years ago

Ugh. Sorry, this is slow going. The changes are extensive, some places are still not implemented, and everything fell apart and now I have to do a bughunt again. It will take some more time :(

amadanmath commented 12 years ago

While I mess around with the visualiser, please take a moment to think about what it will mean for the annotator. What should be sent for the createSpan action? What should be sent when an already discontinuous span is edited, and e.g. its type changed? What should happen if someone hits reselect on a discontinuous span?

One more thing: what should be the text representation of the discontinuous span? For now, I'll just space-separate the fragments; but "Alpha-" + "actin" gives "Alpha- actin", which is a) not nice, 2) doesn't work as well as a search term, π) doesn't work for CJK.
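
For reference, a small Python sketch of the options using the test document from earlier in this thread; `span_text` is a hypothetical helper, and the "[...]" separator is just the gap marker suggested above, not an agreed convention.

```python
def span_text(doc, offsets, separator=" "):
    """Join the text of each fragment with the given separator."""
    return separator.join(doc[start:end] for start, end in offsets)


doc = "Alpha- and beta-actin were studied."
offsets = [[0, 6], [16, 21]]
print(span_text(doc, offsets))           # Alpha- actin
print(span_text(doc, offsets, "[...]"))  # Alpha-[...]actin
print(span_text(doc, offsets, ""))       # Alpha-actin
```

An empty separator would suit CJK text and this particular example, but it would run words together wherever the fragments are whole space-delimited words.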

spyysalo commented 12 years ago

@amadanmath: You'll be pleased to know I've already thought extensively about this :-)

> What should be sent for createSpan action?

Same as always. As continuous spans are expected to be by far more common than discontinuous ones, I'm proposing to keep the continuous span annotation as is and to make creating a discontinuous one a two-step process: first create a continuous span, and then either add to it (union) or cut away from it (subtraction), functions accessible from the span menu.
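
The union and subtraction steps could look roughly like this on lists of [start, end) offset pairs. This is a Python sketch with hypothetical function names, and it assumes that fragments which overlap or touch are merged back into one.

```python
def normalize(offsets):
    """Sort fragments and merge any that overlap or touch."""
    merged = []
    for start, end in sorted(offsets):
        if start >= end:
            continue  # drop empty fragments
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def union(offsets, start, end):
    """Add the range [start, end) to the span."""
    return normalize(offsets + [[start, end]])


def subtract(offsets, cut_start, cut_end):
    """Cut the range [cut_start, cut_end) away from the span."""
    result = []
    for start, end in offsets:
        if start < cut_start:   # keep the part left of the cut
            result.append([start, min(end, cut_start)])
        if end > cut_end:       # keep the part right of the cut
            result.append([max(start, cut_end), end])
    return normalize(result)


# Text: "Alpha- and beta-actin were studied."
span = [[0, 21]]               # continuous "Alpha- and beta-actin"
span = subtract(span, 6, 16)   # cut away " and beta-"
print(span)                    # [[0, 6], [16, 21]]
print(union([[0, 6]], 16, 21)) # [[0, 6], [16, 21]]
```

Either route ends at the same discontinuous span, which is the point of offering both operations.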

> What should be sent when an already discontinuous span is edited, and e.g. its type changed?

This one is easy: a single span (even if discontinuous) only has one type, and this changes. Same goes for most other modifications (change attributes etc.). The parts of a discontinuous span are in no way independent annotations, and any editing functionality that doesn't directly involve the parts (i.e. the set of character offsets belonging to the span) should not be affected by the removal of the assumption that the characters belonging to an annotation form a continuous span.

> What should happen if someone hits reselect on a discontinuous span?

Same as create, i.e. you initially get a new continuous span, which you are then free to modify again to get back to a discontinuous span. (This is expected to be much rarer yet than creating one in the first place, so I see no reason to optimize the number of clicks here.)

amadanmath commented 12 years ago

It's obvious what to do with types. I was talking about what to do with the protocol around start and end, since that can't continue as is. You guys need to add server-side support for discontinuous createSpan, probably by changing start and end into offsets just like you did in getDocument, because at this point I can only create and edit continuous spans, even though discontinuous ones can be displayed. Also, see the update on my previous comment (sorry).

spyysalo commented 12 years ago

OK, you meant the API level. Gotcha. Yep, I can do the analogous changes to the serverside, just let me know when you need them.

amadanmath commented 12 years ago

Okay, I pushed what I did so far into the discont branch. There's some weirdness with chunk spacing that I haven't tracked down yet, but most of the things should work, I think (aside from, obviously, editing discontinuous spans). I haven't done the connecting line yet, either, but I think that should be a cakewalk in comparison. Give it a whirl, and yell if you find any more bugs. (You probably will. :p )

spyysalo commented 12 years ago

Closing, basic support is now done (discont branch) and we have more specific issues for what remains.