Text highlights for non-space separated tokens

spyysalo commented 13 years ago

In the current implementation, if a token has multiple annotations, all text between these annotations will be highlighted, and the highlight takes the color from just one of the annotations.

This is particularly noticeable in cases like "p53-mediated", where "p53" is annotated as an entity and "mediated" as an event trigger.

This is not high priority but does reflect on the perceived quality of the visualization.

ghost commented 13 years ago

This is now becoming a real issue as we attempt to support Japanese. @amadanmath suggested server-side insertion of UTF-8 spaces into the text. I disagree with this approach: this is really a presentation issue and thus does not belong server-side. Instead, I find it more reasonable for this logic to reside on the client, where looking at the textbound annotation boundaries should let us resolve the issues for both Japanese and non-Japanese text. I hope we can have some sort of discussion about this here in this issue.

https://picasaweb.google.com/pontus.stenetorp/BratScreenshots?authkey=Gv1sRgCKGfid3qwsK0Ag#5606075265076651378

amadanmath commented 13 years ago

As explained in a private chat, my solution of using zero-width spaces (U+200B) is not meant for this issue, but for a larger one that arises when dealing with Japanese (and any other space-less script): all annotations in a sentence get folded into a single chunk, which causes UFO-catcher annotations and, even more seriously, prevents line breaks (as the purpose of the chunks is to prevent the insertion of arbitrary spacing inside words). Example: one can annotate "suicide" as:

sui [Person] <--Theme-- cide [Action]

which does not warrant splitting up the word with a huge space or a newline just to allow for the insertion of the arc. Thus, UFO-catchers. Japanese (and Korean, and Chinese...) do not customarily write spaces; spaces can be inserted, but again not arbitrarily inside words. For instance, the very same argument above can be made for 自殺; we can insert spaces before and after it, but not between 自 and 殺, and expect one to preserve the illusion of reading something that is almost like the original text.

Now, I can't decide word breaks on the client side. It is not a job for JS to determine what language the text is in, and we can't implement in JS a word segmentation algorithm for every language that might need it. Thus, I need word segmentation info from the server.

Zero-width spaces are a fairly standard solution: they will work even if the text is pasted somewhere else, and stripping them out is rather easy if we really need them gone. Inserting them is a somewhat bigger problem, but we don't need to implement that as part of brat. As I said, we can't solve this for all languages. We can instead make it part of the specification for the input format.
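
To make the zero-width-space idea concrete, here is a minimal sketch in Python. It assumes the word segmentation is already available as a list of words (produced by some external tool, not by brat), and the function names are hypothetical rather than anything in the brat codebase.

```python
import re

ZWSP = "\u200b"  # ZERO WIDTH SPACE


def insert_zwsp(words):
    """Join pre-segmented words with zero-width spaces.

    The segmentation itself is assumed to be given as input.
    """
    return ZWSP.join(words)


def strip_zwsp(text):
    """Remove every zero-width space, recovering the unsegmented text."""
    return text.replace(ZWSP, "")


def split_chunks(text):
    """Split text into chunks at ordinary whitespace or zero-width spaces."""
    return [c for c in re.split(r"[\s\u200b]+", text) if c]
```

For example, `split_chunks(insert_zwsp(["彼", "は", "自殺", "した"]))` gives the four words back as separate chunks, while 自殺 itself is never broken. Since inserting characters shifts character offsets, annotations would presumably have to be made against the text that already contains the zero-width spaces.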

All this is only incidentally related to this particular issue (since @ninjin misunderstood what exactly I was fixing with ZWS). This issue is also a consequence of having chunks. I seem to remember asking a long time ago what colour the highlight should be if two spans overlap, and being told that one would have priority and that both spans' extent would be coloured in one colour. The current behaviour is an (admittedly sub-par) abstraction of that solution: it does not check whether the spans overlap, only whether they are in the same chunk.
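
The difference between the two rules can be sketched as follows. This is an illustration in Python, not the visualizer's actual code; spans are assumed to be half-open `(start, end)` character offsets, and both function names are hypothetical.

```python
def chunk_highlight(spans):
    """Per-chunk rule: one region covering every span in the chunk."""
    start = min(s for s, _ in spans)
    end = max(e for _, e in spans)
    return [(start, end)]


def overlap_highlights(spans):
    """Per-overlap rule: merge only spans that actually overlap."""
    merged = []
    for start, end in sorted(spans):
        if merged and start < merged[-1][1]:
            # Overlaps the previous region, so extend that region.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

For "p53-mediated" with "p53" at `(0, 3)` and "mediated" at `(4, 12)`, the per-chunk rule yields the single region `(0, 12)`, which also covers the hyphen, while the per-overlap rule keeps the two spans separate.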

spyysalo commented 13 years ago

It seems to me the Japanese-related issues were addressed without resolving the underlying bit about multiple annotations for a token. I'm reassigning this to the visualizer rewrite.

amadanmath commented 13 years ago

Back to the issue at hand. I have removed the logic that makes the highlight the largest unbroken region covering all the spans in a chunk. However, I expect some issues from overlapping transparent regions. Lacking a specification, I'll leave it like this. Could someone please check whether this is satisfactory?
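
One likely effect of overlapping transparent regions can be sketched with standard source-over alpha compositing: stacked semi-transparent layers combine into a more opaque result, so the overlap looks darker than either highlight alone. The opacity values below are made up for illustration, and whether the visualizer's highlights composite exactly this way is an assumption.

```python
def stacked_opacity(alphas):
    """Combined opacity of layers drawn on top of each other.

    Each layer with opacity a lets a fraction (1 - a) of what is
    beneath it show through, so the fractions multiply.
    """
    remaining = 1.0
    for a in alphas:
        remaining *= 1.0 - a
    return 1.0 - remaining
```

Two highlights at opacity 0.5 each would give a combined opacity of about 0.75 where they overlap, which is one reason a rule for overlaps would still need to be specified.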

nlplab / brat

Text highlights for non-space separated tokens #52