inception-project / inception

INCEpTION provides a semantic annotation platform offering intelligent annotation assistance and knowledge management.
https://inception-project.github.io

Index out of range using WebAnno 3.2 with leading space #2060

Closed - david-waterworth closed this issue 3 years ago

david-waterworth commented 3 years ago

I'm trying to represent multiple fields in the annotation interface. My current attempt involves creating two NER layers, one which is pre-annotated with spans representing each field and the second for the actual annotation. The actual text is the concatenation of the fields i.e. "field1field2field3"

I found that the interface only inserts whitespace between annotated spans if it has to (i.e. if the label is longer than the span), so I inserted a space between the fields, i.e. "field1 field2 field3". I tried tabs (\t) but that didn't seem to work.

The problem I'm running into is that I get an index-out-of-range error, or the text is displayed incorrectly, when field1 is NULL (i.e. the text contains a leading space). For example, the file below errors due to sentence 2; if I remove it, there's no issue. The error is "Error: String index out of range: 19" - it seems the #Text is being trimmed, which messes up the offsets.

Is there a more robust file format you recommend which is easy to use from Python? I had a brief look at some of the others but they seem to have a steep learning curve. I have text along with tokens and offsets from a huggingface tokeniser.

#FORMAT=WebAnno TSV 3.2

#Text=ActFlow analog-value
1-1 0-3 Act
1-2 3-7 Flow
1-3 8-14 analog
1-4 14-15 -
1-5 15-20 value

#Text= ActFlow analog-value
2-1 1-4 Act
2-2 4-8 Flow
2-3 9-15 analog
2-4 15-16 -
2-5 16-21 value

david-waterworth commented 3 years ago

Also, the way the token offsets are processed seems odd; the fragment below works fine:

#Text=~ActFlow~analog-value
1-1 0-1 ~
1-2 1-4 Act
1-3 4-8 Flow
1-4 8-9 ~
1-5 9-15 analog
1-6 15-16 -
1-7 16-21 value

But if I remove the lines corresponding to the ~ character, the text is displayed as "~ActFlow~analog-valu" and the tokens aren't correctly aligned with the text (i.e. the first token is "~Ac", the second "tFlo", etc.)

#Text=~ActFlow~analog-value
1-1 1-4 Act
1-2 4-8 Flow
1-3 9-15 analog
1-4 15-16 -
1-5 16-21 value
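
A small sketch of what seems to be happening here, consistent with the diagnosis later in the thread (the importer assumes the first token's begin offset coincides with the first character of the #Text line):

# with the "~" rows removed, the first token now begins at offset 1, so the
# renderer effectively shifts the whole sentence left by one character
text = "~ActFlow~analog-value"
tokens = [(1, 4), (4, 8), (9, 15), (15, 16), (16, 21)]  # from the TSV above
shift = tokens[0][0]  # inferred start of the #Text line
for begin, end in tokens:
    print(text[begin - shift:end - shift])  # prints "~Ac", "tFlo", ...
print(text[:tokens[-1][1] - shift])  # "~ActFlow~analog-valu" - the final "e" is lost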

reckart commented 3 years ago

Off the top of my head, I believe your data should fit these rules, so I suspect there might be a bug in the case that the text starts with a space character. Need to check that...

These rules apply independently of the file format. The other format INCEpTION supports that has good Python support is UIMA CAS XMI; you can work with these files using the DKPro cassis library. XMI is the format we usually recommend.

david-waterworth commented 3 years ago

Thanks. UIMA CAS looked like it had a pretty steep learning curve, but I managed to get it working more easily than I expected. I've ended up creating a custom span layer containing read-only annotations for each field rather than trying to use some sort of separator. It works fairly well, although it would be nice if I could better control the whitespace (e.g. using some sort of tab alignment) - i.e. have a space between the second and third grey tokens in the image below, and vertically align the start of the text in each row.

[screenshot of the annotation view with the custom field spans]

Are there any plans to relax the bulleted requirements? They seem to imply that the tokenisation is destructive when there are leading/trailing spaces - the offsets aren't into the original string, they're into a trimmed string?

reckart commented 3 years ago

The tokenization is not destructive. The idea is just that the begin/end offsets of tokens and sentences are placed such that they point to positions where there are characters in the original text, not to positions where the original text contains spaces.

I believe the issues are mainly due to the brat-based visualization and could probably be relaxed/fixed. That said, having tokens start/end with whitespace can also lead to other oddities. E.g. if you consider recommenders, they might learn that _BLAH (where _ represents a space character) and BLAH are different things, while it is more likely that it is just bad tokenization. So I'd be hesitant to relax these requirements, as I think they make a lot of sense.

david-waterworth commented 3 years ago

Yeah, non-destructive is probably the wrong word. I agree that it's not ideal for tokens to start/end with whitespace, and I don't think that should be changed. Where things could be relaxed, though, is the requirement that sentences must start/end on token boundaries - but I'm not really sure what the impact of that would be. Ideally (for me) the UI would display the sentence verbatim (including control characters like tab) but only allow annotations to be anchored to tokens; I don't really see a need for the first and last displayed characters to be the first character of the first token and the last character of the last token. Huggingface's BERT tokenizer, for example, strips leading/trailing spaces but returns offsets into the original text. So I need to add a post-processor that trims the input text to align with the start offset of the first token and the end offset of the last, and then realigns the tokens (or pre-trims the strings, but I'd prefer something that works regardless of what the tokeniser considers a token and what is non-textual).

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# use the fast (Rust) backend directly to get an Encoding with character offsets
encoding = tokenizer.backend_tokenizer.encode(' test ', add_special_tokens=False)
for token, offsets in zip(encoding.tokens, encoding.offsets):
    print(f"{token}\t{offsets}")

test (1, 5)
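
For reference, a minimal sketch of that post-processor (a hypothetical helper, not part of transformers; it assumes (begin, end) offset pairs like the ones printed above):

def trim_to_tokens(text, offsets):
    # shift everything so the first token starts at 0 and the text
    # begins/ends exactly on token boundaries
    start, end = offsets[0][0], offsets[-1][1]
    trimmed = text[start:end]
    realigned = [(b - start, e - start) for b, e in offsets]
    return trimmed, realigned

trimmed, realigned = trim_to_tokens(' test ', [(1, 5)])
# trimmed == 'test', realigned == [(0, 4)]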

But at a minimum, there's an issue that these rules aren't enforced at import. I had cases where I could import an invalid TSV file, annotate it, and re-export it - the token start/end offsets were modified (incorrectly). In other cases I got the index-out-of-range exception, which is better, I guess.

reckart commented 3 years ago

Validating the data more thoroughly during import is a good idea.
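
For illustration, a rough sketch of such a check - not INCEpTION's actual import code; it assumes the token rows have already been parsed into (id, begin, end, form) tuples with offsets relative to the sentence #Text, as in the files above:

def check_sentence(text, rows):
    # verify every token's offsets actually cover its form in the #Text
    problems = []
    if rows and rows[0][1] != 0:
        problems.append("first token does not start at the start of #Text")
    for tok_id, begin, end, form in rows:
        if end > len(text):
            problems.append(f"{tok_id}: end offset {end} is past the end of the text ({len(text)})")
        elif text[begin:end] != form:
            problems.append(f"{tok_id}: text[{begin}:{end}] is {text[begin:end]!r}, not {form!r}")
    return problems

# sentence 2 from the original report:
print(check_sentence(" ActFlow analog-value",
                     [("2-1", 1, 4, "Act"), ("2-2", 4, 8, "Flow"),
                      ("2-3", 9, 15, "analog"), ("2-4", 15, 16, "-"),
                      ("2-5", 16, 21, "value")]))
# -> ['first token does not start at the start of #Text']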

reckart commented 3 years ago

> Ideally (for me) the UI would display the sentence verbatim (including control characters like tab) but only allow annotations to be anchored to tokens; I don't really see a need for the first and last displayed characters to be the first character of the first token and the last character of the last token.

You can switch the UI from "brat (sentence-oriented)" to "brat (line-oriented)". That should then also display leading whitespace in front of sentences, but the sentence itself would still use the first token character as its start position. Note that, depending on your environment, tab characters are interpreted to have different widths, so what looks aligned in your text document may not be aligned anymore in the UI. Also, tabs in the brat visualizations are rendered with a fixed width, not up to a particular tab stop.

> Huggingface's BERT tokenizer, for example, strips leading/trailing spaces but returns offsets into the original text.

I didn't quite get that. If you tokenize ' test ' and you get back test (1, 5), that looks ok, no? I think we have no problem with that - at least we should not - and if we do it would seem to be a bug. You just need to make sure that the sentence enclosing these tokens snaps to the begins/ends of the tokens inside it.

david-waterworth commented 3 years ago

> I didn't quite get that. If you tokenize ' test ' and you get back test (1, 5), that looks ok, no? I think we have no problem with that - at least we should not - and if we do it would seem to be a bug. You just need to make sure that the sentence enclosing these tokens snaps to the begins/ends of the tokens inside it.

I might be misunderstanding you, but it doesn't seem to be supported for TSV 3.2, since the #Text field cannot start/end with whitespace and the tokens must be aligned with the #Text. So in the example below I need to retokenise to (0,4) to align with "Test", or it messes up the rendering, showing "Tes":

#FORMAT=WebAnno TSV 3.2

#Text= Test
1-1 1-5 Test

[screenshot showing the text rendered as "Tes"]

It seems to be fine using XMI, since you can specify the sentence to be aligned with the first/last token (as I think you're saying), as follows:

cas.sofa_string = " Test "
cas.add_annotations([Token(begin=1, end=5)])
cas.add_annotations([Sentance(begin=1, end=5)])

This doesn't display the whitespace in "brat (sentence-oriented)" but does in "brat (line-oriented)" as you mentioned earlier. Both cases show the correct text.

I guess this is predominantly a limitation of TSV, for which the workaround is to use XMI? I actually found it quite hard to find a good example of how to generate an XMI from scratch using Python, though it wasn't really all that hard once I figured it out for myself.

reckart commented 3 years ago

I think it is a bug in the TSV code that needs to be investigated.

@jcklie Is there a recommended example for how to create an XMI from scratch using Cassis?

david-waterworth commented 3 years ago

It doesn't have to be complicated. The main confusion for me initially came from what TypeSystem.xml is and where to get the correct one, and then what the type names for the Token and Sentence types are. After that, the mechanics of adding the text, token, and sentence annotations were fairly simple, i.e.:

from cassis import *

# the type system declares the annotation types (Token, Sentence, ...)
with open('TypeSystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)

cas = Cas(typesystem=typesystem)
cas.sofa_string = " Test "

Token = typesystem.get_type('de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token')
Sentance = typesystem.get_type('de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence')

# sentence and token snap to the non-whitespace span of the sofa string
cas.add_annotations([Token(begin=1, end=5)])
cas.add_annotations([Sentance(begin=1, end=5)])

cas.to_xmi("test.xmi")
jcklie commented 3 years ago

@reckart The cassis documentation has this. If you want something more INCEpTION-specific, then we would need to write a Colab notebook and link it on the INCEpTION example page.

jcklie commented 3 years ago

ariadne also has e.g. this, which can be adapted.

david-waterworth commented 3 years ago

I based my example on the cassis doc, but my confusion was that it both loaded a typesystem file, which I didn't have, and added annotations to an existing document.

reckart commented 3 years ago

So having had a look at this, it seems the problem is that the code expects that the first character of the #Text=XXX line coincides with the first character of the first token.

The #Text lines do not come with offset information; the begin offset of the #Text is inferred from the first token. While for the very first #Text line in a document we could assume its offset to be 0, from the second sentence onwards we start running into potential ambiguities.

So in that respect, we have a problem e.g. with

#Text= ActFlow analog-value
2-1 1-4 Act _ _

because suddenly the assumption that the begin offset of the first token points to the start of the #Text is wrong - it points to the second character in the #Text.

So I believe the documentation needs to be extended to say that the begin offset of the first token in a sentence must point to where the string covered by the #Text lines starts in the original document.

The other question is: is there a way to make the format a bit more robust to your situation, e.g. by trimming the #Text lines at the start? But that would only work for the first #Text line of a sentence. If a sentence contains a line break, it is split over multiple #Text lines, and the token offsets need to point exactly into the string that can be concatenated from these lines:

#Text=A
#Text= B
1-1 0-1 A _ _
1-2 3-4 B _ _

So in this case, if we trimmed the second #Text line, it would break the offsets of the B token.
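
To make the offset arithmetic concrete, a small sketch (assuming the two #Text lines are joined with a single line break):

# the covered string is the concatenation of the #Text lines, line break included
text = "A" + "\n" + " B"
assert text[0:1] == "A"  # token 1-1 at 0-1
assert text[3:4] == "B"  # token 1-2 at 3-4
# trimming the leading space from the second line would leave 3-4 pointing past the end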

So... I think it is something that must be documented but which cannot be properly fixed without introducing explicit offset information for #Text lines.