proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl, this contains higher-level tools that use the library as well as the full documentation, validation schemas, and set definitions
http://proycon.github.io/folia/
GNU General Public License v3.0
60 stars 10 forks source link

Random results from foliavalidator and folia2txt #100

Open kosloot opened 3 years ago

kosloot commented 3 years ago

given this file: buggy.txt

foliavalidator and folia2text give random outcome:

> foliavalidator buggy.txt 
Validated successfully: buggy.txt
> foliavalidator buggy.txt 
VALIDATION ERROR on full parse by library (stage 2/3), in buggy.txt
ParseError: FoLiA exception in handling of <p> @ line 35 (in parent <text> @ parent line 34) : [InconsistentText] Text for <Sentence at 139872518657136 id=KB.DDD17C.NicolineCrowd.73272.p.1.s.5 set=None class=None>, is inconsistent: EXPECTED (deep text after normalization) *****>
Den Primo Vizir heeft aen de Bassa, die in Ongaryen zijn, orde gesonden, om een Corps van de beste Troupen in Buda te brengen, en al de , welcke in die Stadt leggen daer uyt te trecken.
****> BUT FOUND (strict text after normalization) ****>
Den Primo Vizir heeft aen de Bassa, die in Ongaryen zijn, orde gesonden, om een Corps van de beste Troupen in Buda te brengen, en al de nieugeworvene, welcke in die Stadt leggen daer uyt te trecken.
******* DEVIATION POINT:  en al de <*HERE*>nieugeworv
(also checked against older rules prior to FoLiA v2.4.1)
kosloot commented 3 years ago

I can reproduce with a much smaller file too: buggy2.txt

proycon commented 3 years ago

I'm not sure whether the example file is a real-world example or just an example to illustrate this issue, but I see two problems:

  1. correction KB.DDD17C.NicolineCrowd.73272.p.1.s.1.correction.1 corrects only the N. It would make more sense to correct all the previous words. In the current way it is expressed, these words are not removed! (but they have no text representation in the TICCL class). If I fix this I get https://download.anaproy.nl/buggy3.xml which does pass the validation.
  2. Though allowed, it is a bit confusing to use the default current class when it is not in fact the 'current' most authoritative text representation. As I argued before, I'd recommend to make the ticcl class the current one (assuming this is the text representation that is deemed most current/correct and should be used as a basis for further processing), and assign something like an OCR or original class to the other text representation.
proycon commented 3 years ago

I'm still a bit puzzled what caused the actual randomness though..

proycon commented 3 years ago

Actually, I think I the randomness may have been caused by the unordered map from which text classes were read, that order was not guaranteed and behaviour differed based on which was checked first..

kosloot commented 3 years ago

Ok, buggy2.txt was just a made-up excerpt. And the use of class-names is questionable, but still: I think it is valid FoLiA, and foliavalidator should be deterministic

proycon commented 3 years ago

Yes, the non-deterministic quirk is definitely not good. I did commit a fix for that.

kosloot commented 3 years ago

But the fix isn't released, so I wonder if that helps @martinreynaert using LaMachine?

proycon commented 3 years ago

released now

kosloot commented 3 years ago

I am quite sure this is NOT solved completely yet. Given this file: incorrect.txt I really think foliavalidator is wrong here:

foliavalidator incorrect.txt VALIDATION ERROR on full parse by library (stage 2/3), in incorrect.txt ParseError: FoLiA exception in handling of

@ line 35 (in parent @ parent line 34) : [InconsistentText] Text for <Sentence at 139728228001272 id=KB.DDD17C.NicolineCrowd.73272.p.1.s.3 set=None class=None>, is inconsistent: EXPECTED (deep text after normalization) *> Malvasie , > BUT FOUND (strict text after normalization) > Malvasie komende, *** DEVIATION POINT: Malvasie <HERE>komende, (also checked against older rules prior to FoLiA v2.4.1)

kosloot commented 3 years ago

Addition: I tried to rename the <new> node to <current>. Which I thought might solve the problem, but then:

> foliavalidator notsilly.xml 
VALIDATION ERROR on full parse by library (stage 2/3), in notsilly.xml

ParseError: FoLiA exception in handling of <correction> @ line 43 (in parent <s> @ parent line 36) : [Exception] Can't add 
Original item to Correction if there is a Current item

Which puzzles me too.

proycon commented 3 years ago

I really think foliavalidator is wrong here

Agreed, the document looks okay, the validator seems to miss a word entirely when getting the deep text..

I tried to rename the node to . Which I thought might solve the problem, but then: Original item to Correction if there is a Current item

That's a legit error, Original can only come with New, not with Current (Current (and New) can come with Suggestion).

kosloot commented 3 years ago

That's a legit error, Original can only come with New, not with Current (Current (and New) can come with Suggestion).

Ok, I will add this limitation to libfolia too.

proycon commented 3 years ago

There was already a failing test case I missed that covers this issue actually, the release was a bit too premature...

kosloot commented 3 years ago

Ok, I will add this limitation to libfolia too.

Done. Needed quite a bit of rework, unfortunately

proycon commented 3 years ago

This should now be solved, the tests are all green again and the example you gave also validates. Has been released as foliapy v2.5.3.

kosloot commented 3 years ago

Still, I assume something is wrong. see: special.txt

Here the (unlogical) correction from class="current" to class="Ticcl" is made. I still assume the FoLiA is correct, but foliavalidator rejects it. folialint accepts this construction.

proycon commented 3 years ago

yes, you're right, there's still some assumption that the "current" class is always the latest/most-current class. I'll see if we can get rid of that without breaking anything else.

kosloot commented 3 years ago

I'm not sure about the Python implementation, but libfolia checks the text for every textclass in the document. So in this case both the implicit "current" and "Ticcl".

proycon commented 3 years ago

yeah, foliapy does the same