Adding multiple text content elements in FLAT

pirolen commented 1 year ago

I wonder if FLAT supports adding multiple text content elements (https://folia.readthedocs.io/en/latest/text_annotation.html#text-annotation). I have a use case for it: there are several versions of a historical text: the source text, a later version of the text, and their normalized orthography, as well as OCR output.

Bringing together all these layers (on the token level) programmatically is not truly possible. (I tried a bit but the file got big and FLAT had a gateway error).

The idea would be that users can enter the words for the different textclass layers via FLAT.

So I prepared a one-liner based on the example in the documentation (<t>Hello. This is a sentence. Bye!</t> etc.); please see the attachment. It is not yet tokenized, to keep it simple, and I also thought that in this use case the users could enter bigger chunks of text (e.g. sentences).

This toy file renders OK in FLAT, and I can change between the three different text contents using the Selector.
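
For readers without the attachment, here is a minimal Python sketch of what such a paragraph with several text classes might look like and how the layers can be listed. The fragment is a hypothetical reconstruction modelled on the documentation example, not the attached file; the class name "normalized" is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"

# Hypothetical fragment modelled on the documentation example (not the attached file).
FRAGMENT = f"""
<p xmlns="{FOLIA_NS}" xml:id="example.p.1">
  <t>Hello. This is a sentence. Bye!</t>
  <t class="ocroutput">Hell0 Th15 iz a sentence, Bye1</t>
  <t class="normalized">Hello. This is a sentence. Bye!</t>
</p>
"""


def text_by_class(element):
    """Map each text class of the direct <t> children to its text.

    A <t> element without a class attribute is treated as the default
    text class, "current".
    """
    result = {}
    for t in element.findall(f"{{{FOLIA_NS}}}t"):
        result[t.get("class", "current")] = "".join(t.itertext())
    return result


paragraph = ET.fromstring(FRAGMENT)
layers = text_by_class(paragraph)
for textclass, text in layers.items():
    print(f"{textclass}: {text}")
```

Each text class corresponds to one entry in FLAT's text class selector mentioned above.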

But if I try to add a correction to the content in one of the layers (which is thus untokenized), all other layers seem to get modified too (screenshot). And selecting the view of the other layers is not working well anymore (either shows nothing or shows the layer that was corrected). Maybe adding a correction in this way is not allowed?

I did not seem to get how to make FLAT accept a fully newly entered text, tied to an existing one -- either as correction or as new text content. Please advise if this would be possible (I tried the string-annotation, but entering a word throws an error, cf. screenshot). I am also somewhat unsure how to declare the annotation set that would enable this action.

Furthermore, FLAT is not able to revert to any of the previous document versions, it throws an error (screenshot).

Many thanks if you have the chance to look into this. I use the Docker version of FLAT.

bbb2.folia.xml.txt

pirolen commented 1 year ago

I did a bit more exploration, trying to see what happens if I have tokenized text. I ran ucto on the above XML file and could tokenize one layer of text content, but not further ones ("ucto:Difficult to tokenize 'bbb2.ocr-ucto.folia.xml' again, already processed by ucto before!"). So perhaps this use case is not viable?

Nevertheless, I attach a screenshot of trying to add new text content, e.g. to the token 'Sentence'. What is one supposed to enter in the dialog box? E.g. add a feature, where class is the text itself, plus a value for the feature subset, which needs to be declared in the set definition?

proycon commented 1 year ago

I wonder if FLAT supports adding multiple text content elements

Good question. I think it's a bit of a grey area where things quickly become unsupported, and as you found out, things quickly become buggy. You're touching or even crossing the limits of what FLAT is currently capable of, unfortunately.

But if I try to add a correction to the content in one of the layers (which is thus untokenized), all other layers seem to get modified too (screenshot). And selecting the view of the other layers is not working well anymore (either shows nothing or shows the layer that was corrected). Maybe adding a correction in this way is not allowed?

I think corrections only work on the token level; applying them on higher levels has never really been properly implemented in FLAT or even in FoLiA itself (even though it would technically be allowed).

So edits on sentences should always be direct (D), or (N) if you want to add text with a new text class, but definitely not corrections (C). However, there indeed seems to be a bug here: changing the text content for one layer changes them all.

I did not seem to get how to make FLAT accept a fully newly entered text, tied to an existing one -- either as correction or as new text content.

This too seems a clear bug in FLAT, I reproduced it: Adding a new "Text" only shows a field to select the text class (from the default set annotation), but the expected field for the actual text never shows.

Nevertheless, I attach a screenshot of trying to add new text content, e.g. to the token 'Sentence'. What is one supposed to enter in the dialog box? E.g. add a feature, where class is the text itself, plus a value for the feature subset, which needs to be declared in the set definition?

No, there should have been a text field. Adding a Feature like in the screenshot is definitely not what you want here. But I can't blame you for trying since the text field is missing and things are confusing enough ;)

(I tried the string-annotation, but entering a word throws an error, cf. screenshot).

No, don't use string-annotation; support for adding string annotations has never been implemented, and it's not what you need here anyway.

I am also somewhat unsure how to declare the annotation set that would enable this action.

You'd have to load a document with <text-annotation set=".."> explicitly set to your custom set. I don't think the interface allows adding a text annotation set.
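
As a rough illustration of preparing such a document outside FLAT, the following Python sketch adds a text-annotation declaration with a custom set to a document's annotation declarations. The set URL is a hypothetical placeholder, and the document skeleton is deliberately simplified (it would not pass FoLiA validation as-is); only the placement of the declaration is the point.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"
# Hypothetical custom set URL, for illustration only.
CUSTOM_SET = "https://example.org/sets/historical-text.foliaset.ttl"

# Simplified skeleton; a real FoLiA document carries more metadata.
DOCUMENT = f"""
<FoLiA xmlns="{FOLIA_NS}" xml:id="bbb2">
  <metadata>
    <annotations>
      <paragraph-annotation/>
    </annotations>
  </metadata>
  <text xml:id="bbb2.text">
    <p xml:id="example.p.1"><t>Hello. This is a sentence. Bye!</t></p>
  </text>
</FoLiA>
"""


def declare_text_set(root, set_url):
    """Add <text-annotation set="..."> to the declarations unless already present."""
    annotations = root.find(f"{{{FOLIA_NS}}}metadata/{{{FOLIA_NS}}}annotations")
    tag = f"{{{FOLIA_NS}}}text-annotation"
    for declaration in annotations.findall(tag):
        if declaration.get("set") == set_url:
            return declaration
    declaration = ET.SubElement(annotations, tag)
    declaration.set("set", set_url)
    return declaration


root = ET.fromstring(DOCUMENT)
declare_text_set(root, CUSTOM_SET)
declare_text_set(root, CUSTOM_SET)  # a second call does not add a duplicate
declared = [d.get("set") for d in root.iter(f"{{{FOLIA_NS}}}text-annotation")]
print(declared)
```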

As to your overarching question, "Please advise if this would be possible": currently things seem too broken for this to work. I think that if both bugs were solved, it would be possible to add/edit text content with multiple classes, but with the following constraints:

only in direct edit mode, no correction mode
editing sentences will only work when there is no underlying tokenisation (so it won't work on the ucto files), otherwise it introduces text consistency problems that FLAT can't handle. The reverse also holds: editing words would only work if there is no text on higher levels.
editing text content does not take into account any of the hyphenated breaks or any other markup that may be in it (it would get stripped away entirely). Markup elements can only be visualised in FLAT (to an extent), not edited.
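
The second constraint can be checked on a document before handing it to annotators. The sketch below is only an illustration of that rule, not FLAT's own logic; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"
W = f"{{{FOLIA_NS}}}w"  # word token
T = f"{{{FOLIA_NS}}}t"  # text content


def can_edit_structure_text(element):
    """Sentence/paragraph text is only safely editable without tokens below it."""
    return element.find(f".//{W}") is None


def can_edit_word_text(ancestors):
    """Word text is only safely editable if no ancestor carries its own text."""
    return all(ancestor.find(T) is None for ancestor in ancestors)


untokenized = ET.fromstring(
    f'<s xmlns="{FOLIA_NS}"><t>This is a sentence.</t></s>'
)
tokenized_with_text = ET.fromstring(
    f'<s xmlns="{FOLIA_NS}"><w><t>This</t></w><w><t>is</t></w><t>This is</t></s>'
)
tokenized_only = ET.fromstring(
    f'<s xmlns="{FOLIA_NS}"><w><t>This</t></w><w><t>is</t></w></s>'
)

print(can_edit_structure_text(untokenized))          # sentence-level edit is fine
print(can_edit_structure_text(tokenized_with_text))  # tokens below: not safe
print(can_edit_word_text([tokenized_with_text]))     # sentence has text: not safe
print(can_edit_word_text([tokenized_only]))          # only word-level text: fine
```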

So I kind of wonder if fixing these bugs will bring FLAT into a state that makes it useful enough for your use case. If not, then it may not be worth the effort anymore to try to fix them. What do you think?

I should add that the future of FLAT is very uncertain at this point: it's a fairly old and sufficiently complex codebase, and there are only a few users. FLAT is maintained and funded as part of the CLARIAH project (WP3), but that entire project is coming to an end this year, which will most likely put FLAT in End-of-Life/Deprecated status unless there's interest in a revival from another project (but I myself am even a bit skeptical whether that's still worth it).

pirolen commented 1 year ago

Ah, I see, very sad to hear that FLAT may become deprecated, since it is such a great support for enriching FoLiA documents. Do/will people in your projects use another annotation environment?

Depending on your capacities, if some of the things are debuggable, we would be happy to use FLAT further. We could also try if a developer here could contribute to the software. What do you think?

proycon commented 1 year ago

Depending on your capacities, if some of the things are debuggable, we would be happy to use FLAT further.

I can definitely look into the two bugs you found if that's enough for your use-case, but I do wonder if the constraints I mentioned are not too limiting?

We could also try if a developer here could contribute to the software. What do you think?

Contributions are of course always welcome, but the code-base isn't the most accessible I'm afraid, so it will be difficult. Ideally, the front-end code needs a proper rewrite (which I already suggested in #135 in 2018; the code is almost 10 years old now), but that's a huge project and not going to happen anymore.

Do/will people in your projects use another annotation environment?

My own preference has shifted to more lightweight solutions, whereas FLAT is a very comprehensive environment that tries to accommodate most of FoLiA (and FoLiA itself is quite comprehensive). This was by design and FLAT's greatest strength, but also its greatest weakness probably as things get complex quickly (as we notice in this issue) and FLAT is not an easily reusable component in other contexts, it's by definition married to FoLiA.

In the field (my view is limited though), I've seen simple solutions built on libraries like Recogito-JS, usually specific for a certain annotation task in a project. There's https://github.com/zenml-io/awesome-open-data-annotation which tries to keep a nice list of manual annotation tools (FLAT's in there too).

pirolen commented 1 year ago

Thank you very much. I am going to restrict the use case to FLAT's capabilities then, after debugging, and am going to ask a developer here to look into the frontend upgrade mentioned in the related issue.

pirolen commented 1 year ago

P.S. Are FoLiA and its tools going to be maintained after CLARIAH ends?

proycon commented 1 year ago

FoLiA is indeed funded from CLARIAH as well, so the same problem applies. I'm trying to at least ensure some limited funding for continued maintenance & support (excluding large feature developments) of FoLiA, Frog, ucto, to ensure basic continuity, but all that is unclear still. We're happy to also have @kosloot actively involved in his free retirement time, that of course also helps a lot! But continuity of research software needs proper attention and funding from projects or institutes in order to be really sustainable, and that's often difficult unfortunately.

Btw, I've also been working on other annotation solutions (STAM) where transition from FoLiA is explicitly included (but that too is in the scope of CLARIAH).

pirolen commented 1 year ago

OMG... I hope that inland funding continues for the awesome infrastructure of you guys, otherwise we could come up with an international solution? :-)

proycon commented 9 months ago

Some internal notekeeping on the debugging for this issue:

Bug 1 is caused by the FQL query being too broad:

USE flat/bbb2 PROCESSOR name "flat" type manual IN $FLAT_PROCESSOR IN $FOLIADOCSERVE_PROCESSOR EDIT t OF https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl WITH text "Hell0 Th15 iz a sentence, Bye1" datetime now confidence NONE textclass "ocroutput" FOR ID example.p.1 FORMAT flat RETURN target

The correct query (tested to work) needs a WHERE clause on textclass:

USE flat/bbb2 PROCESSOR name "flat" type manual IN $FLAT_PROCESSOR IN $FOLIADOCSERVE_PROCESSOR EDIT t OF https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl WHERE textclass = "ocroutput" WITH text "Hell0 Th15 iz a sentence, Bye1" datetime now confidence NONE textclass "ocroutput" FOR ID example.p.1 FORMAT flat RETURN target
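
To make the difference concrete, here is a small Python sketch that assembles the corrected query from its parts, with the WHERE clause restricting the edit to one text class. The helper name is hypothetical and this is not FLAT's actual code; the $FLAT_PROCESSOR and $FOLIADOCSERVE_PROCESSOR placeholders are copied literally from the query above, and quote escaping is deliberately left out.

```python
TEXT_SET = (
    "https://raw.githubusercontent.com/proycon/folia/master/"
    "setdefinitions/text.foliaset.ttl"
)


def build_text_edit_query(document, target_id, textclass, new_text, text_set=TEXT_SET):
    """Assemble an FQL text edit that only touches the given text class."""
    if '"' in new_text or '"' in textclass:
        # Escaping rules are not covered by this sketch.
        raise ValueError("double quotes are not handled in this sketch")
    parts = [
        f"USE {document}",
        'PROCESSOR name "flat" type manual IN $FLAT_PROCESSOR IN $FOLIADOCSERVE_PROCESSOR',
        f"EDIT t OF {text_set}",
        # Without this clause the edit matches every <t> of the target element.
        f'WHERE textclass = "{textclass}"',
        f'WITH text "{new_text}" datetime now confidence NONE textclass "{textclass}"',
        f"FOR ID {target_id}",
        "FORMAT flat RETURN target",
    ]
    return " ".join(parts)


query = build_text_edit_query(
    "flat/bbb2", "example.p.1", "ocroutput", "Hell0 Th15 iz a sentence, Bye1"
)
print(query)
```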

Working on a fix now...

proycon commented 9 months ago

The two bugs should now be resolved in flat v0.11.4

proycon / flat

Adding multiple text content elements in FLAT #188