proycon / foliatools

A number of command-line tools for working with FoLiA (Format for Linguistic Annotation). Includes validators, converters, visualisers, and more.
GNU General Public License v3.0

[foliatextcontent] Implement adding markup information in the text that points to the substrings #23

Closed proycon closed 3 years ago

proycon commented 3 years ago

This is needed for proycon/flat#92. There is already an option for this in foliatextcontent, but it doesn't seem to work in all cases yet, most specifically in the case where the text content is already present rather than generated by foliatextcontent.
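As a rough illustration of the idea (not the foliatextcontent implementation), the sketch below takes text content that is already present plus a list of substrings with offsets, checks each offset against the text, and wraps the matching span in inline markup that points back to the substring by ID. The function name `add_substring_markup`, the tuple layout, and the exact serialisation of the `t-str` markup are assumptions made for this example only.

```python
from xml.sax.saxutils import escape


def add_substring_markup(text, substrings):
    """Wrap each referenced substring of `text` in inline markup.

    `substrings` is a list of (id, offset, value) tuples (hypothetical
    layout for this sketch). Offsets are character offsets into `text`.
    """
    parts = []
    cursor = 0
    # Process substrings from left to right so the output stays in text order.
    for sid, offset, value in sorted(substrings, key=lambda s: s[1]):
        if offset < cursor:
            raise ValueError(f"substring {sid!r} overlaps a previous one")
        # The offset must really point at the stated substring.
        if text[offset:offset + len(value)] != value:
            raise ValueError(
                f"substring {sid!r} does not match the text at offset {offset}"
            )
        parts.append(escape(text[cursor:offset]))
        # Illustrative serialisation: inline markup carrying a reference ID.
        parts.append(f'<t-str id="{sid}">{escape(value)}</t-str>')
        cursor = offset + len(value)
    parts.append(escape(text[cursor:]))
    return "".join(parts)


if __name__ == "__main__":
    print(add_substring_markup("Hello big world", [("s.1", 6, "big")]))
```

The offset check is the part that matters when the text content already exists: a substring whose offset no longer lines up with the text is reported rather than silently marked up in the wrong place.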

proycon commented 3 years ago

<correction> elements over <str> should be translated to the proper <t-correction> elements.

pirolen commented 3 years ago

Awesome, thanks so much for making this enhancement!

Just wondering (not requesting): would this enable manual correction operations, at least partly? For example, when a superscript number ("17") was misrecognized as an apostrophe: <t class="OCR" offset="1086">Materialien'</t>

-->

(Once the PAGE-XML to FoLiA converter is there, I could use ucto and generate a test file; please let me know if I could do something more.)

proycon commented 3 years ago

(I don't think this relates directly to this issue, which is about substrings, i.e. arbitrary references on untokenised text.)

I assume you refer to manual annotation in FLAT. Editing corrections in FLAT indeed only works on the token level. If a tokenised document is available with all the markup information present, then the procedure you described would work for the first three steps. The fourth is still an issue, though: FLAT doesn't support annotating markup (e.g. style) yet; its markup support is currently limited to viewing.

The other caveat is preserving all the markup information after tokenisation; ucto currently doesn't do that. You're currently stuck with the markup information mostly on the paragraph level. Neither TICCL nor ucto propagates it to deeper levels, which is what you need if you want to correct it in FLAT. I had already opened a related issue to implement this specific functionality in foliatextcontent: #19. The good news is that this should all be automatically resolvable.

pirolen commented 3 years ago

Awesome, thanks! Sorry about commenting on the wrong issue; I did indeed mean the functionality of FLAT.