proycon / foliatools

A number of command-line tools for working with FoLiA (Format for Linguistic Annotation). Includes validators, converters, visualisers, and more.
GNU General Public License v3.0

[foliatextcontent] propagate markup information to higher/lower levels #19

Open proycon opened 3 years ago

proycon commented 3 years ago

If there is markup information in a higher text layer, say on paragraph level, we want to be able to replicate that markup information on lower levels (say sentences or words), if it is not yet available there. We also want the reverse: if there is markup information on lower levels, we want to express it on higher levels as well.
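To make the idea concrete, here is a minimal sketch of propagation in both directions. It does not use the FoLiA library; it assumes markup can be represented as character-offset spans, and the names `Span`, `propagate_down` and `propagate_up` are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A styled stretch of text: offsets [start, end) plus a style class."""
    start: int
    end: int
    cls: str


def propagate_down(parent_spans, child_offsets):
    """For each child range (start, end) in the parent text, return the
    parent spans clipped to that range, with offsets relative to the child."""
    result = []
    for c_start, c_end in child_offsets:
        clipped = []
        for s in parent_spans:
            lo, hi = max(s.start, c_start), min(s.end, c_end)
            if lo < hi:  # the span overlaps this child
                clipped.append(Span(lo - c_start, hi - c_start, s.cls))
        result.append(clipped)
    return result


def propagate_up(child_spans, child_offsets):
    """Reverse direction: shift each child's spans by the child's offset
    in the parent text. Adjacent spans are not merged in this sketch."""
    lifted = []
    for spans, (c_start, _c_end) in zip(child_spans, child_offsets):
        lifted.extend(Span(s.start + c_start, s.end + c_start, s.cls) for s in spans)
    return lifted


paragraph = "This is a good example"
para_spans = [Span(10, 14, "bold")]  # "good" is bold on paragraph level
word_offsets = [(0, 4), (5, 7), (8, 9), (10, 14), (15, 22)]

word_spans = propagate_down(para_spans, word_offsets)
print(word_spans[3])                           # markup on the word "good"
print(propagate_up(word_spans, word_offsets))  # back on paragraph level
```

A real implementation would also have to decide what to do when a lower level already carries markup of its own, which the sketch ignores.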

proycon commented 3 years ago

(relates also to #16)

proycon commented 3 years ago

Technical note: This does introduce another challenge. Just as we have text consistency and text validation in FoLiA, which ensure that text specified on multiple levels is consistent with each other, this would introduce a similar concept of markup consistency and markup validation.
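As a rough illustration of what such a markup consistency check could look like, the sketch below compares styled character positions on a higher level with those on its lower-level elements. It is not FoLiA's validation logic; the representation (tuples of `(start, end, cls)`) and the function names are assumptions made for this example.

```python
def styled_chars(spans, offset=0):
    """Expand (start, end, cls) spans into a set of (position, cls) pairs."""
    return {(offset + i, cls) for start, end, cls in spans for i in range(start, end)}


def markup_consistent(parent_spans, children):
    """children is a list of (offset_in_parent, length, spans).

    Only positions covered by a child are compared, because whitespace
    between children has no counterpart on the lower level."""
    covered = set()
    child_chars = set()
    for offset, length, spans in children:
        covered.update(range(offset, offset + length))
        child_chars |= styled_chars(spans, offset)
    parent_chars = {(pos, cls) for pos, cls in styled_chars(parent_spans) if pos in covered}
    return parent_chars == child_chars


# "This is a good example", with "good" (offsets 10-14) in bold
parent_spans = [(10, 14, "bold")]
consistent = markup_consistent(parent_spans, [(0, 4, []), (10, 4, [(0, 4, "bold")])])
inconsistent = markup_consistent(parent_spans, [(0, 4, []), (10, 4, [])])
print(consistent, inconsistent)
```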

kosloot commented 3 years ago

This seems like something you would want on some occasions, but not always. For example, a tokenizer would want the 'un-styled' strings. So maybe we must introduce a 'formatted' attribute or something similar on the <t> nodes?

<t>This is a good example</t>

vs.

<t formatted="1">This is a <t-style class="bold">good</t-style> example</t>

On second thought: this is not really a good idea, it would break too many things. Still, we need both worlds. So a more down-to-earth solution is to add text() variants that maintain the structure, while keeping the current text() and str() functions. The return value would probably be a TextContent.

proycon commented 3 years ago

I don't think there's any need for such an attribute, and I don't really see what problem it would solve. Calling text() on a TextContent element returns the plain text (regardless of any markup within), and an XPath text() does the same. We definitely shouldn't change that.

Getting all the markup requires calling textcontent() and then diving deeper into that, it's a bit more complex by definition but that can't be helped I think.
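The difference between the plain text and the markup inside a text content element can be sketched with Python's standard `xml.etree.ElementTree`, independent of the FoLiA library. The snippet below is a simplification: it ignores the FoLiA XML namespace, and `plain_text` and `markup_spans` are hypothetical helper names, not foliapy functions.

```python
import xml.etree.ElementTree as ET


def plain_text(t):
    """Concatenate all text under the element, ignoring any markup."""
    return "".join(t.itertext())


def markup_spans(t):
    """Return (tag, class, start, end) for each markup element,
    with offsets into plain_text(t). Nested elements come out inner-first."""
    spans = []

    def walk(elem, pos):
        pos += len(elem.text or "")
        for child in elem:
            start = pos
            pos = walk(child, pos)  # position after the child's own content
            spans.append((child.tag, child.get("class"), start, pos))
            pos += len(child.tail or "")
        return pos

    walk(t, 0)
    return spans


t = ET.fromstring('<t>This is a <t-style class="bold">good</t-style> example</t>')
print(plain_text(t))    # the text without markup
print(markup_spans(t))  # the markup, as offsets into that text
```

Keeping the markup as offsets into the unchanged plain text is one way to leave text() untouched while still exposing the styling.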

kosloot commented 3 years ago

Yes, in fact that was my conclusion too. The new function I mentioned should do the deeper diving, and return a TextContent which holds the (combined) styles of the deeper elements.

pirolen commented 3 years ago

[Not sure if this is the right place to ask] Would the font styling info propagation also work in Ucto?

My FLAT use case is to tokenize the text first in order to enable the (fully manual) word-level spelling error corrections. I would be happy if this would be achievable, even if decoupled from viewing/accepting TICCL's suggestions (which one could visualize in parallel in an editor, or a separate FLAT window, for the moment).

pirolen commented 3 years ago

[Related] Currently I don't seem to be able to call textcontent() successfully on a paragraph's parts using foliapy :-( and only the first part is accessed. I did:

for par in doc.paragraphs(): 
    for part in par.annotation(folia.Part):
        print(part.text()) ## works
        print(type(part.textcontent())) ## raises an error.

  File "/home/ubuntu/piro/projects/lamadev/lmdev/lib/python3.6/site-packages/folia/main.py", line 1199, in textcontent
    raise NoSuchText
folia.main.NoSuchText

Specifying the class as ... .textcontent(cls="OCR") does not seem to make a difference.

pirolen commented 3 years ago

Just wondering if one can add some dummy placeholder style-markup annotation to ucto-tokenized folia.xml in FLAT (regardless of the propagation not yet being in place). Would it be in principle possible, just for the sake of carrying out some test annotation round?

proycon commented 3 years ago

> Would the font styling info propagation also work in Ucto?

It would be a separate post-processing step that you need to run after ucto.

> Just wondering if one can add some dummy placeholder style-markup annotation to ucto-tokenized folia.xml in FLAT (regardless of the propagation not yet being in place).

FLAT doesn't really render the style-markup at all currently.

pirolen commented 3 years ago

> It would be a separate post-processing step you need to run after ucto

That would work for me too.

> FLAT doesn't really render the style-markup at all currently.

I should rather have asked: would it also be possible to do some post-processing after annotating in FLAT, so that the style information can be re-assigned and other tools, such as folia2html, can process it again?