Open proycon opened 3 years ago
(relates also to #16)
Technical note: This does introduce another challenge. Just as FoLiA has text consistency and text validation, which ensure that text specified on multiple levels is consistent with each other, this would introduce a similar concept of markup consistency and markup validation.
This seems like something you would like to have on some occasions, but not always. E.g. for a tokenizer, you would like to have the 'un-styled' strings. So maybe we must introduce a 'formatted' attribute or similar on the <t> nodes?
<t>This is a good example</t>
vs.
<t formatted="1">This is a <t-style class="bold">good</t-style> example</t>
On second thought: this is not really a good idea; it would break too many things. Still, we need both worlds. So a more down-to-earth solution is to add text() variants that maintain the structure, while keeping the current text() and str() functions. The return value would probably be a TextContent.
I don't think there's any need for such an attribute and don't really see what problem it would solve. Calling text() on a TextContent element returns the plain text (regardless of any markup within); similarly, an XPath text() does the same. We definitely shouldn't change that.
Getting all the markup requires calling textcontent() and then diving deeper into that. It's a bit more complex by definition, but that can't be helped, I think.
Yes, in fact that was my conclusion too. The new function I mentioned should do the deeper diving and return a TextContent which holds the (combined) styles of the deeper elements.
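To make the distinction in this exchange concrete, here is a minimal sketch of "plain text" versus "text with its markup retained" for a single <t> element. It deliberately uses only Python's standard library and not the FoLiA libraries; the function names `plain_text` and `styled_runs` are hypothetical and only illustrate the idea, not how the proposed function would actually be implemented or what it would return.

```python
import xml.etree.ElementTree as ET


def plain_text(t):
    """Return the text of a <t> element with all markup stripped."""
    return "".join(t.itertext())


def styled_runs(t, inherited=()):
    """Return a list of (text, classes) runs for a <t> element.

    `classes` is the tuple of t-style classes enclosing that piece of
    text, outermost first, so nested styles are combined.
    """
    runs = []
    if t.text:
        runs.append((t.text, inherited))
    for child in t:
        classes = inherited
        if child.tag == "t-style" and child.get("class"):
            classes = inherited + (child.get("class"),)
        runs.extend(styled_runs(child, classes))
        # Text following the child belongs to the parent's style context.
        if child.tail:
            runs.append((child.tail, inherited))
    return runs


t = ET.fromstring('<t>This is a <t-style class="bold">good</t-style> example</t>')
print(plain_text(t))
print(styled_runs(t))
```

Concatenating the text of all runs gives back the plain text, which is the kind of invariant a markup consistency check could rely on.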
[Not sure if this is the right place to ask] Would the font styling info propagation also work in Ucto?
My FLAT use case is to tokenize the text first in order to enable the (fully manual) word-level spelling error corrections. I would be happy if this would be achievable, even if decoupled from viewing/accepting TICCL's suggestions (which one could visualize in parallel in an editor, or a separate FLAT window, for the moment).
[Related]
Currently I don't seem to be able to call textcontent() on a paragraph's parts using foliapy :-( and only the 1st part is accessed... I did:
for par in doc.paragraphs():
    for part in par.annotation(folia.Part):
        print(part.text())  # works
        print(type(part.textcontent()))  # raises an error
  File "/home/ubuntu/piro/projects/lamadev/lmdev/lib/python3.6/site-packages/folia/main.py", line 1199, in textcontent
    raise NoSuchText
folia.main.NoSuchText
Specifying the class as ... .textcontent(cls="OCR") does not seem to make a difference.
Just wondering if one can add some dummy placeholder style-markup annotation to ucto-tokenized folia.xml in FLAT (regardless of the propagation not yet being in place). Would it be in principle possible, just for the sake of carrying out some test annotation round?
> Would the font styling info propagation also work in Ucto?
It would be a separate post-processing step you need to run after ucto
> Just wondering if one can add some dummy placeholder style-markup annotation to ucto-tokenized folia.xml in FLAT (regardless of the propagation not yet being in place).
FLAT doesn't really render the style-markup at all currently.
> It would be a separate post-processing step you need to run after ucto
That would be cool in that way too.
> FLAT doesn't really render the style-markup at all currently.
I should rather have asked: would it also be possible to do some post-processing after annotating in FLAT, so that the style information can be re-assigned and other tools, such as folia2html, can process it again?
If there is markup information in a higher text layer, say on paragraph level, we want to be able to replicate that markup information on lower levels (say sentences or words) if it is not yet available there. We also want the reverse: if there is markup information on lower levels, we want to express it on higher levels as well.
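One way to picture this two-way propagation is with character offsets. The sketch below is only an illustration under stated assumptions: it does not use the FoLiA libraries, it represents markup as a flat list of (text, style) runs, and the names `propagate_down` and `propagate_up` are hypothetical. Text between tokens (whitespace) is assumed to be unstyled when going back up.

```python
import re


def propagate_down(runs, tokens):
    """Split paragraph-level styled runs over word-level tokens.

    runs:   list of (text, style) covering the whole paragraph text
    tokens: list of (start, end) character offsets into that text
    Returns, per token, the list of (text, style) pieces overlapping it.
    """
    full = "".join(text for text, _ in runs)
    spans, pos = [], 0
    for text, style in runs:
        spans.append((pos, pos + len(text), style))
        pos += len(text)
    result = []
    for t_start, t_end in tokens:
        pieces = []
        for s_start, s_end, style in spans:
            lo, hi = max(t_start, s_start), min(t_end, s_end)
            if lo < hi:  # the run overlaps this token
                pieces.append((full[lo:hi], style))
        result.append(pieces)
    return result


def propagate_up(token_pieces, tokens, text):
    """Rebuild paragraph-level runs from word-level pieces."""
    runs = []

    def add(chunk, style):
        if not chunk:
            return
        if runs and runs[-1][1] == style:
            # Merge with the previous run when the style is unchanged.
            runs[-1] = (runs[-1][0] + chunk, style)
        else:
            runs.append((chunk, style))

    pos = 0
    for (start, end), pieces in zip(tokens, token_pieces):
        add(text[pos:start], None)  # unstyled gap between tokens
        for chunk, style in pieces:
            add(chunk, style)
        pos = end
    add(text[pos:], None)  # trailing text after the last token
    return runs


runs = [("This is a ", None), ("good", "bold"), (" example", None)]
text = "".join(chunk for chunk, _ in runs)
tokens = [m.span() for m in re.finditer(r"\S+", text)]  # naive whitespace tokens
words = propagate_down(runs, tokens)
print(words)
print(propagate_up(words, tokens, text))
```

In this example the round trip reproduces the original runs. A real implementation would also have to decide what to do when a style boundary falls inside a token, which this sketch handles by splitting the token into several pieces.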