pirolen opened this issue 1 year ago
This is already built-in functionality. You can just request FoLiA output from python-ucto using foliaoutput=True, see the example in the README.
I have a Python dictionary that holds elements of a real-life dictionary, i.e. headword and body. While iterating over the Python dict, I aim to add its tokenized content to a folia.Entry element, so that I end up with a structure like the one below (created by CLI ucto). So I am constructing the FoLiA doc on the fly.
I initialized the tokenizer with
tokenizer = ucto.Tokenizer("tokconfig-generic", foliaoutput=True)
and then the script goes like:
ft = doc_out.add(folia.Text)
for k, v in entrydict.items():
    ctr += 1
    # Add new entry and ID
    ent = ft.add(folia.Entry, id='e'+str(ctr))
    try:
        # Process and add `Term` content
        fterm = ent.add(folia.Term)
        # Create and access tokenised data from ucto tokenizer
        tokenizer.process(k.strip())
        fterm.add(???)
How shall I add the sentences and tokens from the tokenizer to the Term element?
<entry xml:id="e2">
  <term xml:id="e2.term.1">
    <s xml:id="e2.term.1.s.1">
      <w xml:id="e2.term.1.s.1.w.1" class="WORD">
        <t>Ab</t>
      </w>
    </s>
  </term>
  <def xml:id="e2.def.1">
    <p xml:id="e2.def.1.p.1">
      <s xml:id="e2.def.1.p.1.s.1">
        <w xml:id="e2.def.1.p.1.s.1.w.1" class="WORD">
          <t>apud</t>
        </w>
        <w xml:id="e2.def.1.p.1.s.1.w.2" class="WORD">
          <t>Hebraeos</t>
        </w>
        <w xml:id="e2.def.1.p.1.s.1.w.3" class="WORD">
          <t>dicitur</t>
...
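As an aside on the example above: the xml:id values follow FoLiA's convention of extending the parent's ID with the element tag and a 1-based counter, and as far as I know folia.py generates such IDs automatically when you add() elements. A purely illustrative helper (not part of any library) just to make the pattern explicit:

```python
def child_id(parent_id, tagname, index):
    """Compose a FoLiA-style nested ID: parent ID + element tag + 1-based counter."""
    return f"{parent_id}.{tagname}.{index}"

# Reproduce the IDs from the example entry above
term_id = child_id("e2", "term", 1)      # e2.term.1
sent_id = child_id(term_id, "s", 1)      # e2.term.1.s.1
word_id = child_id(sent_id, "w", 1)      # e2.term.1.s.1.w.1
```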
Ah ok, you're feeding parts to the tokenizer on the fly; that indeed probably doesn't combine well with foliaoutput=True, as that produces entire documents for the input. You're on the right track:
is one supposed to recursively access sentences and tokens from the tokenizer, which yields Token types, and subsequently render the token contents by scripting (e.g. accessing a token's class and then specifying it for a folia.Word annotation)?
Yes, if you want to do it on-the-fly then there's no shortcut unfortunately.
How shall I add the sentences and tokens from the tokenizer to the Term element?
Iterate over tokenizer and call fterm.add().
OK, I see. I tried that earlier but got stuck, since I am not sure how to safely do the two things below.
This might be a lot of overhead, and I might indeed be better off creating the doc first and then running ucto over it.
Btw, in another ticket I made some notes about not being able to specify the textclass when passing foliaoutput=True: https://github.com/proycon/python-ucto/issues/13#issue-1624315384
- access sentence starts/ends
The Token instance has two methods to determine whether it is at the start/end of a sentence:

token.isbeginofsentence()
token.isendofsentence()

Similarly, there is a token.newparagraph() (the token starts a new paragraph) and a token.nospace() (the token is NOT followed by a space).
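To make the semantics of these flags concrete, here is a self-contained sketch using a stand-in Token class (the class, its constructor arguments, and the sample tokens are hypothetical, not the real ucto API) that shows how nospace() can be used to rebuild the surface text:

```python
class StubToken:
    """Minimal stand-in for a ucto Token, exposing only the flags discussed above."""
    def __init__(self, text, begin=False, end=False, nospace=False):
        self.text = text
        self._begin, self._end, self._nospace = begin, end, nospace
    def __str__(self):
        return self.text
    def isbeginofsentence(self):
        return self._begin
    def isendofsentence(self):
        return self._end
    def nospace(self):
        return self._nospace

tokens = [
    StubToken("apud", begin=True),
    StubToken("Hebraeos"),
    StubToken("dicitur", nospace=True),
    StubToken(".", end=True),
]

# Rebuild the surface text: append a space after a token unless nospace() is set
text = ""
for token in tokens:
    text += str(token)
    if not token.nospace():
        text += " "
print(text.strip())  # apud Hebraeos dicitur.
```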
- harmonize Tokens (which the tokenizer holds) with folia.Word annotations (which the folia doc expects to be added to sentences). E.g. one needs to access the token class and then specify it for a folia.Word annotation. I would be indebted for some example code.
Off the top of my head (untested, so there may be mistakes), take body to be the FoLiA structure where you want to add sentences and tokens (some subclass of folia.AbstractStructureElement):

sentence = None  # keep outside the loop so mid-sentence tokens reuse the current sentence
for token in tokenizer:
    if token.isbeginofsentence():
        sentence = body.add(folia.Sentence)
    word = sentence.add(folia.Word, str(token), space=not token.nospace())
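The control flow of that snippet can be exercised without ucto or folia installed. This sketch substitutes plain lists for body.add() and sentence.add() (stand-in names, purely illustrative) to show how the sentence-boundary flag drives the grouping:

```python
class StubToken:
    """Hypothetical stand-in for a ucto Token; only the flag we need here."""
    def __init__(self, text, begin=False):
        self.text = text
        self._begin = begin
    def __str__(self):
        return self.text
    def isbeginofsentence(self):
        return self._begin

# Two sentences' worth of tokens, mimicking what the tokenizer would yield
tokens = [StubToken("Ab", begin=True),
          StubToken("apud", begin=True),
          StubToken("Hebraeos"),
          StubToken("dicitur")]

body = []        # stands in for the folia structure element
sentence = None  # outside the loop, so mid-sentence tokens reuse the current sentence
for token in tokens:
    if token.isbeginofsentence():
        sentence = []           # stands in for body.add(folia.Sentence)
        body.append(sentence)
    sentence.append(str(token))  # stands in for sentence.add(folia.Word, ...)

print(body)  # [['Ab'], ['apud', 'Hebraeos', 'dicitur']]
```

Note that the grouping matches the example document above: "Ab" becomes a one-word sentence, while "apud Hebraeos dicitur" stays together.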
Btw, in another ticket I made some notes about not being able to specify the textclass when passing foliaoutput=True: https://github.com/proycon/python-ucto/issues/13#issue-1624315384
Ah right! Sorry, missed that one, will take a look!
I get 'Token' object has no attribute 'newparagraph' :-o
Right, it should be isnewparagraph().
I wonder if there is a straightforward way to add the sentences and their token content from the tokenizer in order to build a new folia doc. It is not clear to me how to do that with the add method: is one supposed to recursively access sentences and tokens from the tokenizer, which yields Token types, and subsequently render the token contents by scripting (e.g. accessing a token's class and then specifying it for a folia.Word annotation), or is there a direct way to add the tokenizer's content structure to the FoLiA doc? Or is python-ucto not meant to be used for that, and should one rather first create a folia doc with untokenized content and run CLI ucto on it?