proycon / python-ucto

This is a Python binding to the tokeniser Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial a task as it appears to be. This binding makes the power of the ucto tokeniser available to Python. Ucto itself is a regular-expression based, extensible, and advanced tokeniser written in C++ (http://ilk.uvt.nl/ucto).

Adding the tokenizer contents to a FoLiA doc #14

Open pirolen opened 1 year ago

pirolen commented 1 year ago

I wonder if there is a straightforward way to add the sentences and their token content from the tokenizer in order to build a new folia doc. It is not clear to me how to do that with the add method: is one supposed to recursively access sentences and tokens from the tokenizer (which yields Token types) and then render the token contents by scripting (e.g. accessing a token's class and then specifying it for a folia.Word annotation), or is there a direct way to add the tokenizer's content structure to the FoLiA doc?

Or is python-ucto not meant to be used for that, and one should rather first create a folia doc with untokenized content and run CLI ucto on it?

proycon commented 1 year ago

This is already built-in functionality. You can just request FoLiA output from python-ucto using foliaoutput=True, see the example in the README.
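For reference, the README usage goes roughly like this (a sketch from memory, untested here; the config name and file paths are placeholders, so check the actual README for the authoritative version):

```python
import ucto

# Request FoLiA XML output instead of plain tokenised text:
tokenizer = ucto.Tokenizer("tokconfig-eng", foliaoutput=True)

# tokenize(inputfile, outputfile) runs the tokeniser over the whole
# input file and writes a complete FoLiA document to the output file.
tokenizer.tokenize("input.txt", "output.folia.xml")
```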

pirolen commented 1 year ago

I have a Python dictionary that holds the elements of a real-life dictionary, i.e. headwords and bodies. While iterating over the Python dict, I aim to add its tokenized content to a folia.Entry element, so that I end up with a structure like the one below (created by CLI ucto).

So I am constructing the folia doc on the fly.

I initiated the tokenizer with tokenizer = ucto.Tokenizer("tokconfig-generic", foliaoutput=True)

and then the script goes like:

ctr = 0
ft = doc_out.add(folia.Text)
for k, v in entrydict.items():
    ctr += 1
    ''' Add new entry and ID '''
    ent = ft.add(folia.Entry, id='e'+str(ctr))
    try:
        ''' Process and add `Term` content  '''
        fterm = ent.add(folia.Term)
        ''' Create and access tokenised data from ucto tokenizer '''
        tokenizer.process(k.strip())
        fterm.add(???)

How shall I add the sentences and tokens from the tokenizer to the Term element?


   <entry xml:id="e2">
      <term xml:id="e2.term.1">
        <s xml:id="e2.term.1.s.1">
          <w xml:id="e2.term.1.s.1.w.1" class="WORD">
            <t>Ab</t>
          </w>
        </s>
      </term>
      <def xml:id="e2.def.1">
        <p xml:id="e2.def.1.p.1">
          <s xml:id="e2.def.1.p.1.s.1">
            <w xml:id="e2.def.1.p.1.s.1.w.1" class="WORD">
              <t>apud</t>
            </w>
            <w xml:id="e2.def.1.p.1.s.1.w.2" class="WORD">
              <t>Hebraeos</t>
            </w>
            <w xml:id="e2.def.1.p.1.s.1.w.3" class="WORD">
              <t>dicitur</t>
...
proycon commented 1 year ago

Ah ok, you're feeding parts to the tokenizer on the fly; that indeed doesn't combine well with foliaoutput=True, as that mode produces entire documents for the input. You're on the right track:

is one supposed to recursively access sentences and tokens from the tokenizer that yields Token types, and subsequently render the token contents by scripting (e.g. accessing a token class and then specifying it for a folia.Word annotation)

Yes, if you want to do it on-the-fly then there's no shortcut unfortunately.

How shall I add the sentences and tokens from the tokenizer to the Term element?

Iterate over tokenizer and call fterm.add()

pirolen commented 1 year ago

OK, I see. I tried that earlier but got stuck, since I am not sure how to safely:

  • access sentence starts/ends
  • harmonize Tokens (which the tokenizer holds) with folia.Word annotation (which the folia doc expects to be added to sentences). E.g. one needs to access a token's class and then specify it for a folia.Word annotation. I would be indebted for some example code.

This might be a lot of overhead and I might be better off indeed to create the doc first and then run ucto over it.

Btw, in another ticket I made some notes about not being able to specify the textclass when passing foliaoutput=True: https://github.com/proycon/python-ucto/issues/13#issue-1624315384

proycon commented 1 year ago

  • access sentence starts/ends

The Token instance has two methods to determine whether it is at the start/end of a sentence: token.isbeginofsentence() and token.isendofsentence().

Similarly, there is a token.newparagraph() (token starts a new paragraph) and a token.nospace() (token is NOT followed by a space).
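To make the flag semantics concrete, here is a tiny self-contained sketch; StubToken is a hypothetical stand-in for ucto's Token, mimicking just the nospace() method, and the loop rebuilds the surface text from a token stream:

```python
class StubToken:
    """Hypothetical stand-in for ucto's Token: text plus a nospace flag."""
    def __init__(self, text, nospace=False):
        self.text = text
        self._nospace = nospace

    def nospace(self):
        # True means the token is NOT followed by a space in the input
        return self._nospace

    def __str__(self):
        return self.text

tokens = [StubToken("Hello"), StubToken("world", nospace=True), StubToken(".")]

# Detokenise: append a space after each token unless it reported nospace()
detok = ""
for token in tokens:
    detok += str(token)
    if not token.nospace():
        detok += " "
detok = detok.strip()
print(detok)  # Hello world.
```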

  • harmonize Tokens (which the tokenizer holds) with folia.Word annotation (which the folia doc expects to be added to sentences). E.g. one needs to access a token's class and then specify it for a folia.Word annotation. I would be indebted for some example code.

Off the top of my head (untested, so there may be mistakes), take body to be the FoLiA structure where you want to add sentences and tokens (some subclass of folia.AbstractStructureElement):

sentence = None
for token in tokenizer:
    if token.isbeginofsentence():
        sentence = body.add(folia.Sentence)

    word = sentence.add(folia.Word, str(token), space=not token.nospace())
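The loop above can be exercised without ucto or FoLiA installed by stubbing the token stream; StubToken and the plain-list sentences below are illustrative stand-ins, mimicking only the isbeginofsentence() method the loop relies on:

```python
class StubToken:
    # Hypothetical stand-in for ucto's Token, for demonstration only
    def __init__(self, text, begin=False):
        self.text = text
        self._begin = begin

    def isbeginofsentence(self):
        return self._begin

    def __str__(self):
        return self.text

stream = [
    StubToken("Hello", begin=True),
    StubToken("world"),
    StubToken("."),
    StubToken("Bye", begin=True),
    StubToken("."),
]

# Same grouping logic as the loop above: start a fresh sentence whenever
# a token reports it begins one, then append every token to the current
# sentence (here a plain list instead of a folia.Sentence).
sentences = []
for token in stream:
    if token.isbeginofsentence():
        sentences.append([])
    sentences[-1].append(str(token))

print(sentences)  # [['Hello', 'world', '.'], ['Bye', '.']]
```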

Btw, in another ticket I made some notes about not being able to specify the textclass when passing foliaoutput=True: https://github.com/proycon/python-ucto/issues/13#issue-1624315384

Ah right! Sorry, missed that one, will take a look!

pirolen commented 1 year ago

I get 'Token' object has no attribute 'newparagraph' :-o

proycon commented 1 year ago

Right, it should be isnewparagraph().