ufal / udpipe

UDPipe: Trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files

Multiword tokens in trained tokenizer. #149

Closed zerogerc closed 3 years ago

zerogerc commented 3 years ago

Hi, thanks for the great tool!

I have a question regarding tokenization. I've trained a UDPipe 1 model on en_ewt 2.6 and I'm seeing multiword tokens in the output:

# sent_id = 1
# text = I don't know.
1       I       I       PRON    PRP     Case=Nom|Number=Sing|Person=1|PronType=Prs      4       nsubj   _       TokenRange=0:1
2-3     don't   _       _       _       _       _       _       _       TokenRange=2:7
2       do      do      AUX     VBP     Mood=Ind|Tense=Pres|VerbForm=Fin        4       aux     _       _
3       n't     not     PART    RB      _       4       advmod  _       _
4       know    know    VERB    VB      VerbForm=Inf    0       root    _       SpaceAfter=No|TokenRange=8:12
5       .       .       PUNCT   .       _       4       punct   _       SpaceAfter=No|TokenRange=12:13

At the same time, the udpipe server (http://lindat.mff.cuni.cz/services/udpipe/) does not output them.

I wonder if there is an option in the tokenizer to disable such ids?

P.S. As far as I know, the en_ewt dataset has no multiword tokens, so I'm not sure why a tokenizer trained on en_ewt outputs them.

foxik commented 3 years ago

The multiword tokens in en_ewt appeared in UD 2.7 -- did you by any chance use UD 2.7 or UD 2.8?

There is unfortunately no option to disable such multiword tokens in the output; currently you need to change the training data and retrain the tokenizer (or post-process its results) if you want output with different properties...
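If post-processing works for you, one possibility (a sketch of my own, not a UDPipe feature; the function name is illustrative) is to strip the multiword-token range lines from the CoNLL-U output, keeping only the individual syntactic words:

def drop_multiword_ranges(conllu):
    # Drop multiword-token range lines (IDs like "2-3") from CoNLL-U output.
    # Note: the surface form ("don't") is lost; only the syntactic words
    # ("do", "n't") and their annotations remain.
    kept = []
    for line in conllu.splitlines():
        fields = line.split("\t")
        # Token lines have 10 tab-separated fields; range IDs contain "-".
        if len(fields) == 10 and "-" in fields[0]:
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"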

zerogerc commented 3 years ago

I see, thank you! I think I might've downloaded the 2.7 version.

One more question about tokenization. The trained tokenizer splits the sentence "Margaret Thatcher who hated trains refused" into "Margaret" and "Thatcher who hated trains refused". Is there by any chance a way to disable sentence splitting in the tokenizer? Or maybe a way to split only on full stops?

foxik commented 3 years ago

The sentence splitter is currently quite weak, unfortunately (it will be addressed in UDPipe 3), and the English EWT data contains headings as sentences (i.e., sentences not terminated by a full stop or other punctuation), so the English EWT model sometimes splits sentences too eagerly.

The tokenizer can operate in a presegmented regime where the sentence splits are given (one sentence per line) and it only generates words, so you could, for example, split the text on a regex manually and only then run the tokenizer, as sketched below.
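For instance, a minimal sketch of splitting only after full stops before handing one sentence per line to a presegmented tokenizer (the regex and text here are illustrative, not part of UDPipe):

import re

text = "Margaret Thatcher who hated trains refused. She took the bus."
# Split after a full stop followed by whitespace, producing one sentence
# per line -- exactly the input a presegmented tokenizer expects.
presplit = "\n".join(part.strip() for part in re.split(r"(?<=\.)\s+", text) if part)
print(presplit)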

zerogerc commented 3 years ago

Am I right that I need to pass presegmented into tokenizer_options?

tokenizer = self._tokenizer.newTokenizer(tokenizer_options)

How can I do it if I already pass ranges like this:

self._tokenizer.newTokenizer('ranges')

I guess I need some separator, but I can't figure it out from the code.

foxik commented 3 years ago

Exactly. It is mentioned only in the user manual (and not in the API documentation) at https://ufal.mff.cuni.cz/udpipe/1/users-manual#run_udpipe_tokenizer, though it is admittedly "hidden" there rather than stated in a prominent place:

where data is a semicolon-separated list of the following options:
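So presegmented and ranges can simply be combined with a semicolon. A minimal sketch, assuming the ufal.udpipe Python bindings and an illustrative model path:

from ufal.udpipe import Model, ProcessingError, Sentence

model = Model.load("en_ewt.udpipe")  # hypothetical path to the trained model
# Tokenizer options are a semicolon-separated list.
tokenizer = model.newTokenizer("ranges;presegmented")

# In presegmented mode, every input line is treated as one sentence.
tokenizer.setText("Margaret Thatcher who hated trains refused.\nShe took the bus.")
error = ProcessingError()
sentence = Sentence()
while tokenizer.nextSentence(sentence, error):
    print(" ".join(word.form for word in sentence.words[1:]))  # words[0] is <root>
    sentence = Sentence()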

zerogerc commented 3 years ago

Nice, thank you. You saved me a lot of time!

foxik commented 3 years ago

You are welcome :-)

I am closing the issue, but feel free to continue it if required.