Closed zerogerc closed 3 years ago
The multiword tokens in `en_ewt` appeared in UD 2.7 -- did you by any chance use UD 2.7 or UD 2.8?
There is unfortunately no option to disable such multiword tokens on the output; currently you need to change the training data and retrain the tokenizer (or post-process its results) if you want output with different properties...
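If retraining is not an option, post-processing is straightforward: in CoNLL-U output, multiword token lines carry a range ID such as `1-2` in the first column, so they can simply be filtered out. A minimal sketch (the function name is just for illustration, assuming plain-text CoNLL-U):

```python
def strip_multiword_tokens(conllu):
    # Multiword token lines in CoNLL-U have a range ID like "1-2" in the
    # first column; dropping them keeps only the syntactic words.
    kept = []
    for line in conllu.splitlines():
        first_column = line.split('\t', 1)[0]
        if '-' in first_column and not line.startswith('#'):
            continue
        kept.append(line)
    return '\n'.join(kept)
```

Note this only removes the surface-token line; the individual word lines (e.g. `1 do` and `2 n't`) are kept.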
I see, thank you! I think I might've downloaded the 2.7 version.
One more question about tokenization. The trained tokenizer split the sentence `Margaret Thatcher who hated trains refused` into `Margaret` and `Thatcher who hated trains refused`. Is there by any chance a way to disable sentence splitting in the tokenizer? Or maybe a way to split only on full stops?
The sentence splitter is currently quite weak, unfortunately (it will be addressed in UDPipe 3), and the English EWT data contains headings as sentences (i.e., not terminated by a full stop or other punctuation), so the English EWT model sometimes splits sentences too eagerly.
The tokenizer can operate in a `presegmented` mode, where the sentence splits are given (one sentence per line) and it only generates words, so you could, for example, split on a given regex manually and only then run the tokenizer.
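For instance, to split only on full stops (and similar terminal punctuation), you could pre-split the text with a regex and put one sentence per line before handing it to the presegmented tokenizer. A sketch (the helper name is hypothetical):

```python
import re

def presplit_sentences(text):
    # Split only after terminal punctuation followed by whitespace; each
    # resulting sentence goes on its own line, which is the input format
    # the presegmented tokenizer expects.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return '\n'.join(s for s in sentences if s)

print(presplit_sentences(
    "Margaret Thatcher who hated trains refused. She took the car."))
```

Since `Margaret Thatcher who hated trains refused` contains no full stop before `She took the car.`, the first split point is the one after `refused.`, so the heading-style over-splitting goes away.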
Am I right that I need to pass `presegmented` into `tokenizer_options`?
tokenizer = self._tokenizer.newTokenizer(tokenizer_options)
How can I do it if I already pass `ranges` like this:
self._tokenizer.newTokenizer('ranges')
I guess I need some separator, but I can't figure it out from the code.
Exactly. It is mentioned only in the user manual (and not in the API one), https://ufal.mff.cuni.cz/udpipe/1/users-manual#run_udpipe_tokenizer, but it is definitely "hidden" rather than being in a prominent place:

> where `data` is a semicolon-separated list of the following options:
Nice, thank you. You saved me a lot of time!
You are welcome :-)
I am closing the issue, but feel free to continue it if required.
Hi, thanks for the great tool!
I have a question regarding tokenization. I've trained a UDPipe 1 model on `en_ewt_2.6` and am seeing multiword tokens in the output. At the same time, the UDPipe server (http://lindat.mff.cuni.cz/services/udpipe/) does not output them.
I wonder if there is an option in the tokenizer to disable such IDs?
P.S. As far as I know, the `en_ewt` dataset has no multiword IDs. I'm not sure why a tokenizer trained on `en_ewt` outputs them.