Tokenization without segmentation

ufal / udpipe

UDPipe: Trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files

Mozilla Public License 2.0

Tokenization without segmentation #3

Closed: martinpopel closed this issue 7 years ago

martinpopel commented 8 years ago

Is it possible to run just the tokenizer without the segmenter? Of course, if the sentence gets divided into several segments, I can merge them (calling addWord() on the first Ufal::UDPipe::Sentence segment to add the words from the other segments), but that is extra work, especially if I also want to handle multiword tokens.

foxik commented 8 years ago

This is currently not possible, but we can add a method to the Tokenizer. For example, in addition to the current setText and nextSentence, we could also have tokenizeSentence. It would probably have a default implementation that internally calls setText and nextSentence and merges the results.
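For illustration, the merging step discussed here might look roughly like the sketch below. It deliberately does not use the UDPipe bindings: `Word`, `MultiwordToken`, `Sentence` and `merge_segments` are hypothetical stand-ins invented for this example, and the sketch assumes 1-based word ids with no artificial root node. The point it shows is that word ids and multiword-token ranges from later segments have to be shifted by the number of words already merged.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical stand-ins, NOT the UDPipe API.
@dataclass
class Word:
    id: int      # 1-based position within its sentence
    form: str


@dataclass
class MultiwordToken:
    id_first: int  # id of the first word covered by the token
    id_last: int   # id of the last word covered by the token
    form: str      # surface form, e.g. "del" for "de" + "el"


@dataclass
class Sentence:
    words: List[Word] = field(default_factory=list)
    multiword_tokens: List[MultiwordToken] = field(default_factory=list)


def merge_segments(segments: List[Sentence]) -> Sentence:
    """Concatenate segments into one sentence, renumbering ids."""
    merged = Sentence()
    for segment in segments:
        # Every id in this segment moves by the number of words merged so far.
        offset = len(merged.words)
        for word in segment.words:
            merged.words.append(Word(word.id + offset, word.form))
        for token in segment.multiword_tokens:
            merged.multiword_tokens.append(
                MultiwordToken(token.id_first + offset,
                               token.id_last + offset,
                               token.form))
    return merged


# Two segments that a segmenter might produce for "Vengo. del mar".
first = Sentence(words=[Word(1, "Vengo"), Word(2, ".")])
second = Sentence(
    words=[Word(1, "de"), Word(2, "el"), Word(3, "mar")],
    multiword_tokens=[MultiwordToken(1, 2, "del")],
)
merged = merge_segments([first, second])
```

In this example the multiword token "del", which covered words 1 to 2 in its own segment, ends up covering words 3 to 4 of the merged sentence.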

Do you think it is worth it? It would be fairly little work.

martinpopel commented 8 years ago

Thanks. I just wanted to check that I hadn't missed anything. Meanwhile, I've implemented the merging myself (so far without multiword tokens). So for me (i.e. for Udapi) it is not worth a new release of UDPipe, and you can close this issue. Let's wait and see whether other users request this feature.

martinpopel commented 8 years ago

A comment in the POD says "tokenizing on request", but I was not able to find out how to request tokenization in the run_udpipe.pl example. Only three input formats are currently available (vertical, horizontal and conllu), and all of them require already tokenized text (even the horizontal one, according to its specification). I've deleted the comment and added an example of tokenization without segmentation in #4.

foxik commented 8 years ago

You can request tokenization in run_udpipe.pl by using the tokenizer input format (unfortunately, that is probably not documented anywhere, because I did not get to finish the documentation).

I am not going to add no_segmentation.pl, because it is a Perl-only solution to a general UDPipe issue. I want all the functionality to be available "everywhere", so if we want this, it should be accessible from all of the following:

the API
the binaries (probably using an option to the tokenizer)
the Pipeline object

One possibility is to use "tokenizer options" for this, since the API already accepts arbitrary options for the tokenizer, the tagger, and the parser (for example, beam_search for the parser). That way the functionality would be accessible everywhere.
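As a rough sketch of the options idea, the toy code below switches sentence splitting off through an option string. Everything in it is an assumption made for illustration: the option name `no_segmentation`, the `name=value` pairs separated by semicolons, and the functions `parse_options` and `toy_tokenize` are invented here and are not claimed to match what UDPipe actually does.

```python
import re
from typing import Dict, List


def parse_options(spec: str) -> Dict[str, str]:
    """Parse a hypothetical 'name=value;flag' option string into a dict."""
    options: Dict[str, str] = {}
    for item in spec.split(";"):
        item = item.strip()
        if not item:
            continue
        name, sep, value = item.partition("=")
        # A bare flag without '=' is stored with an empty value.
        options[name.strip()] = value.strip() if sep else ""
    return options


def toy_tokenize(text: str, options: Dict[str, str]) -> List[List[str]]:
    """Toy tokenizer: words and punctuation marks become tokens.

    Sentences are split after '.', '!' or '?' unless the hypothetical
    'no_segmentation' option is present.
    """
    tokens = re.findall(r"\w+|[^\w\s]", text)
    if "no_segmentation" in options:
        return [tokens]
    sentences: List[List[str]] = []
    current: List[str] = []
    for token in tokens:
        current.append(token)
        if token in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


segmented = toy_tokenize("Hi there. Bye!", parse_options(""))
unsegmented = toy_tokenize("Hi there. Bye!", parse_options("no_segmentation"))
```

Because the switch travels inside the option string, any caller that can already pass options (an API call, a command-line binary, or a pipeline wrapper) could use it without a new method.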

foxik commented 7 years ago

Implemented in 8c4cc229; the documentation will hopefully arrive soon.