Closed by martinpopel 7 years ago
This is currently not possible, but we can add a method to the Tokenizer. For example, in addition to the current setText and nextSentence, we could also have tokenizeSentence. It would probably have a default implementation which internally calls setText and nextSentence while merging the results.
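The proposed default implementation (one setText call, then a nextSentence loop that merges all segments into a single word list) can be sketched as follows. This is an illustrative Python sketch, not the real UDPipe API: StubTokenizer is a toy stand-in for the actual Tokenizer class.

```python
# Sketch of the proposed default tokenizeSentence(): call setText(),
# drain nextSentence(), and merge all segments into one word list.
# StubTokenizer is a stand-in, NOT the real ufal.udpipe Tokenizer.

class StubTokenizer:
    """Minimal stand-in: splits on '.' for sentences, whitespace for words."""
    def __init__(self):
        self._sentences = []

    def setText(self, text):
        self._sentences = [s.split() for s in text.split(".") if s.split()]

    def nextSentence(self):
        return self._sentences.pop(0) if self._sentences else None

def tokenize_sentence(tokenizer, text):
    """Proposed default implementation: merge all segments into one word list."""
    tokenizer.setText(text)
    words = []
    while (sentence := tokenizer.nextSentence()) is not None:
        words.extend(sentence)
    return words
```

For example, `tokenize_sentence(StubTokenizer(), "Hello world. Bye now.")` returns `["Hello", "world", "Bye", "now"]`: two segments, merged back into one token sequence.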
Do you think it is worth it? It would not be much work.
Thanks. I just wanted to ask whether I had missed anything. Meanwhile, I have implemented the merging myself (so far without multiword tokens). So for me (i.e. for Udapi) it is not worth a new release of UDPipe, and you can close this issue. Let's wait and see whether other users request this feature.
A comment in the POD says "tokenizing on request", but I was not able to find out how to request the tokenization in the run_udpipe.pl example. Only three input formats are currently available (vertical, horizontal and conllu) and all of them need already tokenized text (even the horizontal one, according to its specification). I've deleted the comment and added an example of tokenization without segmentation in #4.
You can request tokenization in run_udpipe.pl by using the tokenizer input format (unfortunately, that is probably not documented anywhere, because I did not get to finish the documentation).
I am not going to add no_segmentation.pl, because it is a Perl-only solution to a general UDPipe issue. I want all the functionality to be available "everywhere", so if we want this, it should be accessible from all of the following:
One possibility is to allow this via 'tokenizer options', as the API accepts arbitrary options for the tokenizer, the tagger and the parser (for example, beam_search for the parser). That way the functionality would be accessible everywhere.
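An arbitrary-options mechanism like the one described above is typically a small string-to-dictionary parser. The sketch below is an illustrative assumption: the "name=value pairs separated by semicolons, bare names allowed" syntax is made up for this example and is not UDPipe's documented options format.

```python
# Hypothetical parser for an arbitrary-options string, in the spirit of the
# tokenizer/tagger/parser options described above. The "key=value;key"
# syntax is an ASSUMPTION for illustration, not UDPipe's actual format.

def parse_options(options):
    """Parse 'name=value' pairs separated by ';'; a bare name maps to ''."""
    result = {}
    for item in options.split(";"):
        if not item:
            continue  # tolerate empty segments such as a trailing ';'
        name, _, value = item.partition("=")
        result[name] = value
    return result
```

With this scheme, a caller could write `parse_options("beam_search=10;some_flag")` and get `{"beam_search": "10", "some_flag": ""}`, so each component can look up only the options it understands and ignore the rest.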
Implemented in 8c4cc229; the documentation will hopefully arrive soon.
Is it possible to run just the tokenizer without the segmenter? Of course, if the text gets divided into more segments I can merge them (calling addWord() on the first Ufal::UDPipe::Sentence segment to add the words from the other segments), but it is extra work, especially if I also want to handle multiword tokens.
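The client-side merge described above (appending the words of later segments to the first one, while shifting multiword-token ranges by the number of words already merged) can be sketched like this. Plain dicts and tuples stand in for UDPipe's Word and multiword-token objects; this is illustrative, not the Ufal::UDPipe API.

```python
# Sketch of merging segmenter output back into a single sentence.
# Each segment is a dict: {'words': [...], 'multiword_tokens': [(start, end, form), ...]}
# with 1-based word indices, as in CoNLL-U. These structures are toy
# stand-ins for the real UDPipe Sentence/Word objects.

def merge_segments(segments):
    """Concatenate all segments' words; re-index multiword-token ranges."""
    merged = {"words": [], "multiword_tokens": []}
    for seg in segments:
        offset = len(merged["words"])  # how far this segment's indices shift
        merged["words"].extend(seg["words"])
        for start, end, form in seg.get("multiword_tokens", []):
            merged["multiword_tokens"].append((start + offset, end + offset, form))
    return merged
```

For instance, merging a segment containing the Spanish multiword token "vámonos" (split into "vamos" + "nos", range (1, 2)) after a two-word segment shifts that range to (3, 4) in the merged sentence.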