How to prevent sentence detection during tokenization?

ufal / udpipe

UDPipe: Trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files

Mozilla Public License 2.0


How to prevent sentence detection during tokenization? #30

Closed: rasoolims closed this issue 7 years ago

rasoolims commented 7 years ago

Hi, I was not able to find any option to turn off sentence segmentation during tokenization. My data must contain the same number of sentences after tokenization as before, but UDPipe splits some of the sentences. Thanks

martinpopel commented 7 years ago

In the new version (to be released very soon), there should be a special input format for sentence-segmented but untokenized input (one sentence per line), I believe.

In the current version, v1.0.0, you have to use a workaround: merge the extra-split sentences back together (or maybe use input_format_presegmented_tokenizer, but I am not sure how to use it).
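A minimal sketch of the merge workaround, written in Python for illustration. It assumes the original text has one sentence per line and that the tokenizer output is already available as a list of sentences, each a list of surface forms; the function name merge_to_original and this data layout are hypothetical and not part of UDPipe. Sentences are re-joined by comparing whitespace-stripped text, so the sketch also assumes the tokenizer does not alter characters.

```python
def strip_ws(text):
    """Remove all whitespace so tokens can be compared against raw text."""
    return "".join(text.split())


def merge_to_original(original_lines, tokenized_sentences):
    """Merge tokenizer sentences so that there is exactly one per original line.

    original_lines: list of str, one untokenized sentence per line.
    tokenized_sentences: list of token lists, possibly over-split.
    Returns a list of token lists with len(original_lines) entries.
    """
    merged = []
    pending = iter(tokenized_sentences)
    for line in original_lines:
        target = strip_ws(line)
        tokens = []
        covered = ""
        # Keep consuming tokenizer sentences until this line is fully covered.
        while covered != target:
            try:
                sentence = next(pending)
            except StopIteration:
                raise ValueError("tokenizer output ended before line: %r" % line)
            tokens.extend(sentence)
            covered += "".join(strip_ws(token) for token in sentence)
            if not target.startswith(covered):
                raise ValueError("tokenizer output does not align with line: %r" % line)
        merged.append(tokens)
    if next(pending, None) is not None:
        raise ValueError("tokenizer output has sentences left over")
    return merged


lines = ["Dr. Smith arrived. He sat down.", "Hello world!"]
over_split = [
    ["Dr."],
    ["Smith", "arrived", "."],
    ["He", "sat", "down", "."],
    ["Hello", "world", "!"],
]
for sentence in merge_to_original(lines, over_split):
    print(" ".join(sentence))
```

The alignment check raises an error instead of guessing when a tokenizer sentence spans two input lines, since that case cannot be repaired by merging alone.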

rasoolims commented 7 years ago

What should be the path for PERL_INCLUDE? I tried different paths to install Perl bindings but was not successful.

martinpopel commented 7 years ago

If you need the Perl binding, install it from CPAN, e.g. with cpanm Ufal::UDPipe. The workaround I referenced can easily be done in any of the other supported programming languages.

That said, I would suggest waiting for the new release. It will be published soon, and the new models for this release have already been published.

foxik commented 7 years ago

As @martinpopel said, a tokenizer respecting the given segmentation will be available in UDPipe 1.1.

As for PERL_INCLUDE, if you prefer to compile the bindings manually instead of using cpan, PERL_INCLUDE should point to a directory containing perl.h, as the Makefile suggests. On my machine it is at /usr/lib/x86_64-linux-gnu/perl/5.20/CORE.
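If it is unclear where perl.h lives on a given machine, a small search can list the candidate directories. The following is only a sketch: the helper name find_perl_include_dirs and the /usr/lib search root are assumptions for illustration, and the right root differs between systems.

```python
import os


def find_perl_include_dirs(root):
    """Return the sorted directories under root that directly contain perl.h."""
    return sorted(
        dirpath
        for dirpath, _dirnames, filenames in os.walk(root)
        if "perl.h" in filenames
    )


if __name__ == "__main__":
    # /usr/lib is an assumed search root; adjust it for your system.
    for candidate in find_perl_include_dirs("/usr/lib"):
        print(candidate)
```

Any directory it prints is a candidate value for PERL_INCLUDE.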

rasoolims commented 7 years ago

I found a hacky way to do this by changing this line to the following:

return input_format::new_presegmented_tokenizer(result);

foxik commented 7 years ago

This is what the new code does (but in a configurable way :-), see https://github.com/ufal/udpipe/blob/pre1.1/src/model/model_morphodita_parsito.cpp#L37 .

foxik commented 7 years ago

UDPipe 1.1 is out, with a presegmented tokenizer option that respects the input segmentation (documented at https://ufal.mff.cuni.cz/udpipe/users-manual#run_udpipe_tokenizer).
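A hedged sketch of how the option might be driven from a script, assuming a udpipe 1.1+ binary on PATH and that the presegmented option is passed as --tokenizer=presegmented as described in the manual linked above. The helper names and the model and input file names are hypothetical. The second helper counts sentences in CoNLL-U output so the result can be checked against the number of input lines.

```python
import subprocess


def presegmented_command(model_path, input_path, udpipe_bin="udpipe"):
    """Build the argument list for tokenizing one-sentence-per-line input."""
    return [udpipe_bin, "--tokenize", "--tokenizer=presegmented", model_path, input_path]


def tokenize_presegmented(model_path, input_path):
    """Run udpipe and return its CoNLL-U output (requires udpipe to be installed)."""
    completed = subprocess.run(
        presegmented_command(model_path, input_path),
        check=True,
        capture_output=True,
        text=True,
    )
    return completed.stdout


def count_conllu_sentences(conllu_text):
    """Count blank-line-separated blocks that contain at least one token line."""
    count = 0
    for block in conllu_text.split("\n\n"):
        lines = block.split("\n")
        if any(line and not line.startswith("#") for line in lines):
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical file names; replace them with a real model and input file.
    output = tokenize_presegmented("model.udpipe", "sentences.txt")
    print(count_conllu_sentences(output))
```

If the printed count equals the number of non-empty lines in the input file, the segmentation was preserved.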