ufal / udpipe

UDPipe: Trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files
Mozilla Public License 2.0
359 stars 75 forks

how to separate sentences #170

Closed hezaoke closed 1 year ago

hezaoke commented 1 year ago

Hi, when using UDPipe, even though I have put each sentence on its own line in a txt file, UDPipe automatically combines some of them into one sentence. I understand this may sometimes be desirable. Is there a way to avoid it?

foxik commented 1 year ago

Hi,

yes, sentences can contain single newlines (in many texts, line breaks are just formatting). However, an empty line always ends a paragraph (and therefore a sentence), so just use two consecutive newline characters between sentences.
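
As a rough illustration of the empty-line approach, a small preprocessing step could rewrite a one-sentence-per-line file so that every sentence is followed by an empty line. This is only a sketch: the helper name `separate_with_blank_lines` is made up for this example and is not part of UDPipe.

```python
def separate_with_blank_lines(text: str) -> str:
    """Put an empty line between the non-empty lines of `text`.

    Hypothetical helper (not a UDPipe function): each input line is
    assumed to hold exactly one sentence.
    """
    # Drop surrounding whitespace and skip lines that are already empty.
    sentences = [line.strip() for line in text.splitlines() if line.strip()]
    if not sentences:
        return ""
    # Two consecutive newlines produce the empty line that ends a paragraph.
    return "\n\n".join(sentences) + "\n"
```

The result can be written to a new file and passed to UDPipe in place of the original input.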

Also, if your input is already presegmented (divided into sentences), you can use the presegmented tokenizer option, and then each line will be exactly one sentence. If you only have paragraphs and want UDPipe to split them into sentences, the empty line mentioned above is the way to go.
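
For the presegmented case, a command line along the following lines should work. Treat it as a sketch under assumptions: the `udpipe` binary is assumed to be on the PATH, the flag spelling `--tokenizer=presegmented` follows my reading of the UDPipe manual, the `--tag` and `--parse` steps are optional, and the helper name and file paths are placeholders.

```python
def presegmented_command(model_path: str, input_path: str) -> list[str]:
    """Build an argument list that runs UDPipe with one sentence per line.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    return [
        "udpipe",                     # assumed to be installed and on PATH
        "--tokenize",                 # run the tokenizer on plain text
        "--tokenizer=presegmented",   # treat every input line as one sentence
        "--tag",                      # optional: tagging and lemmatization
        "--parse",                    # optional: dependency parsing
        model_path,
        input_path,
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)`, or the same arguments can be typed directly in a shell.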

Cheers!