sentence segmentation problem

ufal / udpipe

UDPipe: Trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files

Mozilla Public License 2.0


# sent_id = 1
# text = 我是一個
1	我	我	PRON	n,代名詞,人称,止格	Person=1|PronType=Prs	4	nsubj	_	SpaceAfter=No
2	是	是	PRON	n,代名詞,指示,*	PronType=Dem	4	nsubj	_	SpaceAfter=No
3	一	一	NUM	n,数詞,数字,*	_	4	nummod	_	SpaceAfter=No
4	個	個	NOUN	n,名詞,描写,形質	_	0	root	_	SpaceAfter=No

# sent_id = 2
# text = 好學生。
1	好	好	VERB	v,動詞,行為,態度	_	0	root	_	SpaceAfter=No
2	學	學	NOUN	n,名詞,行為,*	_	1	obj	_	SpaceAfter=No
3	生	生	VERB	v,動詞,変化,生物	_	1	obj	_	SpaceAfter=No
4	。	。	VERB	v,動詞,描写,態度	_	3	obj	_	SpaceAfter=No

Hi,

You are using lzh as the language, which is Classical Chinese (https://en.wikipedia.org/wiki/Classical_Chinese). For regular Chinese, you can use either zh_gsdsimp-ud for Simplified Chinese or zh_gsd-ud for Traditional Chinese.
Both zh_gsd-ud and zh_gsdsimp-ud treat the input as a single sentence.
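One way to check how many sentences a given model produced is to count the sentence blocks in its CoNLL-U output. The following is a minimal sketch, not part of UDPipe: `count_sentences` is a hypothetical helper, and it assumes the output follows the CoNLL-U convention of ending every sentence with a blank line. The sample is shaped like the lzh output in this issue, with the XPOS and FEATS columns shortened to `_`.

```python
def count_sentences(conllu: str) -> int:
    """Count sentence blocks in CoNLL-U text (blocks are separated by blank lines)."""
    count = 0
    in_sentence = False
    for raw in conllu.splitlines():
        line = raw.strip()
        if not line:
            in_sentence = False      # a blank line closes the current sentence
        elif line.startswith("#"):
            continue                 # comment lines such as "# text = ..."
        elif not in_sentence:
            count += 1               # first token line of a new sentence
            in_sentence = True
    return count


# Abbreviated sample shaped like the lzh output in this issue
# (XPOS and FEATS replaced by "_" to keep the example short).
sample = "\n".join([
    "# sent_id = 1",
    "# text = 我是一個",
    "1\t我\t我\tPRON\t_\t_\t4\tnsubj\t_\tSpaceAfter=No",
    "2\t是\t是\tPRON\t_\t_\t4\tnsubj\t_\tSpaceAfter=No",
    "3\t一\t一\tNUM\t_\t_\t4\tnummod\t_\tSpaceAfter=No",
    "4\t個\t個\tNOUN\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "",
    "# sent_id = 2",
    "# text = 好學生。",
    "1\t好\t好\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "2\t學\t學\tNOUN\t_\t_\t1\tobj\t_\tSpaceAfter=No",
    "3\t生\t生\tVERB\t_\t_\t1\tobj\t_\tSpaceAfter=No",
    "4\t。\t。\tVERB\t_\t_\t3\tobj\t_\tSpaceAfter=No",
    "",
])

print(count_sentences(sample))  # prints 2
```

Running the same count on the output of zh_gsd-ud or zh_gsdsimp-ud for this input should give 1 if the model keeps the text as a single sentence.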
The reason lzh splits the given text into two sentences is that:
- the tokenizer is purely data-driven and learns to split text into sentences only from the data in the UD treebank
- the lzh treebank https://github.com/UniversalDependencies/UD_Classical_Chinese-Kyoto contains many sentences that do not end with a full stop; the tokenizer therefore learns to split sentences in a similar way, which results in overly eager splits
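The treebank point above can be checked roughly by looking at the `# text = ...` comment lines of a CoNLL-U file and measuring how many sentences end with sentence-final punctuation. This is only a sketch under assumptions: `final_punct_ratio` is a hypothetical helper, the punctuation set in `SENTENCE_FINAL` is my own choice, and the tiny sample simply reuses the two sentences from this issue rather than real treebank data.

```python
# Assumed set of sentence-final punctuation marks (an illustrative choice).
SENTENCE_FINAL = "。！？.!?"


def final_punct_ratio(conllu: str) -> float:
    """Return the fraction of sentences whose '# text = ...' ends with final punctuation."""
    texts = [
        line.split("=", 1)[1].strip()
        for line in conllu.splitlines()
        if line.startswith("# text =")
    ]
    if not texts:
        return 0.0
    ended = sum(1 for t in texts if t and t[-1] in SENTENCE_FINAL)
    return ended / len(texts)


# Two sentences as segmented in this issue: only the second ends with "。".
sample_texts = "# text = 我是一個\n\n# text = 好學生。\n"

print(final_punct_ratio(sample_texts))  # prints 0.5
```

To apply it to the treebank itself, read one of the `.conllu` files from the UD_Classical_Chinese-Kyoto repository into a string and pass it to the function. A low ratio would be consistent with the tokenizer learning to split without waiting for a full stop.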

We plan to change the tokenizer algorithm considerably; however, lzh will probably still split sentences too eagerly, because that is how the UD treebank is segmented.


sentence segmentation problem #172
