ufal / udpipe

UDPipe: Trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files
Mozilla Public License 2.0

sentence segmentation problem #172

Closed wongtaksum closed 1 year ago

wongtaksum commented 1 year ago

I tried to parse "我是一個好學生。", but it gets split into two sentences. Did I do anything wrong?

def nlp(t):
    import urllib.request, urllib.parse, json
    with urllib.request.urlopen("https://lindat.mff.cuni.cz/services/udpipe/api/process?model=lzh&tokenizer&tagger&parser&data=" + urllib.parse.quote(t)) as r:
        return json.loads(r.read())["result"]

doc = nlp("我是一個好學生。")
print(doc)

# sent_id = 1
# text = 我是一個
1	我	我	PRON	n,代名詞,人称,止格	Person=1|PronType=Prs	4	nsubj	_	SpaceAfter=No
2	是	是	PRON	n,代名詞,指示,*	PronType=Dem	4	nsubj	_	SpaceAfter=No
3	一	一	NUM	n,数詞,数字,*	_	4	nummod	_	SpaceAfter=No
4	個	個	NOUN	n,名詞,描写,形質	_	0	root	_	SpaceAfter=No

# sent_id = 2
# text = 好學生。
1	好	好	VERB	v,動詞,行為,態度	_	0	root	_	SpaceAfter=No
2	學	學	NOUN	n,名詞,行為,*	_	1	obj	_	SpaceAfter=No
3	生	生	VERB	v,動詞,変化,生物	_	1	obj	_	SpaceAfter=No
4	。	。	VERB	v,動詞,描写,態度	_	3	obj	_	SpaceAfter=No
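For reference, the result field returned by the service is plain CoNLL-U, where sentences are separated by blank lines, so the split can be checked programmatically. A minimal sketch reusing the nlp function above; the count_sentences helper is hypothetical, not part of UDPipe:

def count_sentences(conllu_text):
    # CoNLL-U separates sentences with blank lines, so count non-empty blocks.
    return len([b for b in conllu_text.strip().split("\n\n") if b.strip()])

doc = nlp("我是一個好學生。")
print(count_sentences(doc))  # prints 2 with the current lzh model, as shown above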

foxik commented 1 year ago

Hi,

We plan to change the tokenizer algorithm considerably; however, the lzh model will probably still split text into sentences too eagerly, because that is what the underlying UD treebank does.
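A possible workaround until then: the UDPipe tokenizer has a presegmented option that treats each input line as one sentence, and the REST API describes the value of the tokenizer parameter as options passed to the tokenizer. Assuming that holds for the lindat service, a rough sketch (nlp_presegmented is a hypothetical name, not part of UDPipe) would be:

import urllib.request, urllib.parse, json

def nlp_presegmented(text, model="lzh"):
    # "presegmented" asks the tokenizer to treat each input line as one sentence,
    # so the model should not re-split it (assumption: the service forwards the
    # tokenizer parameter value as tokenizer options).
    query = urllib.parse.urlencode({
        "model": model,
        "tokenizer": "presegmented",
        "tagger": "",
        "parser": "",
        "data": text,
    })
    with urllib.request.urlopen(
            "https://lindat.mff.cuni.cz/services/udpipe/api/process?" + query) as r:
        return json.loads(r.read())["result"]

print(nlp_presegmented("我是一個好學生。"))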