wongtaksum closed this issue 1 year ago
Hi,

you are using `lzh` as the language, which is Classical Chinese (https://en.wikipedia.org/wiki/Classical_Chinese); for regular Chinese, you can use either `zh_gsdsimp-ud` for Simplified Chinese or `zh_gsd-ud` for Traditional Chinese.

Both `zh_gsd-ud` and `zh_gsdsimp-ud` consider the input to be a single sentence.

The reason `lzh` splits the given text into two sentences is that the `lzh` treebank (https://github.com/UniversalDependencies/UD_Classical_Chinese-Kyoto) contains many sentences not ending with a full stop; therefore, the tokenizer tries to use a similar approach to splitting sentences, which results in overly eager splits.

We plan to change the tokenizer algorithm considerably; however, `lzh` will probably still split sentences too eagerly, because that is what the UD treebank does.
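To try the suggested models, the request from this issue only needs a different `model` value. The following is a minimal sketch that reuses the same endpoint and query parameters as the original request; `build_url` and `parse` are hypothetical helper names introduced here for illustration, and the model names are the ones given in the reply above.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the request shown in this issue.
API = "https://lindat.mff.cuni.cz/services/udpipe/api/process"


def build_url(text, model):
    """Build the same query as in the issue, with a selectable model."""
    return (
        f"{API}?model={urllib.parse.quote(model)}"
        f"&tokenizer&tagger&parser&data={urllib.parse.quote(text)}"
    )


def parse(text, model="zh_gsd-ud"):
    """Send the text to the service and return the CoNLL-U result string."""
    with urllib.request.urlopen(build_url(text, model)) as response:
        return json.loads(response.read())["result"]
```

Calling `parse("我是一個好學生。", "zh_gsd-ud")` or `parse("我是一個好學生。", "zh_gsdsimp-ud")` sends the sentence to the modern Chinese models, while `parse("我是一個好學生。", "lzh")` reproduces the Classical Chinese request from the question. The calls need network access, so they are left out of the block itself.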
I tried to parse "我是一個好學生。" but it is split into two sentences. Did I do anything wrong?

    def nlp(t):
        import urllib.request, urllib.parse, json
        url = ("https://lindat.mff.cuni.cz/services/udpipe/api/process"
               "?model=lzh&tokenizer&tagger&parser&data=" + urllib.parse.quote(t))
        with urllib.request.urlopen(url) as r:
            return json.loads(r.read())["result"]

    doc = nlp("我是一個好學生。")
    print(doc)
Output:

    # sent_id = 1
    # text = 我是一個
    1	我	我	PRON	n,代名詞,人称,止格	Person=1|PronType=Prs	4	nsubj	_	SpaceAfter=No
    2	是	是	PRON	n,代名詞,指示,*	PronType=Dem	4	nsubj	_	SpaceAfter=No
    3	一	一	NUM	n,数詞,数字,*	_	4	nummod	_	SpaceAfter=No
    4	個	個	NOUN	n,名詞,描写,形質	_	0	root	_	SpaceAfter=No

    # sent_id = 2
    # text = 好學生。
    1	好	好	VERB	v,動詞,行為,態度	_	0	root	_	SpaceAfter=No
    2	學	學	NOUN	n,名詞,行為,*	_	1	obj	_	SpaceAfter=No
    3	生	生	VERB	v,動詞,変化,生物	_	1	obj	_	SpaceAfter=No
    4	。	。	VERB	v,動詞,描写,態度	_	3	obj	_	SpaceAfter=No
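A quick way to see where the tokenizer split the input is to read the `# text = ...` comment that CoNLL-U output carries for each sentence. The sketch below assumes standard CoNLL-U comment lines; `sentence_texts` is a hypothetical helper name, and `sample` is an abbreviated copy of the output above with one token line per sentence.

```python
def sentence_texts(conllu):
    """Return the value of every '# text = ...' comment, one per sentence."""
    prefix = "# text = "
    return [
        line[len(prefix):]
        for line in conllu.splitlines()
        if line.startswith(prefix)
    ]


# Abbreviated copy of the lzh output from this issue (first token only).
sample = "\n".join([
    "# sent_id = 1",
    "# text = 我是一個",
    "1\t我\t我\tPRON\tn,代名詞,人称,止格\tPerson=1|PronType=Prs\t4\tnsubj\t_\tSpaceAfter=No",
    "",
    "# sent_id = 2",
    "# text = 好學生。",
    "1\t好\t好\tVERB\tv,動詞,行為,態度\t_\t0\troot\t_\tSpaceAfter=No",
    "",
])

print(sentence_texts(sample))  # ['我是一個', '好學生。']
```

A result with more than one entry for a single input sentence indicates that the model's tokenizer split it.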