nltk / nltk_data

NLTK Data

Updated `[0]VP(eva` to `[0] VP(eva` in sinica_treebank, see nltk/nltk#2467 #168

tomaarsen closed this 2 years ago

tomaarsen commented 2 years ago

Hello!

Pull request overview

Details

See this snippet from the `parsed` file in sinica_treebank.zip:

#952:00952..[42846] NP(property:S‧的(head:S(agent:NP(Head:Nhaa:我們)|Head:VC2:教導|goal:NP(Head:Nab:孩子))|Head:DE:的)|quantifier:DM:第一項|Head:Nac:功課)#,(COMMACATEGORY)
#955:00955..[42849] S(time:Nv4:開始|agent:NP(Head:Nhaa:我們)|time:Dd:先|Head:VC2:教|goal:NP(Head:Nhaa:他們)|complement:VP(Head:VC2:穿|goal:NP(Head:Nab:鐵鞋)))#。(PERIODCATEGORY)
#959:00959..[42853] S(reason:Cbba:由於|theme:NP(property:NP‧的(head:NP(property:Nab:孩子|Head:Naea:們)|Head:DE:的)|Head:Nab:腿)|Head:VJ3:失去|range:NP(Head:Nac:作用))#,(COMMACATEGORY)
#960:00960..[42854] S(theme:NP(quantifier:DM:每一個|Head:Nac:動作)|target:PP(Head:P31:對|DUMMY:NP(Head:Nhaa:他們))|quantity:Dab:都|Head:V_11:是|range:VP(Head:VH11:困難|degree:Dfb:萬分))#。(PERIODCATEGORY)
#961:00961..[42855] VP(Head:VE2:看|goal:S(agent:NP(Head:Nhaa:他們)|manner:V‧地(head:VH11:吃力|Head:DE:的)|Head:VC2:抬|aspect:Di:著|goal:NP(property:V‧的(head:VH11:軟弱|Head:DE:的)|Head:Nab:腿)))#,(COMMACATEGORY)
#963:00963..[0]VP(evaluation:Dbb:仍然|deontics:Dbab:無法|theme:PP(Head:P07:把|DUMMY:Nab:腳)|Head:VC2:套進|goal:NP(property:Nab:鞋子|Head:Ncda:裡))#,(COMMACATEGORY)
#966:00966..[42858] S(theme:NP(Head:Nhaa:我們)|evaluation:Dbb:只有|Head:VJ1:耐|aspect:Di:著|range:NP(Head:Nad:性子)|complement:VP(DUMMY1:VP(DUMMY1:VP(Head:VC2:哄|aspect:Di:著)|Head:Caa:、|DUMMY2:VP(Head:VF2:勸|aspect:Di:著))|Head:Caa:、|DUMMY2:VP(Head:VF2:鼓勵|aspect:Di:著)))#,(COMMACATEGORY)
#967:00967..[42859] VP(time:Dd:常常|deontics:Dbab:要|Head:VC2:教|time:NP(quantifier:DM:好幾個|Head:Nac:星期)|complement:VP(time:Dd:才|deontics:Dbab:能|Head:VC2:學會))#。(PERIODCATEGORY)
#975:00975..[42867] S(agent:NP(Head:Nhaa:我們)|deontics:Dbab:要|Head:VC2:訓練|goal:NP(Head:Nhaa:他們)|complement:NP(property:S‧的(head:S(Head:S(theme:NP(Head:Nhaa:自己)|Head:VA11:跌倒)|Head:S(theme:NP(Head:Nhaa:自己)|Head:VA11:爬起來))|Head:DE:的)|Head:NP(DUMMY1:Nad:能力|Head:Caa:和|DUMMY2:Nad:觀念)))#。(PERIODCATEGORY)
#985:00985..[42876] S(agent:NP(Head:Nab:阿姨)|time:Nddc:現在|deontics:Dbab:可以|Head:VC2:拉|goal:NP(Head:Nhaa:你)|complement:VA11:起來)#,(COMMACATEGORY)
#986:00986..[42877] S(contrast:Cbca:但|time:DM:有一天|theme:NP(Head:Nhaa:你)|location:PP(Head:P21:在|DUMMY:NP(property:NP‧的(head:NP(negation:Dc:沒有|Head:Nab:人)|Head:DE:的)|Head:Nab:地方))|Head:VA11:跌倒|particle:Ta:了)#,(COMMACATEGORY)

(lines 5354-5353)

As you can see, the format of the middle line (#963) is slightly off: there is no space between `[0]` and `VP`. This causes an error when trying to load the sinica_treebank parsed corpus in NLTK:

from nltk.corpus import sinica_treebank

parsed_sents = sinica_treebank.parsed_sents()
full = list(parsed_sents)
print(len(full))
print(full[15])

This raises:

Traceback (most recent call last):
  File "[sic]\nltk_2467.py", line 4, in <module>
    full = list(parsed_sents)
  File "[sic]\nltk\corpus\reader\util.py", line 240, in __len__
    for tok in self.iterate_from(self._toknum[-1]):
  File "[sic]\nltk\corpus\reader\util.py", line 306, in iterate_from
    tokens = self.read_block(self._stream)
  File "[sic]\nltk\corpus\reader\api.py", line 513, in _read_parsed_sent_block
    return list(filter(None, [self._parse(t) for t in self._read_block(stream)]))      
  File "[sic]\nltk\corpus\reader\api.py", line 513, in <listcomp>
    return list(filter(None, [self._parse(t) for t in self._read_block(stream)]))      
  File "[sic]\nltk\corpus\reader\sinica_treebank.py", line 64, in _parse      
    return sinica_parse(sent)
  File "[sic]\nltk\tree.py", line 1710, in sinica_parse
    return Tree.fromstring(treebank_string, remove_empty_top_bracketing=True)
  File "[sic]\nltk\tree.py", line 693, in fromstring
    cls._parse_error(s, "end-of-string", open_b)
  File "[sic]\nltk\tree.py", line 735, in _parse_error
    raise ValueError(msg)
ValueError: Tree.read(): expected '(' but got 'end-of-string'
            at index 1.
                " "
                  ^

This has been described in nltk/nltk#2467 as well.
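
To check whether other records have the same problem, the file can be scanned for lines where the tree follows the bracketed index without a space. This is a minimal sketch: the record shape (`#<id>:<id>..[<index>] <tree>`) is inferred from the snippet above, the function name `find_missing_space` is hypothetical, and the sample lines are shortened versions of records #961 and #963.

```python
import re

# Assumed record prefix, inferred from the snippet: "#963:00963..[0]".
# The lookahead matches only when a non-space character follows the "]".
MISSING_SPACE = re.compile(r"^#\d+:\d+\.\.\[\d+\](?=\S)")


def find_missing_space(lines):
    """Return (line_number, line) pairs where no space follows the index."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if MISSING_SPACE.match(line)
    ]


# Shortened sample records (not the full corpus lines).
sample = [
    "#961:00961..[42855] VP(Head:VE2:看)#,(COMMACATEGORY)",
    "#963:00963..[0]VP(Head:VC2:套進)#,(COMMACATEGORY)",
]

for number, line in find_missing_space(sample):
    print(number, line)
```

Run against the lines of the extracted `parsed` file, this would list every record that needs the same correction.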

Changes

The fix is simply to add a space between `[0]` and `VP`. Upon doing so, the output becomes:

10000
(VP (V‧地 (VH11 大聲) (DE 的)) (VE2 叫) (Di 著))

Feel free to apply this fix manually, or use a diff tool to compare the two zips.
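
If applying the change by script is preferable, the correction can be sketched as a single regex substitution per line. This is only an illustration, not the method used to produce the updated zip: the helper name `add_missing_space` is hypothetical, the record shape is inferred from the snippet above, and reading and rewriting the file inside sinica_treebank.zip is left out.

```python
import re

# Assumed record prefix, inferred from the snippet: "#963:00963..[0]".
# Group 1 captures the prefix; the lookahead requires that no space follows.
BRACKET_WITHOUT_SPACE = re.compile(r"^(#\d+:\d+\.\.\[\d+\])(?=\S)")


def add_missing_space(line):
    """Insert one space after the bracketed index if it is missing."""
    return BRACKET_WITHOUT_SPACE.sub(r"\1 ", line, count=1)


# Shortened sample records (not the full corpus lines).
broken = "#963:00963..[0]VP(Head:VC2:套進)#,(COMMACATEGORY)"
intact = "#961:00961..[42855] VP(Head:VE2:看)#,(COMMACATEGORY)"

print(add_missing_space(broken))
print(add_missing_space(intact))
```

Well-formed lines pass through unchanged, so the helper can be applied to every line of the file.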

stevenbird commented 2 years ago

@tomaarsen – thanks