As you can see, the format is slightly off for the middle line: there is no space between [0] and VP.
This causes issues when trying to load the sinica_treebank parsed corpus in NLTK:
from nltk.corpus import sinica_treebank
parsed_sents = sinica_treebank.parsed_sents()
full = list(parsed_sents)
print(len(full))
print(full[15])
throws:
Traceback (most recent call last):
  File "[sic]\nltk_2467.py", line 4, in <module>
    full = list(parsed_sents)
  File "[sic]\nltk\corpus\reader\util.py", line 240, in __len__
    for tok in self.iterate_from(self._toknum[-1]):
  File "[sic]\nltk\corpus\reader\util.py", line 306, in iterate_from
    tokens = self.read_block(self._stream)
  File "[sic]\nltk\corpus\reader\api.py", line 513, in _read_parsed_sent_block
    return list(filter(None, [self._parse(t) for t in self._read_block(stream)]))
  File "[sic]\nltk\corpus\reader\api.py", line 513, in <listcomp>
    return list(filter(None, [self._parse(t) for t in self._read_block(stream)]))
  File "[sic]\nltk\corpus\reader\sinica_treebank.py", line 64, in _parse
    return sinica_parse(sent)
  File "[sic]\nltk\tree.py", line 1710, in sinica_parse
    return Tree.fromstring(treebank_string, remove_empty_top_bracketing=True)
  File "[sic]\nltk\tree.py", line 693, in fromstring
    cls._parse_error(s, "end-of-string", open_b)
  File "[sic]\nltk\tree.py", line 735, in _parse_error
    raise ValueError(msg)
ValueError: Tree.read(): expected '(' but got 'end-of-string'
            at index 1.
                " "
                  ^
This has been described in nltk/nltk#2467 as well.
Changes
The fix is simply to add a space between [0] and VP. Upon doing so, the output becomes:
10000
(VP (V‧地 (VH11 大聲) (DE 的)) (VE2 叫) (Di 著))
Feel free to apply this fix manually, or use a diff tool to compare the two zips.
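For anyone applying the fix by hand, here is a minimal sketch of how the offending line could be located and patched after extracting the parsed file from the zip. It assumes each sentence line begins with an identifier ending in a bracketed number such as [0], followed by a single space and then the tree. The names `MISSING_SPACE`, `fix_line` and `find_bad_lines` are hypothetical helpers, not part of NLTK, and the sample lines are invented for illustration rather than copied from the corpus.

```python
import re

# Assumption: a sentence line starts with an identifier that ends in "[N]",
# and the tree label should follow after one space. This pattern matches a
# leading identifier whose "[N]" is directly followed by a non-space character.
MISSING_SPACE = re.compile(r"^(\S*?\[\d+\])(?=[^\s\]])")


def fix_line(line: str) -> str:
    """Insert a space after the leading "[N]" marker if it is missing."""
    return MISSING_SPACE.sub(r"\1 ", line, count=1)


def find_bad_lines(lines):
    """Return 1-based numbers of lines whose "[N]" marker lacks a trailing space."""
    return [i for i, line in enumerate(lines, start=1) if MISSING_SPACE.match(line)]


if __name__ == "__main__":
    # Hypothetical sample lines, NOT taken from the actual corpus file.
    sample = [
        "#1:1.[0] NP(Head:Nab:example)#",
        "#2:1.[0]VP(Head:VE2:example)#",
        "#3:1.[0] S(Head:VC2:example)#",
    ]
    print(find_bad_lines(sample))  # [2]
    print(fix_line(sample[1]))     # #2:1.[0] VP(Head:VE2:example)#
```

Running `find_bad_lines` over the extracted file's lines should flag only the one malformed line if the assumption about the identifier format holds; well-formed lines pass through `fix_line` unchanged.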
Hello!
Pull request overview
parsed file in sinica_treebank.zip.
Details
See this snippet from parsed in sinica_treebank.zip (lines 5354-5353):