Closed shivanraptor closed 11 months ago
Confirming that I can reproduce the same error myself. The upstream CHILDES data must have been updated recently. I'll have to dig into these new annotation cases that my CHAT parser cannot handle and update the parser. Thank you for reporting this!
Hello! It looks like the upstream CHILDES and TalkBank data has been updated/fixed. I've just checked that, with the exception of "Paidologos Corpus: Cantonese" (as of this writing, accessing https://phonbank.talkbank.org returns an error), pycantonese can load and parse the datasets listed in the pycantonese documentation without crashing.
Because downloaded data is cached on your local drive by default, if you are still using the same machine or system as when you first created this issue, you may still have the previously downloaded, "faulty" copy of the Yip-Matthews corpus on disk. To force a re-download, use CHATReader.from_zip() instead of the convenience function read_chat(), which doesn't expose many arguments. CHATReader.from_zip() has a boolean use_cached argument (the default is True; you'd want to set it to False in this case):
import pycantonese
url = "https://childes.talkbank.org/data/Biling/YipMatthews.zip"
corpus = pycantonese.CHATReader.from_zip(url, use_cached=False)
After you've used CHATReader.from_zip() for a given URL once, you can switch back to read_chat() for the same URL to use the cached data and skip re-downloading, if you so choose.
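If you need to do this for more than one corpus, the cache bypass can be wrapped in a small helper. This is only a sketch: load_corpus, force_refresh and reader are hypothetical names made up for illustration and are not part of pycantonese. The only pycantonese call it relies on is CHATReader.from_zip(url, use_cached=...) as described above.

```python
def load_corpus(url, force_refresh=False, reader=None):
    """Load a CHAT corpus from a ZIP URL, optionally bypassing the local cache.

    `load_corpus`, `force_refresh` and `reader` are hypothetical names used
    for this sketch; they are not part of the pycantonese API.
    """
    if reader is None:
        # Imported lazily so the helper can be defined (and tested with a
        # stand-in reader) even where pycantonese is not installed.
        import pycantonese

        reader = pycantonese.CHATReader
    # use_cached=False forces a fresh download instead of reusing the copy
    # already on disk; use_cached=True is the library default.
    return reader.from_zip(url, use_cached=not force_refresh)


# Example call (commented out because it downloads data):
# url = "https://childes.talkbank.org/data/Biling/YipMatthews.zip"
# corpus = load_corpus(url, force_refresh=True)
```

Call it with force_refresh=True once to replace the stale copy on disk, then drop the flag (or go back to read_chat()) so later runs reuse the cache.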
Hope this helps! Closing this issue as resolved.
Describe the bug When I try to use the Yip-Matthews Bilingual Corpus, the following error occurs:
To reproduce
Expected behavior The corpus should load without error, just like the Child Heritage Chinese Corpus, Guthrie Bilingual Corpus, HKU-70 Corpus, Lee-Wong-Leung Corpus, Leo Corpus and Paidologos Corpus: Cantonese.
I checked all the links; only the Yip-Matthews Bilingual Corpus shows an error.
System (please complete the following information):
Additional context Running in JupyterHub