Closed: timotheecour closed this issue 6 months ago
Thank you @timotheecour for reporting the issue! I've just updated the package so that the duration marks in the CHAT transcript data are recognized. The new v0.19.1 release correctly handles the data you used:
In [1]: import pylangacq
In [2]: data = (
   ...:     "*S6: gimme that &=laughs:SUm xxx [# 0.4] .\n"
   ...:     "%mor: v|give~pro:obj|me pro:dem|that .\n"
   ...:     "%gra: 1|0|ROOT 2|1|OBJ 3|1|OBJ 4|1|PUNCT"
   ...: )
In [3]: reader = pylangacq.Reader.from_strs([data])
In [4]: reader.utterances()[0].tokens
Out[4]:
[Token(word='gimme', pos='v', mor='give', gra=Gra(dep=1, head=0, rel='ROOT')),
 Token(word='POSTCLITIC', pos='pro:obj', mor='me', gra=Gra(dep=2, head=1, rel='OBJ')),
 Token(word='that', pos='pro:dem', mor='that', gra=Gra(dep=3, head=1, rel='OBJ')),
 Token(word='.', pos='.', mor='', gra=Gra(dep=4, head=1, rel='PUNCT'))]
Describe the bug
pylangacq.read_chat fails for "/ca/MICASE/labs/lab500su044.cha" (see https://ca.talkbank.org/data-orig/MICASE/labs/lab500su044.cha)
To reproduce
note
https://github.com/jacksonllee/pylangacq/issues/18 seems related
The code should not abort entirely. Instead, it should parse what it can and mark invalid utterances as having an error (e.g., None or some error field on the utterance), so we can still get partial data.
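To make the request concrete, here is a minimal sketch of the tolerant behavior, assuming utterances can be parsed one at a time. Nothing here is pylangacq API: `ParsedUtterance`, `parse_tolerantly`, and `toy_parser` are hypothetical names, and the toy parser only stands in for a real CHAT parser that may raise on input it does not understand.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class ParsedUtterance:
    """Hypothetical result record; not part of pylangacq."""
    raw: str
    tokens: Optional[List[str]] = None  # None when parsing failed
    error: Optional[str] = None         # None when parsing succeeded


def parse_tolerantly(
    raw_utterances: Iterable[str],
    parse_one: Callable[[str], List[str]],
) -> List[ParsedUtterance]:
    """Parse each utterance independently and record failures instead of aborting."""
    results = []
    for raw in raw_utterances:
        try:
            results.append(ParsedUtterance(raw=raw, tokens=parse_one(raw)))
        except Exception as exc:
            # Keep the raw line and the reason so the caller can inspect it later.
            message = f"{type(exc).__name__}: {exc}"
            results.append(ParsedUtterance(raw=raw, error=message))
    return results


def toy_parser(raw: str) -> List[str]:
    """Stand-in parser (an assumption): rejects any bracketed annotation."""
    if "[" in raw:
        raise ValueError("unsupported annotation")
    return raw.split()


results = parse_tolerantly(
    ["*S6: gimme that .", "*S6: xxx [# 0.4] ."],
    toy_parser,
)
valid = [r for r in results if r.error is None]
```

With this shape, a caller still gets every utterance that parsed cleanly and can list the ones that failed, rather than losing the whole file to a single bad line.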