jacksonllee / pylangacq

Language Acquisition Research Tools
https://pylangacq.org
MIT License

handle duration mark in utterance cleaning #23

Closed timotheecour closed 6 months ago

timotheecour commented 6 months ago

Describe the bug: pylangacq.read_chat raises a ValueError ("cannot align the utterance and %mor tiers") for "/ca/MICASE/labs/lab500su044.cha" (see https://ca.talkbank.org/data-orig/MICASE/labs/lab500su044.cha)


To reproduce

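import pylangacq

# `paths.dir_talkbank_media` is not defined in this snippet; it presumably
# stands for the reporter's local root directory of the TalkBank data.
file_cha = f"{paths.dir_talkbank_media}/ca/MICASE/labs/lab500su044.cha"
reader = pylangacq.read_chat(file_cha)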

concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/concurrent/futures/process.py", line 205, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/concurrent/futures/process.py", line 205, in <listcomp>
    return [fn(*args) for args in chunk]
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/site-packages/pylangacq/chat.py", line 1455, in _parse_chat_str
    utterances = self._get_utterances(all_tiers)
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/site-packages/pylangacq/chat.py", line 1505, in _get_utterances
    raise ValueError(
ValueError: cannot align the utterance and %mor tiers:
Tiers --
{'S6': 'gimme that &=laughs:SUm xxx [# 0.4] .', '%mor': 'v|give~pro:obj|me pro:dem|that .', '%gra': '1|0|ROOT 2|1|OBJ 3|1|OBJ 4|1|PUNCT'}
Cleaned-up utterance --
gimme that [# 0.4 .
Parsed %mor tier --
['v|give', 'pro:obj|me', 'pro:dem|that', '.']
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

...
   reader = pylangacq.read_chat(file_cha)
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/site-packages/pylangacq/chat.py", line 197, in wrapper
    return func(*args, **kwargs)
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/site-packages/pylangacq/chat.py", line 1882, in read_chat
    return cls.from_files([path], match=match, exclude=exclude, encoding=encoding)
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/site-packages/pylangacq/chat.py", line 197, in wrapper
    return func(*args, **kwargs)
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/site-packages/pylangacq/chat.py", line 1034, in from_files
    return cls.from_strs(strs, paths, parallel=parallel)
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/site-packages/pylangacq/chat.py", line 197, in wrapper
    return func(*args, **kwargs)
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/site-packages/pylangacq/chat.py", line 995, in from_strs
    reader._parse_chat_strs(strs, ids, parallel)
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/site-packages/pylangacq/chat.py", line 264, in _parse_chat_strs
    self._files = collections.deque(
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/concurrent/futures/process.py", line 575, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
ValueError: cannot align the utterance and %mor tiers:
Tiers --
{'S6': 'gimme that &=laughs:SUm xxx [# 0.4] .', '%mor': 'v|give~pro:obj|me pro:dem|that .', '%gra': '1|0|ROOT 2|1|OBJ 3|1|OBJ 4|1|PUNCT'}
Cleaned-up utterance --
gimme that [# 0.4 .
Parsed %mor tier --
['v|give', 'pro:obj|me', 'pro:dem|that', '.']

note

https://github.com/jacksonllee/pylangacq/issues/18 seems related

The code should not abort entirely. Instead, it should parse what it can and mark invalid utterances as having an error (e.g., None or an error field on the utterance), so that partial data is still available.
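Until something like that exists, a user-side workaround might look like the sketch below. It is only an illustration under stated assumptions: `read_files_leniently` is a hypothetical helper, not part of pylangacq, and it works per file rather than per utterance, so it is coarser than what is requested above. The reading function is passed in as `read_one`, so the helper itself makes no assumptions about pylangacq internals.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def read_files_leniently(
    paths: Iterable[str],
    read_one: Callable[[str], Any],
) -> Tuple[Dict[str, Any], Dict[str, Exception]]:
    """Read each path separately and collect failures instead of aborting.

    Hypothetical helper (not a pylangacq API). `read_one` is any callable
    that takes a path and returns a parsed object, or raises ValueError.
    """
    readers: Dict[str, Any] = {}
    errors: Dict[str, Exception] = {}
    for path in paths:
        try:
            readers[path] = read_one(path)
        except ValueError as exc:
            # For example: "cannot align the utterance and %mor tiers".
            errors[path] = exc
    return readers, errors
```

With pylangacq installed, `read_one` would be `pylangacq.read_chat`: files that parse end up in `readers`, and files that raise the alignment error end up in `errors` along with the exception message.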

jacksonllee commented 6 months ago

Thank you @timotheecour for reporting the issue! I've just updated the package so that the duration marks in the CHAT transcript data are recognized. The new v0.19.1 release correctly handles the data you used:

In [1]: import pylangacq

In [2]: data = (
   ...:     "*S6: gimme that &=laughs:SUm xxx [# 0.4] .\n"
   ...:     "%mor: v|give~pro:obj|me pro:dem|that .\n"
   ...:     "%gra: 1|0|ROOT 2|1|OBJ 3|1|OBJ 4|1|PUNCT"
   ...: )

In [3]: reader = pylangacq.Reader.from_strs([data])

In [4]: reader.utterances()[0].tokens
Out[4]: 
[Token(word='gimme', pos='v', mor='give', gra=Gra(dep=1, head=0, rel='ROOT')),
 Token(word='POSTCLITIC', pos='pro:obj', mor='me', gra=Gra(dep=2, head=1, rel='OBJ')),
 Token(word='that', pos='pro:dem', mor='that', gra=Gra(dep=3, head=1, rel='OBJ')),
 Token(word='.', pos='.', mor='', gra=Gra(dep=4, head=1, rel='PUNCT'))]
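
For readers wondering why the duration mark mattered: in the original report the cleaned-up utterance still contained `[# 0.4`, which left one more item than the %mor tier and made alignment fail. The sketch below shows the general idea of dropping such marks before alignment. It is an assumption-laden illustration, not pylangacq's actual implementation: `strip_duration_marks` and the regular expression are hypothetical, and the pattern only covers the plain seconds form seen in this report.

```python
import re

# Hypothetical pattern for duration marks of the form "[# 0.4]".
# Only the plain "seconds" form from this report is covered.
_DURATION_MARK = re.compile(r"\[#\s*\d+(?:\.\d+)?\s*\]")


def strip_duration_marks(utterance: str) -> str:
    """Remove "[# N]" duration marks and normalize whitespace."""
    without_marks = _DURATION_MARK.sub(" ", utterance)
    # Collapse the extra spaces left behind by the removed mark.
    return " ".join(without_marks.split())


cleaned = strip_duration_marks("gimme that xxx [# 0.4] .")
print(cleaned)
```

Other annotations in the utterance (such as `&=laughs:SUm` or `xxx`) are handled by separate cleaning steps and are deliberately left out of this sketch.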