jacksonllee / pycantonese

Cantonese Linguistics and NLP
https://pycantonese.org
MIT License

Parsing Error occurred when using Yip-Matthews Bilingual Corpus #38

Closed shivanraptor closed 11 months ago

shivanraptor commented 1 year ago

Describe the bug

When I try to use the Yip-Matthews Bilingual Corpus, the following error occurs:

---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/opt/tljh/user/lib/python3.9/concurrent/futures/process.py", line 243, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/opt/tljh/user/lib/python3.9/concurrent/futures/process.py", line 202, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/opt/tljh/user/lib/python3.9/concurrent/futures/process.py", line 202, in <listcomp>
    return [fn(*args) for args in chunk]
  File "/home/jupyter-raptor/.local/lib/python3.9/site-packages/pylangacq/chat.py", line 1430, in _parse_chat_str
    utterances = self._get_utterances(all_tiers)
  File "/home/jupyter-raptor/.local/lib/python3.9/site-packages/pylangacq/chat.py", line 1449, in _get_utterances
    utterance_line = _clean_utterance(tiermarker_to_line[participant_code])
  File "/home/jupyter-raptor/.local/lib/python3.9/site-packages/pylangacq/_clean_utterance.py", line 195, in _clean_utterance
    utterance = _drop(utterance, "> [/]", "<", ">", "left")
  File "/home/jupyter-raptor/.local/lib/python3.9/site-packages/pylangacq/_clean_utterance.py", line 118, in _drop
    paren_i = _find_paren(
  File "/home/jupyter-raptor/.local/lib/python3.9/site-packages/pylangacq/_clean_utterance.py", line 112, in _find_paren
    raise ValueError(f"no matching paren: {s}, {target}, {opposite}, {direction}")
ValueError: no matching paren: see my babe [/] babe , <, >, left
"""

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
Cell In[3], line 7
      5 url = "https://childes.talkbank.org/data/Biling/CHCC.zip"
      6 url = "https://childes.talkbank.org/data/Biling/YipMatthews.zip"
----> 7 corpus = pycantonese.read_chat(url)

File ~/.local/lib/python3.9/site-packages/pylangacq/chat.py:187, in _params_in_docstring.<locals>.real_decorator.<locals>.wrapper(*args, **kwargs)
    185 @functools.wraps(func)
    186 def wrapper(*args, **kwargs):
--> 187     return func(*args, **kwargs)

File ~/.local/lib/python3.9/site-packages/pycantonese/corpus.py:423, in read_chat(path, match, exclude, encoding)
    402 @_params_in_docstring("match", "exclude", "encoding", class_method=False)
    403 def read_chat(
    404     path: str, match: str = None, exclude: str = None, encoding: str = _ENCODING
    405 ) -> CHATReader:
    406     """Read Cantonese CHAT data files.
    407 
    408     Parameters
   (...)
    421     :class:`~pycantonese.CHATReader`
    422     """
--> 423     return pylangacq_read_chat(
    424         path, match=match, exclude=exclude, encoding=encoding, cls=CHATReader
    425     )

File ~/.local/lib/python3.9/site-packages/pylangacq/chat.py:187, in _params_in_docstring.<locals>.real_decorator.<locals>.wrapper(*args, **kwargs)
    185 @functools.wraps(func)
    186 def wrapper(*args, **kwargs):
--> 187     return func(*args, **kwargs)

File ~/.local/lib/python3.9/site-packages/pylangacq/chat.py:1846, in read_chat(path, match, exclude, encoding, cls)
   1844 path_lower = path.lower()
   1845 if path_lower.endswith(".zip"):
-> 1846     return cls.from_zip(path, match=match, exclude=exclude, encoding=encoding)
   1847 elif os.path.isdir(path):
   1848     return cls.from_dir(path, match=match, exclude=exclude, encoding=encoding)

File ~/.local/lib/python3.9/site-packages/pylangacq/chat.py:187, in _params_in_docstring.<locals>.real_decorator.<locals>.wrapper(*args, **kwargs)
    185 @functools.wraps(func)
    186 def wrapper(*args, **kwargs):
--> 187     return func(*args, **kwargs)

File ~/.local/lib/python3.9/site-packages/pylangacq/chat.py:1127, in Reader.from_zip(cls, path, match, exclude, extension, encoding, parallel, use_cached, session)
   1124         with zipfile.ZipFile(zip_path) as zfile:
   1125             zfile.extractall(unzip_dir)
-> 1127     reader = cls.from_dir(
   1128         unzip_dir,
   1129         match=match,
   1130         exclude=exclude,
   1131         extension=extension,
   1132         encoding=encoding,
   1133         parallel=parallel,
   1134     )
   1136 # Unzipped files from `.from_zip` have the unwieldy temp dir in the file path.
   1137 for f in reader._files:

File ~/.local/lib/python3.9/site-packages/pylangacq/chat.py:187, in _params_in_docstring.<locals>.real_decorator.<locals>.wrapper(*args, **kwargs)
    185 @functools.wraps(func)
    186 def wrapper(*args, **kwargs):
--> 187     return func(*args, **kwargs)

File ~/.local/lib/python3.9/site-packages/pylangacq/chat.py:1057, in Reader.from_dir(cls, path, match, exclude, extension, encoding, parallel)
   1055             continue
   1056         file_paths.append(os.path.join(dirpath, filename))
-> 1057 return cls.from_files(
   1058     sorted(file_paths),
   1059     match=match,
   1060     exclude=exclude,
   1061     encoding=encoding,
   1062     parallel=parallel,
   1063 )

File ~/.local/lib/python3.9/site-packages/pylangacq/chat.py:187, in _params_in_docstring.<locals>.real_decorator.<locals>.wrapper(*args, **kwargs)
    185 @functools.wraps(func)
    186 def wrapper(*args, **kwargs):
--> 187     return func(*args, **kwargs)

File ~/.local/lib/python3.9/site-packages/pylangacq/chat.py:1009, in Reader.from_files(cls, paths, match, exclude, encoding, parallel)
   1006 else:
   1007     strs = [_open_file(p) for p in paths]
-> 1009 return cls.from_strs(strs, paths, parallel=parallel)

File ~/.local/lib/python3.9/site-packages/pylangacq/chat.py:187, in _params_in_docstring.<locals>.real_decorator.<locals>.wrapper(*args, **kwargs)
    185 @functools.wraps(func)
    186 def wrapper(*args, **kwargs):
--> 187     return func(*args, **kwargs)

File ~/.local/lib/python3.9/site-packages/pylangacq/chat.py:970, in Reader.from_strs(cls, strs, ids, parallel)
    966     raise ValueError(
    967         f"strs and ids must have the same size: {len(strs)} and {len(ids)}"
    968     )
    969 reader = cls()
--> 970 reader._parse_chat_strs(strs, ids, parallel)
    971 return reader

File ~/.local/lib/python3.9/site-packages/pylangacq/chat.py:254, in Reader._parse_chat_strs(self, strs, file_paths, parallel)
    252 if parallel:
    253     with cf.ProcessPoolExecutor() as executor:
--> 254         self._files = collections.deque(
    255             executor.map(self._parse_chat_str, strs, file_paths)
    256         )
    257 else:
    258     self._files = collections.deque(
    259         self._parse_chat_str(s, f) for s, f in zip(strs, file_paths)
    260     )

File /opt/tljh/user/lib/python3.9/concurrent/futures/process.py:559, in _chain_from_iterable_of_lists(iterable)
    553 def _chain_from_iterable_of_lists(iterable):
    554     """
    555     Specialized implementation of itertools.chain.from_iterable.
    556     Each item in *iterable* should be a list.  This function is
    557     careful not to keep references to yielded objects.
    558     """
--> 559     for element in iterable:
    560         element.reverse()
    561         while element:

File /opt/tljh/user/lib/python3.9/concurrent/futures/_base.py:608, in Executor.map.<locals>.result_iterator()
    605 while fs:
    606     # Careful not to keep a reference to the popped future
    607     if timeout is None:
--> 608         yield fs.pop().result()
    609     else:
    610         yield fs.pop().result(end_time - time.monotonic())

File /opt/tljh/user/lib/python3.9/concurrent/futures/_base.py:438, in Future.result(self, timeout)
    436     raise CancelledError()
    437 elif self._state == FINISHED:
--> 438     return self.__get_result()
    440 self._condition.wait(timeout)
    442 if self._state in [CANCELLED, CANCELLED_AND_NOTIFIED]:

File /opt/tljh/user/lib/python3.9/concurrent/futures/_base.py:390, in Future.__get_result(self)
    388 if self._exception:
    389     try:
--> 390         raise self._exception
    391     finally:
    392         # Break a reference cycle with the exception in self._exception
    393         self = None

ValueError: no matching paren: see my babe [/] babe , <, >, left

To reproduce

  1. Run the following code:
    import pycantonese
    url = "https://childes.talkbank.org/data/Biling/YipMatthews.zip"
    corpus = pycantonese.read_chat(url)
  2. The above error appears.

Expected behavior

The corpus should load without error, just like the Child Heritage Chinese Corpus, Guthrie Bilingual Corpus, HKU-70 Corpus, Lee-Wong-Leung Corpus, Leo Corpus, and Paidologos Corpus: Cantonese.

I checked all the links; only the Yip-Matthews Bilingual Corpus raises an error.


Additional context

Running in JupyterHub.

jacksonllee commented 1 year ago

Confirming that I can reproduce the same error myself. The upstream CHILDES data must have been updated recently. I'll have to dig into these new annotation cases that my CHAT parser cannot handle and update the parser. Thank you for reporting this!
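For readers who want to locate suspicious lines in a downloaded transcript before parsing, here is a minimal sketch of a pre-check for unbalanced angle brackets. In CHAT, angle brackets mark the span that a following code such as [/] applies to. This is not pylangacq's algorithm: the function name `unmatched_angle_brackets` and the sample utterances are made up for illustration, and the exact corpus line that triggered the error is not shown in this thread.

```python
from typing import List


def unmatched_angle_brackets(utterance: str) -> List[int]:
    """Return the indices of '<' or '>' characters that have no partner.

    Hypothetical helper for illustration only; it is not part of
    pylangacq or pycantonese.
    """
    open_positions: List[int] = []  # indices of '<' still waiting for a '>'
    unmatched: List[int] = []       # indices of '>' seen with no open '<'
    for i, ch in enumerate(utterance):
        if ch == "<":
            open_positions.append(i)
        elif ch == ">":
            if open_positions:
                open_positions.pop()
            else:
                unmatched.append(i)
    # Any '<' left on the stack never found its closing '>'.
    return sorted(unmatched + open_positions)


# Made-up sample utterances, not taken from the corpus.
balanced = "<see my babe> [/] see my babe"
broken = "see my babe> [/] babe"

print(unmatched_angle_brackets(balanced))
print(unmatched_angle_brackets(broken))
```

Note that CHAT also uses these characters inside codes such as [<] and [>], which this sketch does not special-case, so expect false positives on transcripts that contain overlap markers.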

jacksonllee commented 11 months ago

Hello! It looks like the upstream CHILDES and TalkBank data has been updated and fixed. I've just checked that pycantonese can load and parse the datasets listed in the pycantonese documentation without crashing. The one exception is "Paidologos Corpus: Cantonese", because, as of this writing, accessing https://phonbank.talkbank.org returns an error.

Because downloaded data is cached on your local drive by default, you may still have the previously downloaded, "faulty" copy of the Yip-Matthews corpus on disk if you are using the same machine as when you first created this issue. The convenience function read_chat() doesn't expose many arguments, so to force a re-download, use CHATReader.from_zip() instead. It has a boolean use_cached argument, which defaults to True; set it to False in this case:

import pycantonese
url = "https://childes.talkbank.org/data/Biling/YipMatthews.zip"
corpus = pycantonese.CHATReader.from_zip(url, use_cached=False)

After you've used CHATReader.from_zip() for a given URL once, you can switch back to read_chat() for the same URL to use the cached data and skip re-downloading if you so choose.
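The cached-first, refresh-on-failure workflow described above could be wrapped in a small helper. This is only a sketch: `load_with_refresh` and `load_yip_matthews` are hypothetical names, not part of pycantonese, and it assumes that a stale cached copy surfaces as a ValueError, as in the traceback in this issue.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def load_with_refresh(
    url: str,
    read_cached: Callable[[str], T],
    read_fresh: Callable[[str], T],
) -> T:
    """Try the cached copy first; on a parsing ValueError, re-download once.

    Hypothetical helper for illustration; not a pycantonese function.
    """
    try:
        return read_cached(url)
    except ValueError:
        # The cached copy may be the stale, unparsable one, so fetch it again.
        return read_fresh(url)


def load_yip_matthews():
    """Load the Yip-Matthews corpus, refreshing the cache only if needed."""
    # Imported lazily so load_with_refresh has no hard dependency.
    import pycantonese

    url = "https://childes.talkbank.org/data/Biling/YipMatthews.zip"
    return load_with_refresh(
        url,
        pycantonese.read_chat,
        lambda u: pycantonese.CHATReader.from_zip(u, use_cached=False),
    )
```

Catching ValueError this broadly can hide unrelated problems, so if the fresh download fails too, the second exception is left to propagate.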

Hope this helps! Closing this issue as resolved.