jacksonllee / pycantonese

Cantonese Linguistics and NLP
https://pycantonese.org
MIT License

Corpus not loading in Pyodide #52

Open taviowong opened 2 months ago

Describe the bug Pyodide is a tool for running Python packages in the browser. pycantonese currently cannot run in Pyodide because it loads its corpus data with multi-threading, and starting a new thread fails in that environment.

To reproduce

  1. Go to the online REPL at https://pyodide.org/en/stable/console.html
  2. Run the following script
    >>> import micropip
    >>> await micropip.install('setuptools')
    >>> await micropip.install('pycantonese')
    >>> import pycantonese
    >>> pycantonese.segment('但願人長久,千裡共嬋娟')
  3. An error is raised: "RuntimeError: can't start new thread". The full stack trace follows.
    Traceback (most recent call last):
    File "<console>", line 1, in <module>
    File "/lib/python3.12/site-packages/pycantonese/parsing.py", line 170, in parse_text
    _get_utterance(sent, segment_kwargs, pos_tag_kwargs, participant)
    File "/lib/python3.12/site-packages/pycantonese/parsing.py", line 56, in _get_utterance
    words, tags, jps = _parse_text(unparsed_sent, segment_kwargs, pos_tag_kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/lib/python3.12/site-packages/pycantonese/parsing.py", line 27, in _parse_text
    chars_jps = characters_to_jyutping(text, **(segment_kwargs or {}))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/lib/python3.12/site-packages/pycantonese/jyutping/characters.py", line 101, in characters_to_jyutping
    words_to_jyutping, chars_to_jyutping = _get_words_characters_to_jyutping()
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/lib/python3.12/site-packages/pycantonese/jyutping/characters.py", line 14, in _get_words_characters_to_jyutping
    corpus = hkcancor()
             ^^^^^^^^^^
    File "/lib/python3.12/site-packages/pycantonese/corpus.py", line 396, in hkcancor
    reader = _HKCanCorReader.from_dir(data_dir)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/lib/python3.12/site-packages/pylangacq/chat.py", line 187, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
    File "/lib/python3.12/site-packages/pylangacq/chat.py", line 1057, in from_dir
    return cls.from_files(
           ^^^^^^^^^^^^^^^
    File "/lib/python3.12/site-packages/pylangacq/chat.py", line 187, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
    File "/lib/python3.12/site-packages/pylangacq/chat.py", line 1005, in from_files
    strs = list(executor.map(_open_file, paths))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/lib/python312.zip/concurrent/futures/_base.py", line 608, in map
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
          ^^^^^^^^^^^^^^^^^^^^^^
    File "/lib/python312.zip/concurrent/futures/thread.py", line 179, in submit
    self._adjust_thread_count()
    File "/lib/python312.zip/concurrent/futures/thread.py", line 202, in _adjust_thread_count
    t.start()
    File "/lib/python312.zip/threading.py", line 992, in start
    _start_new_thread(self._bootstrap, ())
    RuntimeError: can't start new thread

Expected behavior The sentence can be segmented without error: ['但願', '人', '長久', ',', '千', '裡', '共', '嬋娟']
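
Possible workaround (sketch only) The traceback ends in `executor.map(_open_file, paths)`, where the thread pool fails to start a worker thread. One way a loader could handle this is to fall back to reading the files one by one when the pool cannot start. The code below is a minimal sketch of that idea, not pylangacq's actual code or a confirmed fix: `read_files`, `NoThreadsExecutor`, and the `_open_file` stand-in are hypothetical names used only for illustration.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def _open_file(path):
    """Stand-in reader (not the library's own): return one file's text."""
    with open(path, encoding="utf-8") as f:
        return f.read()


def read_files(paths, executor_factory=ThreadPoolExecutor):
    """Hypothetical helper: read files in a thread pool, else sequentially."""
    paths = list(paths)
    try:
        with executor_factory() as executor:
            return list(executor.map(_open_file, paths))
    except RuntimeError:
        # Raised as "can't start new thread" where threads are unavailable.
        # Caveat: this also catches a RuntimeError raised by the reader itself.
        return [_open_file(path) for path in paths]


class NoThreadsExecutor:
    """Stand-in that mimics an environment where no thread can be started."""

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False

    def map(self, fn, *iterables):
        raise RuntimeError("can't start new thread")


# Demo: both code paths should return the same file contents, in order.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, text in enumerate(["但願人長久", "千裡共嬋娟"]):
        path = Path(tmp) / f"sample{i}.txt"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    threaded = read_files(paths)
    sequential = read_files(paths, executor_factory=NoThreadsExecutor)
```

The fallback keeps the threaded path unchanged on regular CPython and only switches to sequential reads when the pool raises, so existing behaviour would not change outside environments like Pyodide.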
