medspacy / PyRuSH


Failed to load default rush_rules.tsv on Windows 10 (Traditional Chinese version) #2

Open ivantyj opened 1 year ago

ivantyj commented 1 year ago

The program throws an exception when medspacy.load() is called with the default config.

(test_cxr2) λ python
Python 3.8.2 (tags/v3.8.2:7b3ab59, Feb 25 2020, 23:03:10) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import medspacy
>>> nlp = medspacy.load()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\ivantsai\.virtualenvs\test_cxr2\lib\site-packages\medspacy\util.py", line 100, in load
    nlp.add_pipe("medspacy_pyrush", config={"rules_path": pyrush_path})
  File "C:\Users\ivantsai\.virtualenvs\test_cxr2\lib\site-packages\spacy\language.py", line 801, in add_pipe
    pipe_component = self.create_pipe(
  File "C:\Users\ivantsai\.virtualenvs\test_cxr2\lib\site-packages\spacy\language.py", line 680, in create_pipe
    resolved = registry.resolve(cfg, validate=validate)
  File "C:\Users\ivantsai\.virtualenvs\test_cxr2\lib\site-packages\confection\__init__.py", line 728, in resolve
    resolved, _ = cls._make(
  File "C:\Users\ivantsai\.virtualenvs\test_cxr2\lib\site-packages\confection\__init__.py", line 777, in _make
    filled, _, resolved = cls._fill(
  File "C:\Users\ivantsai\.virtualenvs\test_cxr2\lib\site-packages\confection\__init__.py", line 849, in _fill
    getter_result = getter(*args, **kwargs)
  File "C:\Users\ivantsai\.virtualenvs\test_cxr2\lib\site-packages\PyRuSH\PyRuSHSentencizer.py", line 45, in __init__
    self.rush = RuSH(rules=rules_path, max_repeat=max_repeat, auto_fix_gaps=auto_fix_gaps)
  File "C:\Users\ivantsai\.virtualenvs\test_cxr2\lib\site-packages\PyRuSH\RuSH.py", line 84, in __init__
    self.fastner = FastCNER(rules, max_repeat)
  File "C:\Users\ivantsai\.virtualenvs\test_cxr2\lib\site-packages\PyFastNER\FastCNER.py", line 84, in __init__
    self.initiate(rules)
  File "C:\Users\ivantsai\.virtualenvs\test_cxr2\lib\site-packages\PyFastNER\FastCNER.py", line 96, in initiate
    io_utils = IOUtils(rule_str)
  File "C:\Users\ivantsai\.virtualenvs\test_cxr2\lib\site-packages\PyFastNER\IOUtils.py", line 30, in __init__
    self.read(rules, '\t')
  File "C:\Users\ivantsai\.virtualenvs\test_cxr2\lib\site-packages\PyFastNER\IOUtils.py", line 47, in read
    self.parse(csvfile, delimiter)
  File "C:\Users\ivantsai\.virtualenvs\test_cxr2\lib\site-packages\PyFastNER\IOUtils.py", line 55, in parse
    self.parse_iterator(spamreader)
  File "C:\Users\ivantsai\.virtualenvs\test_cxr2\lib\site-packages\PyFastNER\IOUtils.py", line 60, in parse_iterator
    for row in iterator:
UnicodeDecodeError: 'cp950' codec can't decode byte 0xe2 in position 4290: illegal multibyte sequence

It appears that the default codec, cp950, cannot decode the default rush_rules.tsv shipped with PyRuSH. As a workaround, I manually removed the special characters and replaced the default rush_rules.tsv. I still hope the devs can fix this.

The modified rush_rules.tsv is attached for anyone who hits the same problem. Replace the one located in your site-packages/resources with the file in rush_rules.zip.
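
For anyone trying to reproduce the decode failure without a Traditional Chinese Windows install, here is a minimal sketch. Byte 0xe2 is the lead byte of many three-byte UTF-8 sequences (typographic punctuation, for example), so the traceback is consistent with a UTF-8 file being read as cp950. The actual character at position 4290 of rush_rules.tsv is not identified in this thread; the right single quotation mark below is an assumed stand-in, and `can_decode` is a hypothetical helper, not part of PyRuSH or PyFastNER.

```python
# Assumed stand-in for the offending character: U+2019 (right single
# quotation mark), which UTF-8 encodes as the three bytes e2 80 99.
sample = "it\u2019s\t1\n".encode("utf-8")


def can_decode(data: bytes, codec: str) -> bool:
    """Return True if `data` decodes cleanly with `codec` (hypothetical helper)."""
    try:
        data.decode(codec)
        return True
    except UnicodeDecodeError:
        return False


# The same bytes are valid UTF-8 but not a valid cp950 (Big5) sequence.
print("utf-8:", can_decode(sample, "utf-8"))
print("cp950:", can_decode(sample, "cp950"))
```

Python ships the cp950 codec on every platform, so this should run the same way on Linux or macOS.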

jianlins commented 1 year ago

The rule file is UTF-8 encoded. It might be worth having medspacy specify UTF-8 explicitly instead of relying on the local system default. @turbosheep Thoughts?
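
A rough sketch of what "explicitly UTF-8" could look like when reading a tab-separated rule file. This is not the actual PyFastNER `IOUtils` code; `read_rules` is a hypothetical function, and the demo file stands in for rush_rules.tsv. The only point is the `encoding="utf-8"` argument to `open()`, which otherwise falls back to the locale's preferred encoding (cp950 on the reporter's system).

```python
import csv
import os
import tempfile


def read_rules(path, delimiter="\t", encoding="utf-8"):
    """Read a delimited rule file using an explicit encoding (hypothetical helper)."""
    # newline="" is the csv module's recommended setting for files given to csv.reader.
    with open(path, newline="", encoding=encoding) as csvfile:
        return [row for row in csv.reader(csvfile, delimiter=delimiter)]


# Demo: write a small UTF-8 file containing a non-ASCII character,
# then read it back without depending on the system locale.
fd, demo_path = tempfile.mkstemp(suffix=".tsv")
with os.fdopen(fd, "w", encoding="utf-8", newline="") as f:
    f.write("it\u2019s\t1\n")

rows = read_rules(demo_path)
os.remove(demo_path)
print(rows)
```

Until something like this lands, starting Python in UTF-8 mode (`python -X utf8` or the `PYTHONUTF8=1` environment variable, available since Python 3.7) may also avoid the error, because it makes `open()` default to UTF-8. That has not been verified against medspacy in this thread.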