hplt-project / sacremoses

Python port of Moses tokenizer, truecaser and normalizer
MIT License

Regular expression error of penn_tokenize #151

Open speedcell4 opened 3 months ago

speedcell4 commented 3 months ago
from sacremoses import MosesTokenizer

print(MosesTokenizer(lang='en').penn_tokenize("-LRB- This is very nice -RRB-"))

I got the following error, and changing lang='en' to lang='zh' doesn't solve the problem.

Traceback (most recent call last):
  File ".../scratches/test.py", line 3, in <module>
    print(MosesTokenizer(lang='en').penn_tokenize("-LRB- This is very nice -RRB-"))
  File ".../python3.9/site-packages/sacremoses/tokenize.py", line 423, in penn_tokenize
    text = regexp.sub(substitution, text)
AttributeError: 'str' object has no attribute 'sub'

I think the problem is here: the expression is a plain str, not a compiled pattern, so it has no .sub method.

https://github.com/hplt-project/sacremoses/blob/65543c34baf589f30260488d882d0060abaa4087/sacremoses/tokenize.py#L93-L96
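
A minimal sketch of the failure and one possible fix. The rule table and both helper functions are hypothetical and written only for illustration; they are not copied from sacremoses. The assumption is that the tokenizer loops over (pattern, substitution) pairs and that one pattern was left as a raw string.

```python
import re

# Hypothetical rule table for illustration only; these are NOT the real
# sacremoses rules. The second pattern is left as a plain str on purpose.
RULES = [
    (re.compile(r"\("), "-LRB-"),
    (r"\)", "-RRB-"),
]


def apply_rules_buggy(text, rules):
    """Assumes every pattern is compiled; fails on the plain str entry."""
    for regexp, substitution in rules:
        text = regexp.sub(substitution, text)
    return text


def apply_rules(text, rules):
    """Accepts both plain strings and compiled patterns."""
    for regexp, substitution in rules:
        # re.compile returns an already compiled pattern unchanged,
        # so it is safe to call on both kinds of entry.
        text = re.compile(regexp).sub(substitution, text)
    return text


try:
    apply_rules_buggy("(nice)", RULES)
except AttributeError as err:
    print(err)  # 'str' object has no attribute 'sub'

print(apply_rules("(nice)", RULES))  # -LRB-nice-RRB-
```

Compiling the offending expression where it is defined would presumably be the more direct fix; the defensive re.compile call in the loop is just one way to tolerate mixed entries.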

jelmervdl commented 3 months ago

Looks like there are two bugs here: 1) that expression is not compiled, and 2) changing lang doesn't update the penn expressions as you'd expect.

speedcell4 commented 3 months ago

Exactly, two bugs.