goru001 / inltk

Natural Language Toolkit for Indic Languages aims to provide out of the box support for various NLP tasks that an application developer might need
https://inltk.readthedocs.io
MIT License
813 stars, 164 forks

Getting Runtime error while calling the tokenizer #93

Open patilaum opened 1 year ago

patilaum commented 1 year ago

Hi, thanks for the great repo.

I am getting the following error:

Traceback (most recent call last):
  File "marathi_support_file.py", line 241, in <module>
    print(tokenize(hindi_text, "mr"))
  File "/home/aum/my_tensorflow/marenv/lib/python3.7/site-packages/inltk/inltk.py", line 62, in tokenize
    tok = LanguageTokenizer(language_code)
  File "/home/aum/my_tensorflow/marenv/lib/python3.7/site-packages/inltk/tokenizer.py", line 14, in __init__
    self.base = EnglishTokenizer(lang) if lang == LanguageCodes.english else IndicTokenizer(lang)
  File "/home/aum/my_tensorflow/marenv/lib/python3.7/site-packages/inltk/tokenizer.py", line 63, in __init__
    self.sp.Load(str(model_path))
  File "/home/aum/my_tensorflow/marenv/lib/python3.7/site-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "/home/aum/my_tensorflow/marenv/lib/python3.7/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())] 

This happens while running the following code snippet:

from inltk.inltk import setup

setup('mr')

from inltk.inltk import tokenize

hindi_text = """संभाजीनगरमध्ये घडलेली घटना दुर्दैवी आहे. काही लोकांकडून भडकाऊ भाषण देऊन परिस्थिती चिघळवण्याचा प्रयत्न 
सुरू आहे. अशा परिस्थितीत काय बोलावं, याचं भान प्रत्येकाने ठेवायला हवं. सर्वांनी शांतता राखायला हवी. 
आपलं शहर शांत ठेवण्याची जबाबदारी प्रत्येकाची आहे. या घटनेला कोणी राजकीय रंग देत असतील तर यापेक्षा जास्त दुर्दैवी काहीही नाही, 
अशी प्रतिक्रिया देवेंद्र फडणवीस यांनी दिली."""
print(tokenize(hindi_text, "mr"))

I thought it was an issue with the torch version, so I installed Python 3.7 and then torch 0.3.0+cpu in a Python 3.7 virtualenv, but I am still getting the same error.
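
The final `RuntimeError` is raised while sentencepiece parses the model file (`ParseFromArray`), which suggests the file on disk may not be a valid model, for example an incomplete download or an HTML error page saved in its place. A minimal sketch of a sanity check follows. It uses only the standard library; the helper name `inspect_model_file` and the example path are hypothetical, since the traceback does not show where `model_path` points.

```python
from pathlib import Path


def inspect_model_file(path):
    """Give a rough diagnosis of a candidate tokenizer model file.

    Hypothetical helper: it only looks at the first bytes of the file and
    cannot prove that the file is a valid SentencePiece model.
    """
    p = Path(path)
    if not p.is_file():
        return "missing"
    with p.open("rb") as fh:
        head = fh.read(256)
    if not head:
        return "empty"
    start = head.lstrip().lower()
    # A failed download is sometimes saved as an HTML page instead of the model.
    if start.startswith(b"<!doctype") or start.startswith(b"<html"):
        return "html"
    return "not-obviously-broken"


# Assumed placeholder path; replace it with the model file that inltk downloaded.
print(inspect_model_file("path/to/tokenizer.model"))
```

If the result is `missing`, `empty`, or `html`, deleting the file and running `setup('mr')` again may be worth trying before changing the torch version.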

Can you please help me with this?