milvus-io / milvus-model

The embedding/reranking model zoo helps users convert their unstructured data into embeddings
Apache License 2.0

UnicodeDecodeError for "lang.yaml" #17

Closed fornitroll closed 2 months ago

fornitroll commented 2 months ago

In the BM25 indexing, the line:

analyzer = build_default_analyzer(language="en")

gives an error (full traceback at the end of the message):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2370: character maps to <undefined>

After deleting the other languages from lang.yaml, everything works smoothly, so please check the lang.yaml file and the code that reads it. Also, the stopwords corpus has to be downloaded from the NLTK library (nltk.download('stopwords')) before running BM25; otherwise we get an error. This should be mentioned somewhere in the README. Please correct me if I missed something in the documentation.
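The stopwords prerequisite mentioned above could be handled with a small guard like the one below. This is only a sketch: `ensure_stopwords` is a hypothetical helper name, not part of milvus-model, and it assumes NLTK is installed and that the machine can reach the NLTK download server on first use.

```python
def ensure_stopwords() -> None:
    """Download the NLTK stopwords corpus only if it is not already present.

    Hypothetical helper for illustration; it is not part of milvus-model.
    """
    # Imported lazily so this module still loads when NLTK is not installed.
    import nltk

    try:
        # Raises LookupError when the corpus cannot be found locally.
        nltk.data.find("corpora/stopwords")
    except LookupError:
        nltk.download("stopwords")


# Typical use: call ensure_stopwords() once, before building the BM25 analyzer.
```

Checking with `nltk.data.find` first avoids hitting the network on every run once the corpus is cached.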

Full traceback:

UnicodeDecodeError                        Traceback (most recent call last)
Cell In[40], line 2
      1 with Path(str(filepath)).open() as file:
----> 2     config = yaml.safe_load(file)
      3     lang_config = config.get('en')

File c:\work\agi\venv\Lib\site-packages\yaml\__init__.py:125, in safe_load(stream)
    117 def safe_load(stream):
    118     """
    119     Parse the first YAML document in a stream
    120     and produce the corresponding Python object.
   (...)
    123     to be safe for untrusted input.
    124     """
--> 125     return load(stream, SafeLoader)

File c:\work\agi\venv\Lib\site-packages\yaml\__init__.py:79, in load(stream, Loader)
     74 def load(stream, Loader):
     75     """
     76     Parse the first YAML document in a stream
     77     and produce the corresponding Python object.
     78     """
---> 79     loader = Loader(stream)
     80     try:
     81         return loader.get_single_data()

File c:\work\agi\venv\Lib\site-packages\yaml\loader.py:34, in SafeLoader.__init__(self, stream)
     33 def __init__(self, stream):
---> 34     Reader.__init__(self, stream)
     35     Scanner.__init__(self)
     36     Parser.__init__(self)

File c:\work\agi\venv\Lib\site-packages\yaml\reader.py:85, in Reader.__init__(self, stream)
     83 self.eof = False
     84 self.raw_buffer = None
---> 85 self.determine_encoding()

File c:\work\agi\venv\Lib\site-packages\yaml\reader.py:124, in Reader.determine_encoding(self)
    122 def determine_encoding(self):
    123     while not self.eof and (self.raw_buffer is None or len(self.raw_buffer) < 2):
--> 124         self.update_raw()
    125     if isinstance(self.raw_buffer, bytes):
    126         if self.raw_buffer.startswith(codecs.BOM_UTF16_LE):

File c:\work\agi\venv\Lib\site-packages\yaml\reader.py:178, in Reader.update_raw(self, size)
    177 def update_raw(self, size=4096):
--> 178     data = self.stream.read(size)
    179     if self.raw_buffer is None:
    180         self.raw_buffer = data

File ~\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
     22 def decode(self, input, final=False):
---> 23     return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 468: character maps to <undefined>

wxywb commented 2 months ago

It looks like you are on Windows. Can you change this line: https://github.com/milvus-io/milvus-model/blob/f7d43b252b942dbcd48cd168efae30ad85cc6271/milvus_model/sparse/bm25/tokenizers.py#L179 to

with Path(filepath).open(encoding='utf-8') as file:

If this solves your issue, I will update the file.
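For readers wondering why the explicit encoding matters: the traceback shows the file being decoded through cp1252, while the suggested fix opens it as UTF-8. The sketch below reproduces the failure on any platform by forcing cp1252 on a small UTF-8 file. The file name and its contents are made up for illustration and are not the real lang.yaml, and the YAML parser is left out so that only the decoding step is shown.

```python
import tempfile
from pathlib import Path

# "あ" (U+3042) is encoded in UTF-8 as the bytes E3 81 82, so the file
# contains 0x81, a byte that cp1252 leaves undefined.
sample = 'ja:\n  greeting: "あ"\n'

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for lang.yaml.
    path = Path(tmp) / "lang_sample.yaml"
    path.write_bytes(sample.encode("utf-8"))

    # Force cp1252 to mimic the codec seen in the traceback.
    try:
        path.read_text(encoding="cp1252")
        cp1252_failed = False
    except UnicodeDecodeError as exc:
        cp1252_failed = True
        failing_byte = exc.object[exc.start]  # the byte the codec rejected

    # With an explicit UTF-8 encoding the same file reads cleanly everywhere.
    text = path.read_text(encoding="utf-8")

print(cp1252_failed, hex(failing_byte))
```

Passing `encoding='utf-8'` makes the read independent of the machine's locale, which is why the one-line change is enough.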

fornitroll commented 2 months ago

This change solves the issue. Thank you!

wxywb commented 2 months ago

The new release updates this line: https://github.com/milvus-io/milvus-model/blob/9a746d6590deb3dac6b02cf604d6c60231f4c4f6/milvus_model/sparse/bm25/tokenizers.py#L185