stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License
2.68k stars 355 forks

UnicodeDecodeError when loading queries/collection. #250

Open WeeMan1990 opened 9 months ago

WeeMan1990 commented 9 months ago

Following the steps in https://github.com/stanford-futuredata/ColBERT/blob/main/docs/intro.ipynb with Anaconda/JupyterLab, a UnicodeDecodeError was thrown at the step that loads the queries/collection.

Output:
[Sep 15, 15:20:46] #> Got 417 queries. All QIDs are unique.
[Sep 15, 15:20:46] #> Loading collection...

UnicodeDecodeError                        Traceback (most recent call last)
Cell In[3], line 8
      5 collection = os.path.join(dataroot, dataset, datasplit, 'collection.tsv')
      7 queries = Queries(path=queries)
----> 8 collection = Collection(path=collection)
     10 f'Loaded {len(queries)} queries and {len(collection):,} passages'

File ~\ColBERT\colbert\data\collection.py:17, in Collection.__init__(self, path, data)
     15 def __init__(self, path=None, data=None):
     16     self.path = path
---> 17     self.data = data or self._load_file(path)

File ~\ColBERT\colbert\data\collection.py:33, in Collection._load_file(self, path)
     31 def _load_file(self, path):
     32     self.path = path
---> 33     return self._load_tsv(path) if path.endswith('.tsv') else self._load_jsonl(path)

File ~\ColBERT\colbert\data\collection.py:36, in Collection._load_tsv(self, path)
     35 def _load_tsv(self, path):
---> 36     return load_collection(path)

File ~\ColBERT\colbert\evaluation\loaders.py:161, in load_collection(collection_path)
    158 collection = []
    160 with open(collection_path) as f:
--> 161     for line_idx, line in enumerate(f):
    162         if line_idx % (1000*1000) == 0:
    163             print(f'{line_idx // 1000 // 1000}M', end=' ', flush=True)

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 4837: character maps to <undefined>
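The error above can be reproduced in isolation. This is a minimal sketch under two assumptions that the issue does not state outright. First, that the notebook ran with cp1252 as the default text encoding (the `~\ColBERT\...` paths and the 'charmap' codec point that way). Second, that the 0x9d byte belongs to a UTF-8 encoded right double quotation mark (U+201D, bytes e2 80 9d), which is only one plausible source of such a byte. The sample string is made up for illustration.

```python
# Reproduce the decode failure without any file I/O.
# Assumption: the text contains a right double quotation mark (U+201D);
# the actual character in collection.tsv is not known from the issue.
raw = "He said \u201dok\u201d".encode("utf-8")

# 0x9d is one of the few byte values cp1252 leaves unmapped, so decoding fails.
try:
    raw.decode("cp1252")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# Decoding with the encoding the bytes were written in round-trips cleanly.
text = raw.decode("utf-8")
```

If the assumptions hold, `decode_error` carries the same kind of message as the traceback, and the fix is to decode the file as UTF-8 rather than with the locale default.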

It seems I could solve this by changing line 160 in evaluation/loaders.py from:
with open(collection_path) as f:
to:
with open(collection_path, encoding='utf8') as f:
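The effect of that one-argument change can be sketched outside ColBERT. The helper below, `read_tsv_passages`, is hypothetical and is not the library's `load_collection`; it only assumes the collection is a UTF-8 TSV file with an id and a passage on each line. When `encoding` is omitted, `open()` generally falls back to the platform's locale encoding, which would explain why the same file can load on one machine and fail on another.

```python
import os
import tempfile


def read_tsv_passages(path):
    """Hypothetical helper: read `id<TAB>passage` lines from a UTF-8 TSV file."""
    passages = []
    # Naming the encoding makes the result independent of the OS locale.
    with open(path, encoding="utf-8") as f:
        for line in f:
            pid, passage = line.rstrip("\n").split("\t", 1)
            passages.append((pid, passage))
    return passages


# Write a tiny stand-in collection containing a non-ASCII character.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "collection.tsv")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("0\tA \u201dquoted\u201d passage\n1\tPlain ASCII passage\n")
    passages = read_tsv_passages(sample_path)
```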

Output after editing loaders.py (line 160):
[Sep 15, 15:30:41] #> Loading the queries from lotte\lifestyle\dev\questions.search.tsv ...
[Sep 15, 15:30:41] #> Got 417 queries. All QIDs are unique.
[Sep 15, 15:30:41] #> Loading collection...

'Loaded 417 queries and 268,893 passages'