Fixes loading of CSV files that contain Unicode characters. Previously these failed with UnicodeDecodeError: 'charmap' codec can't decode byte, which prevented further training. I first ran into this issue while working locally with privateGPT.
C:\>chatdocs add source_documents
Creating new vectorstore
Loading documents from source_documents
Loading new documents: 17%|███▋ | 1/6 [00:07<00:36, 7.35s/it]
RemoteTraceback:
Traceback (most recent call last):
File "C:\Python310\lib\multiprocessing\pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "C:\Python310\lib\site-packages\chatdocs\add.py", line 74, in load_single_document
return loader.load()
File "C:\Python310\lib\site-packages\langchain\document_loaders\csv_loader.py", line 51, in load
for i, row in enumerate(csv_reader):
File "C:\Python310\lib\csv.py", line 110, in __next__
self.fieldnames
File "C:\Python310\lib\csv.py", line 97, in fieldnames
self._fieldnames = next(self.reader)
File "C:\Python310\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 4693: character maps to <undefined>
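The traceback shows the CSV being decoded with cp1252, the default text encoding on many Windows setups, which has no mapping for byte 0x9d. The snippet below is a minimal reproduction sketch using only the standard library, not the actual change in this PR: it assumes the file is UTF-8 and that the remedy is to pass an explicit encoding when opening it. The sample rows and the `read_rows` helper are hypothetical and exist only for illustration.

```python
import csv
import os
import tempfile

# Hypothetical sample data. The right curly quote (U+201D) is stored as the
# bytes E2 80 9D in UTF-8, and 0x9D is undefined in cp1252.
rows = [["name", "note"], ["Alice", "said \u201chello\u201d"]]

# Write the sample CSV as UTF-8 to a temporary file.
fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
with open(path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)


def read_rows(csv_path, encoding):
    """Read every row of a CSV file as a dict, using the given text encoding."""
    with open(csv_path, newline="", encoding=encoding) as f:
        return list(csv.DictReader(f))


# Decoding as cp1252 reproduces the failure regardless of the OS default.
try:
    read_rows(path, "cp1252")
    cp1252_failed = False
except UnicodeDecodeError as exc:
    cp1252_failed = True
    print("cp1252 failed:", exc)

# Decoding as UTF-8 reads the same file without error.
utf8_rows = read_rows(path, "utf-8")
print("utf-8 read", len(utf8_rows), "row(s)")

os.remove(path)
```

If the input files are not guaranteed to be UTF-8, a hard-coded encoding only moves the problem, so making the encoding configurable may be worth considering.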
Type of change
Bug fix (non-breaking change which fixes an issue)
How Has This Been Tested?
After making the code change and installing the package, I added a documents folder with a CSV that contains Unicode characters. I tested with an internal document and then, separately, with the Titanic passenger information.
C:\>chatdocs add source_documents
Creating new vectorstore
Loading documents from docs
Loading new documents: 100%|██████████████████████| 1/1 [00:03<00:00, 3.31s/it]
Loaded 1932 new documents from docs
Creating embeddings. May take a few minutes...