marella / chatdocs

Chat with your documents offline using AI.
MIT License

Add UTF-8 support to CSV loader #10

Closed ianmeinert closed 1 year ago

ianmeinert commented 1 year ago

Description

Fixes loading of CSV files that contain Unicode characters. Such files currently fail with UnicodeDecodeError: 'charmap' codec can't decode byte, which prevents further training. I first ran into this issue while working locally with privateGPT.

C:\>chatdocs add source_documents
Creating new vectorstore
Loading documents from source_documents
Loading new documents:  17%|███▋                  | 1/6 [00:07<00:36,  7.35s/it]
RemoteTraceback:
Traceback (most recent call last):
  File "C:\Python310\lib\multiprocessing\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "C:\Python310\lib\site-packages\chatdocs\add.py", line 74, in load_single_document
    return loader.load()
  File "C:\Python310\lib\site-packages\langchain\document_loaders\csv_loader.py", line 51, in load
    for i, row in enumerate(csv_reader):
  File "C:\Python310\lib\csv.py", line 110, in __next__
    self.fieldnames
  File "C:\Python310\lib\csv.py", line 97, in fieldnames
    self._fieldnames = next(self.reader)
  File "C:\Python310\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 4693: character maps to <undefined>
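
The failure in the traceback can be reproduced without any CSV file. The sketch below assumes the offending byte came from a UTF-8 encoded right double quotation mark (U+201D), which is only one plausible source of 0x9d; the actual contents of the failing file are not shown in this thread.

```python
# Assumption: the CSV was saved as UTF-8 and contained a right double
# quotation mark (U+201D). In UTF-8 that character is three bytes,
# the last of which is 0x9d.
utf8_bytes = "\u201d".encode("utf-8")
print(utf8_bytes)  # b'\xe2\x80\x9d'

# cp1252, the codec that appears in the traceback, has no character
# assigned to byte 0x9d, so decoding raises UnicodeDecodeError.
error = None
try:
    utf8_bytes.decode("cp1252")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# Decoding with the encoding the bytes were written in succeeds.
decoded = utf8_bytes.decode("utf-8")
print(decoded == "\u201d")
```

When open() is called without an encoding argument, Python falls back to the platform's preferred encoding, which is why the cp1252 codec shows up in the traceback above.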

Type of change

Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

After making the code change and installing the package, I added a documents folder containing a CSV file with Unicode characters. I tested once with an internal document and again, separately, with the Titanic passenger dataset.

C:\>chatdocs add source_documents
Creating new vectorstore
Loading documents from docs
Loading new documents: 100%|██████████████████████| 1/1 [00:03<00:00,  3.31s/it]
Loaded 1932 new documents from docs
Creating embeddings. May take a few minutes...
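
A comparable check can be run with only the standard library. This is a minimal sketch, not the code from this PR: read_csv_rows is a hypothetical helper, and the sample file contents are made up for illustration. It writes a small UTF-8 CSV, reads it back with an explicit encoding, and confirms that the same file cannot be read as cp1252.

```python
import csv
import os
import tempfile


def read_csv_rows(path, encoding="utf-8"):
    """Hypothetical helper: read a CSV into a list of dicts using an explicit encoding."""
    with open(path, newline="", encoding=encoding) as f:
        return list(csv.DictReader(f))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")

    # Write a UTF-8 CSV whose second row contains non-ASCII characters,
    # including U+201D, whose last UTF-8 byte is 0x9d.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "quote"])
        writer.writerow(["Zo\u00eb", "\u201chello\u201d"])

    # Reading with an explicit UTF-8 encoding works on any platform.
    rows = read_csv_rows(path)

    # Reading the same file as cp1252 fails on byte 0x9d.
    try:
        read_csv_rows(path, encoding="cp1252")
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

print(rows)
print(cp1252_failed)
```

Passing the encoding explicitly makes the result independent of the operating system's default, so the same check behaves the same way on Windows and elsewhere.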

marella commented 1 year ago

Thanks for the PR. This is released in the latest version, 0.2.3.

I also added a test case to the automated tests that fails on Windows before the fix but passes after the fix.