[Bug]: OCR notebook failing with Unicode issues from pandas

dnth commented 11 months ago

What happened?

I was trying to run the OCR sample notebook and found that it fails with a Unicode decoding error.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 68273: invalid continuation byte

What did you expect to see?

No response

What version of fastdup were you running on?

1.33

What version of Python were you running on?

Python 3.10

Operating System

Google Colab

Reproduction steps

Run - https://colab.research.google.com/drive/1XvRkN4tCcW3K9J4UlUBIqm8Z2orJfFvp?usp=sharing

Relevant log output

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/fastdup/sentry.py", line 132, in inner_function
    ret = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/fastdup/fastdup_controller.py", line 533, in run
    if fastdup.run(self._set_fastdup_input(), work_dir=str(self._work_dir), **fastdup_kwargs) != 0:
  File "/usr/local/lib/python3.10/dist-packages/fastdup/__init__.py", line 679, in run
    out_df = pd.read_csv(local_file)[[
  File "/usr/local/lib/python3.10/dist-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/parsers/readers.py", line 950, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/parsers/readers.py", line 605, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/parsers/readers.py", line 1442, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/parsers/readers.py", line 1753, in _make_engine
    return mapping[engine](f, **self.options)
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/parsers/c_parser_wrapper.py", line 79, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 547, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 636, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 852, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1965, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 68273: invalid continuation byte
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-9-b5abbcba4415> in <cell line: 3>()
      1 import fastdup
      2 fd = fastdup.create(input_dir='./frames')
----> 3 fd.run(bounding_box='ocr', num_images=100)

15 frames
/usr/local/lib/python3.10/dist-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 68273: invalid continuation byte
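
For anyone reading the log above: this message usually means the file contains a byte sequence that is not valid UTF-8. Byte 0xe0 starts a three-byte UTF-8 sequence, so it must be followed by continuation bytes; in Latin-1 the same byte is simply "à". The sketch below reproduces the same kind of error with made-up bytes. The sample data is an assumption for illustration only and is not taken from the file fastdup wrote, so the real cause may differ.

```python
# Hypothetical bytes, NOT taken from the actual fastdup output file.
# "cafà" encoded as Latin-1: 0xe0 is followed by ',' (0x2c), which is
# not a valid UTF-8 continuation byte.
sample = b"caf\xe0,1\n"

try:
    sample.decode("utf-8")
    message = None
except UnicodeDecodeError as err:
    # Same error class and wording as in the traceback, different offset.
    message = str(err)

print(message)

# The same bytes decode without error when interpreted as Latin-1.
latin1_text = sample.decode("latin-1")
print(latin1_text)
```

If the real file turns out to be in a legacy encoding such as Latin-1, the text probably has to be decoded with that encoding before parsing; whether that applies here is unconfirmed.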

Attach a screenshot [Optional]

No response

Contact Details [Optional]

No response

dbickson commented 11 months ago

Hi @dnth, could you please send me the local file that failed to be read?
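
If the whole file is too large to share, a small excerpt around the failing offset may be enough. Below is a minimal sketch using only the standard library. `find_decode_error` is a hypothetical helper, not part of fastdup or pandas, and the path passed to it would be whichever CSV the traceback's `pd.read_csv(local_file)` call was reading, which this thread does not name.

```python
from pathlib import Path


def find_decode_error(path, context=16):
    """Return (offset, surrounding_bytes) for the first invalid UTF-8
    sequence in the file, or None if the file decodes cleanly.

    Hypothetical helper for illustration; not a fastdup or pandas API.
    """
    data = Path(path).read_bytes()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start / err.end delimit the offending bytes in `data`.
        start = max(err.start - context, 0)
        return err.start, data[start:err.end + context]
    return None
```

Calling it on the unreadable file and pasting the returned offset and bytes into the issue would show which characters precede the invalid byte, which often hints at the file's real encoding.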

dbickson commented 10 months ago

Please reopen if this is reproducible.

visual-layer / fastdup

[Bug]: OCR notebook failing with Unicode issues from pandas #249
