visual-layer / fastdup

fastdup is a powerful free tool designed to rapidly extract valuable insights from your image & video datasets. It helps you improve the quality of your dataset images and labels and reduce your data operations costs at an unparalleled scale.

[Bug]: OCR notebook failing with Unicode issues from pandas #249

Closed dnth closed 10 months ago

dnth commented 11 months ago

What happened?

I was trying to run the OCR sample notebook and found that it fails with a Unicode decoding error.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 68273: invalid continuation byte

What did you expect to see?

No response

What version of fastdup were you running on?

1.33

What version of Python were you running on?

Python 3.10

Operating System

Google Colab

Reproduction steps

Run - https://colab.research.google.com/drive/1XvRkN4tCcW3K9J4UlUBIqm8Z2orJfFvp?usp=sharing

Relevant log output

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/fastdup/sentry.py", line 132, in inner_function
    ret = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/fastdup/fastdup_controller.py", line 533, in run
    if fastdup.run(self._set_fastdup_input(), work_dir=str(self._work_dir), **fastdup_kwargs) != 0:
  File "/usr/local/lib/python3.10/dist-packages/fastdup/__init__.py", line 679, in run
    out_df = pd.read_csv(local_file)[[
  File "/usr/local/lib/python3.10/dist-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/parsers/readers.py", line 950, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/parsers/readers.py", line 605, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/parsers/readers.py", line 1442, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/parsers/readers.py", line 1753, in _make_engine
    return mapping[engine](f, **self.options)
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/parsers/c_parser_wrapper.py", line 79, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 547, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 636, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 852, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1965, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 68273: invalid continuation byte
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-9-b5abbcba4415> in <cell line: 3>()
      1 import fastdup
      2 fd = fastdup.create(input_dir='./frames')
----> 3 fd.run(bounding_box='ocr', num_images=100)

15 frames
/usr/local/lib/python3.10/dist-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 68273: invalid continuation byte
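
The message itself narrows down the cause. 0xe0 is the lead byte of a three-byte UTF-8 sequence, so the decoder expects continuation bytes (0x80 to 0xBF) after it. "invalid continuation byte" means the next byte is something else, which is typical of text written in a single-byte encoding such as Latin-1, where 0xe0 is "à". The following is a minimal standard-library sketch that reproduces the same error class. The sample bytes are made up for illustration and are not taken from the failing file:

```python
# Hypothetical sample: "voilà" encoded as Latin-1, so "à" is the single byte 0xe0.
raw = "text\nvoilà, ok\n".encode("latin-1")


def try_utf8(data: bytes):
    """Return (decoded_text, None) on success or (None, error) on failure."""
    try:
        return data.decode("utf-8"), None
    except UnicodeDecodeError as exc:
        return None, exc


text, err = try_utf8(raw)
if err is not None:
    # err.start is the byte offset reported as "position N" in the message.
    print(f"{err.reason} at byte {err.start}: 0x{raw[err.start]:02x}")

# Decoding with errors="replace" keeps going and marks bad bytes with U+FFFD.
lossy = raw.decode("utf-8", errors="replace")
```

pandas exposes the same choices through `pd.read_csv(..., encoding=..., encoding_errors=...)` (the latter in pandas 1.3 and later). Whether the `pd.read_csv(local_file)` call inside fastdup should use them is a decision for the maintainers.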

Attach a screenshot [Optional]

No response

Contact Details [Optional]

No response

dbickson commented 11 months ago

Hi @dnth, could you please send me the local file that failed to read?
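
For anyone hitting the same error, here is a small sketch for locating the offending bytes before sharing the file. The function name `first_bad_utf8` and the demo file are hypothetical, since the traceback does not name the CSV that fastdup was reading. The position reported by pandas may be relative to an internal read buffer rather than the start of the file, so this scans the whole file instead of seeking to offset 68273:

```python
import tempfile
from pathlib import Path


def first_bad_utf8(path, context=20):
    """Return (offset, surrounding_bytes) for the first invalid UTF-8 byte, or None."""
    data = Path(path).read_bytes()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        lo = max(0, exc.start - context)
        return exc.start, data[lo:exc.start + context]
    return None


# Demo on a throwaway file; point first_bad_utf8 at the CSV that failed to load instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    # Made-up row containing a Latin-1 byte (0xe0) that is not valid UTF-8 here.
    demo_path.write_bytes(b"filename,text\nimg1.jpg,caf\xe0 bar\n")
    result = first_bad_utf8(demo_path)

print(result)
```

The surrounding bytes usually show which row and column carry the non-UTF-8 text, which is often enough to include in a bug report when the full file cannot be shared.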

dbickson commented 10 months ago

Please reopen if this is reproducible.