fastdup is a free tool for rapidly extracting insights from image and video datasets. It helps you improve the quality of your dataset images and labels and reduce data operations costs at scale.
[Bug]: OCR notebook failing with Unicode issues from pandas #249
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/fastdup/sentry.py", line 132, in inner_function
    ret = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/fastdup/fastdup_controller.py", line 533, in run
    if fastdup.run(self._set_fastdup_input(), work_dir=str(self._work_dir), **fastdup_kwargs) != 0:
  File "/usr/local/lib/python3.10/dist-packages/fastdup/__init__.py", line 679, in run
    out_df = pd.read_csv(local_file)[[
  File "/usr/local/lib/python3.10/dist-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/parsers/readers.py", line 950, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/parsers/readers.py", line 605, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/parsers/readers.py", line 1442, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/parsers/readers.py", line 1753, in _make_engine
    return mapping[engine](f, **self.options)
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/parsers/c_parser_wrapper.py", line 79, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 547, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 636, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 852, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1965, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 68273: invalid continuation byte
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-9-b5abbcba4415> in <cell line: 3>()
1 import fastdup
2 fd = fastdup.create(input_dir='./frames')
----> 3 fd.run(bounding_box='ocr', num_images=100)
15 frames
/usr/local/lib/python3.10/dist-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 68273: invalid continuation byte
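To narrow down which text triggers an error like the one above, it can help to look at the bytes around the first invalid UTF-8 sequence. The following is only a diagnostic sketch: `first_invalid_utf8` and the sample bytes are hypothetical, and the position reported by pandas may refer to an internal read buffer rather than an absolute file offset, so scanning the raw bytes directly is usually more reliable.

```python
from typing import Optional, Tuple


def first_invalid_utf8(data: bytes, context: int = 16) -> Optional[Tuple[int, bytes]]:
    """Return (offset, surrounding bytes) of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start / err.end delimit the offending bytes within `data`.
        lo = max(0, err.start - context)
        return err.start, data[lo:err.end + context]
    return None


# Hypothetical sample: 0xE0 is "à" in Latin-1, but in UTF-8 it must be
# followed by continuation bytes, so 0xE0 + space is invalid.
sample = b"ok,voil\xe0 tout\n"
result = first_invalid_utf8(sample)
print(result)
```

On a real run, `data` would be the raw bytes of the CSV that fastdup passes to `pd.read_csv` (its file name is not shown in the traceback), read with `open(path, "rb").read()`.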
What happened?
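I was trying to run the OCR sample notebook and hit a Unicode decoding error: fd.run(bounding_box='ocr', num_images=100) fails with the UnicodeDecodeError shown above, raised from pd.read_csv inside fastdup.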
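For illustration, the failure mode can be reproduced outside fastdup with a small in-memory CSV. This is a sketch under assumptions, not fastdup's actual code or a confirmed fix: the CSV content and the helper `read_csv_tolerant` are hypothetical, and the fallback relies on the `encoding_errors` parameter of `pandas.read_csv`, which exists in pandas 1.3 and later.

```python
import io

import pandas as pd

# Hypothetical CSV bytes: OCR text containing a Latin-1 "à" (0xE0),
# which is not valid UTF-8 when followed by an ASCII byte.
raw = b"filename,label\nframe_001.jpg,voil\xe0 text\n"


def read_csv_tolerant(data: bytes) -> pd.DataFrame:
    """Read CSV bytes as strict UTF-8, falling back to replacing bad bytes."""
    try:
        return pd.read_csv(io.BytesIO(data), encoding="utf-8")
    except UnicodeDecodeError:
        # Undecodable bytes become U+FFFD instead of aborting the read.
        return pd.read_csv(
            io.BytesIO(data), encoding="utf-8", encoding_errors="replace"
        )


df = read_csv_tolerant(raw)
print(df)
```

Replacing bytes loses the original character, so if the text is known to be in another encoding (for example Latin-1), passing that `encoding` explicitly would preserve it.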
What did you expect to see?
No response
What version of fastdup were you running on?
1.33
What version of Python were you running on?
Python 3.10
Operating System
Google Colab
Reproduction steps
Run - https://colab.research.google.com/drive/1XvRkN4tCcW3K9J4UlUBIqm8Z2orJfFvp?usp=sharing
Relevant log output