ParserError: Error tokenizing data. C error: Expected 3 fields in line 6, saw 4

dnth commented 1 year ago

Python version - 3.10
fastdup version - 0.214
OS - Ubuntu 20.04

I tried to run fastdup on data scraped from Google. Here's how I ran it:

import fastdup
work_dir = "./fastdup_report"
images_dir = "./images"

fd = fastdup.create(work_dir, images_dir)
fd.run(verbose=True)

Here's the error I got.

Traceback (most recent call last):
  File "/home/dnth/anaconda3/envs/fastdupv1/lib/python3.10/site-packages/fastdup/sentry.py", line 114, in inner_function
    ret = func(*args, **kwargs)
  File "/home/dnth/anaconda3/envs/fastdupv1/lib/python3.10/site-packages/fastdup/fastdup_controller.py", line 303, in run
    fastdup_convert_to_relpath(self._work_dir, self._filename_prefix)
  File "/home/dnth/anaconda3/envs/fastdupv1/lib/python3.10/site-packages/fastdup/fastdup_controller.py", line 1011, in fastdup_convert_to_relpath
    df = pd.read_csv(fname)
  File "/home/dnth/anaconda3/envs/fastdupv1/lib/python3.10/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/home/dnth/anaconda3/envs/fastdupv1/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/home/dnth/anaconda3/envs/fastdupv1/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/dnth/anaconda3/envs/fastdupv1/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 611, in _read
    return parser.read(nrows)
  File "/home/dnth/anaconda3/envs/fastdupv1/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1778, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/home/dnth/anaconda3/envs/fastdupv1/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 230, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 808, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 866, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 852, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1973, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 3 fields in line 6, saw 4

---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
Cell In[7], line 5
      2 images_dir = "./images"
      4 fd = fastdup.create(work_dir, images_dir)
----> 5 fd.run(verbose=True)

File ~/anaconda3/envs/fastdupv1/lib/python3.10/site-packages/fastdup/engine.py:156, in Fastdup.run(self, input_dir, annotations, embeddings, subset, data_type, overwrite, model_path, distance, nearest_neighbors_k, threshold, outlier_percentile, num_threads, num_images, verbose, license, high_accuracy, cc_threshold, **kwargs)
    153     fastdup_func_params['model_path'] = model_path
    154 fastdup_func_params.update(kwargs)
--> 156 super().run(annotations=annotations, input_dir=input_dir, subset=subset, data_type=data_type,
    157             overwrite=overwrite, embeddings=embeddings, **fastdup_func_params)

File ~/anaconda3/envs/fastdupv1/lib/python3.10/site-packages/fastdup/sentry.py:120, in v1_sentry_handler.<locals>.inner_function(*args, **kwargs)
    118 except Exception as ex:
    119     fastdup_capture_exception(f"V1:{func.__name__}", ex)
--> 120     raise ex

File ~/anaconda3/envs/fastdupv1/lib/python3.10/site-packages/fastdup/sentry.py:114, in v1_sentry_handler.<locals>.inner_function(*args, **kwargs)
    112 try:
    113     start_time = time.time()
--> 114     ret = func(*args, **kwargs)
    115     fastdup_performance_capture(f"V1:{func.__name__}", start_time)
    116     return ret

File ~/anaconda3/envs/fastdupv1/lib/python3.10/site-packages/fastdup/fastdup_controller.py:303, in FastdupController.run(self, input_dir, annotations, subset, embeddings, data_type, overwrite, print_summary, **fastdup_kwargs)
    301 # run fastdup - create embeddings
    302 fastdup.run(self._set_fastdup_input(), work_dir=str(self._work_dir), **fastdup_kwargs)
--> 303 fastdup_convert_to_relpath(self._work_dir, self._filename_prefix)
    305 # post process - map fastdup-id to image (for bbox this is done in self._set_fastdup_input)
    306 if self._dtype == FD.IMG or self._run_mode == FD.MODE_CROP:

File ~/anaconda3/envs/fastdupv1/lib/python3.10/site-packages/fastdup/fastdup_controller.py:1011, in fastdup_convert_to_relpath(work_dir, input_dir)
   1009     continue
   1010 try:
-> 1011     df = pd.read_csv(fname)
   1013 except EmptyDataError as e:
   1014     print(f'{src_file} is empty')

File ~/anaconda3/envs/fastdupv1/lib/python3.10/site-packages/pandas/util/_decorators.py:211, in deprecate_kwarg.<locals>._deprecate_kwarg.<locals>.wrapper(*args, **kwargs)
    209     else:
    210         kwargs[new_arg_name] = new_arg_value
--> 211 return func(*args, **kwargs)

File ~/anaconda3/envs/fastdupv1/lib/python3.10/site-packages/pandas/util/_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    325 if len(args) > num_allow_args:
    326     warnings.warn(
    327         msg.format(arguments=_format_argument_list(allow_args)),
    328         FutureWarning,
    329         stacklevel=find_stack_level(),
    330     )
--> 331 return func(*args, **kwargs)

File ~/anaconda3/envs/fastdupv1/lib/python3.10/site-packages/pandas/io/parsers/readers.py:950, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    935 kwds_defaults = _refine_defaults_read(
    936     dialect,
    937     delimiter,
   (...)
    946     defaults={"delimiter": ","},
    947 )
    948 kwds.update(kwds_defaults)
--> 950 return _read(filepath_or_buffer, kwds)

File ~/anaconda3/envs/fastdupv1/lib/python3.10/site-packages/pandas/io/parsers/readers.py:611, in _read(filepath_or_buffer, kwds)
    608     return parser
    610 with parser:
--> 611     return parser.read(nrows)

File ~/anaconda3/envs/fastdupv1/lib/python3.10/site-packages/pandas/io/parsers/readers.py:1778, in TextFileReader.read(self, nrows)
   1771 nrows = validate_integer("nrows", nrows)
   1772 try:
   1773     # error: "ParserBase" has no attribute "read"
   1774     (
   1775         index,
   1776         columns,
   1777         col_dict,
-> 1778     ) = self._engine.read(  # type: ignore[attr-defined]
   1779         nrows
   1780     )
   1781 except Exception:
   1782     self.close()

File ~/anaconda3/envs/fastdupv1/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py:230, in CParserWrapper.read(self, nrows)
    228 try:
    229     if self.low_memory:
--> 230         chunks = self._reader.read_low_memory(nrows)
    231         # destructive to chunks
    232         data = _concatenate_chunks(chunks)

File ~/anaconda3/envs/fastdupv1/lib/python3.10/site-packages/pandas/_libs/parsers.pyx:808, in pandas._libs.parsers.TextReader.read_low_memory()

File ~/anaconda3/envs/fastdupv1/lib/python3.10/site-packages/pandas/_libs/parsers.pyx:866, in pandas._libs.parsers.TextReader._read_rows()

File ~/anaconda3/envs/fastdupv1/lib/python3.10/site-packages/pandas/_libs/parsers.pyx:852, in pandas._libs.parsers.TextReader._tokenize_rows()

File ~/anaconda3/envs/fastdupv1/lib/python3.10/site-packages/pandas/_libs/parsers.pyx:1973, in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 3 fields in line 6, saw 4

Did I miss anything?

Here's the notebook I ran - https://github.com/dnth/clean-up-digital-life-fastdup-blogpost/blob/update-v1/fastdup_analyze.ipynb

dbickson commented 1 year ago

Hi @dnth, can you send us the resulting work_dir/atrain_features.dat.csv? Something has gone wrong with the generated columns. Maybe one of the filenames has a "," in it?
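
For readers following along, here is a minimal sketch of why a comma in a filename would produce this exact kind of error. The column names and rows are made up for illustration and are not fastdup's real output. It assumes the CSV was written without quoting, so the extra comma adds a fourth field to a three-column file.

```python
import io

import pandas as pd

# Hypothetical CSV content (not real fastdup output): three columns,
# and the last row holds a filename with an unquoted comma.
unquoted = (
    "index,filename,value\n"
    "0,images/cat.jpg,1\n"
    "1,images/bird.jpg,1\n"
    "2,images/dog, small.jpg,1\n"
)

try:
    pd.read_csv(io.StringIO(unquoted))
    error_message = None
except pd.errors.ParserError as exc:
    # The unquoted comma makes the last row look like it has 4 fields.
    error_message = str(exc)

print(error_message)

# The same data parses cleanly once the filename is quoted.
quoted = unquoted.replace("images/dog, small.jpg", '"images/dog, small.jpg"')
df = pd.read_csv(io.StringIO(quoted))
print(df.loc[2, "filename"])
```

Quoting the field is what a standard CSV writer would do. Whether fastdup's writer quoted filenames at the time is not stated in this thread.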

dnth commented 1 year ago

@dbickson there are a lot of weird characters in the filenames. The file is attached below.

atrain_features.dat.csv

dbickson commented 1 year ago

Hi @dnth, the file you provided loads fine in pandas. The error comes from pandas failing to read one of the CSV files, but the error message does not say which file.

Can you locally edit your installed fastdup, changing the code in site-packages/fastdup/fastdup_controller.py around line 1017 to:

    for src_file in fastdup_src_files:
        fname = work_dir / src_file
        if not fname.exists():
            continue
        try:
            df = pd.read_csv(fname)
        except EmptyDataError as e:
            # An empty CSV has nothing to convert, so skip it.
            continue
        except Exception as e:
            # Include the file name so the failing CSV can be identified.
            fastdup_capture_exception(f"fastdup_convert_to_relpath {fname}", e)
            raise RuntimeError(f"fastdup_convert_to_relpath {fname}") from e

        # Apply remove_working_dir to every filename, then overwrite the CSV in place.
        df['filename'] = df.filename.apply(remove_working_dir)
        df.to_csv(work_dir / src_file, index=False)
        assert os.path.exists(work_dir / src_file), f"Failed to overwrite file {src_file}"

Then rerun, note the filename reported in the error, and share that file.

dbickson commented 1 year ago

P.S. The file could also be atrain_features.bad.csv or atrain_crops.csv if you ran with bounding_boxes. One of them has messed-up filenames.
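
As an alternative to patching the installed package, the CSV files in the work directory can be checked from outside. The sketch below uses only the standard library. The helper names `find_ragged_rows` and `scan_work_dir` are hypothetical and not part of fastdup. It assumes the files are UTF-8 and reports rows whose field count differs from the header's.

```python
import csv
from pathlib import Path


def find_ragged_rows(csv_path):
    """Return (line_number, field_count) for rows that do not match the header width."""
    ragged = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None:
            return ragged  # empty file, nothing to check
        for row in reader:
            if not row:
                continue  # skip blank lines
            if len(row) != len(header):
                # reader.line_num is the source line the row ended on.
                ragged.append((reader.line_num, len(row)))
    return ragged


def scan_work_dir(work_dir):
    """Map each CSV file name in work_dir to its ragged rows, omitting clean files."""
    report = {}
    for csv_path in sorted(Path(work_dir).glob("*.csv")):
        ragged = find_ragged_rows(csv_path)
        if ragged:
            report[csv_path.name] = ragged
    return report
```

Calling `scan_work_dir("./fastdup_report")` would return a dictionary that is empty when every CSV is consistent, and otherwise names each suspect file with the offending line numbers.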

dnth commented 1 year ago

To anyone encountering this issue: I found that this error pops up if a file name has a "," in it.
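
For anyone stuck on an affected version, one possible workaround is to find and rename the offending files before running fastdup. This is only a sketch. The helper names are hypothetical, the underscore replacement is an arbitrary choice, and it only looks at file names, not at commas in directory names.

```python
from pathlib import Path


def find_comma_filenames(images_dir):
    """Return every file under images_dir whose name contains a comma."""
    return sorted(
        path
        for path in Path(images_dir).rglob("*")
        if path.is_file() and "," in path.name
    )


def rename_without_commas(images_dir, replacement="_", dry_run=True):
    """Replace commas in file names and return the (old, new) path pairs.

    With dry_run=True (the default) nothing is renamed, so the plan can be
    reviewed first. Files whose new name already exists are skipped.
    """
    planned = []
    for old_path in find_comma_filenames(images_dir):
        new_path = old_path.with_name(old_path.name.replace(",", replacement))
        if new_path.exists():
            continue  # never overwrite an existing file
        planned.append((old_path, new_path))
        if not dry_run:
            old_path.rename(new_path)
    return planned
```

Running `rename_without_commas("./images")` first shows what would change. Passing `dry_run=False` performs the renames.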


dbickson commented 1 year ago

Hi @dnth, thanks for finding this! Can you look for a solution for reading French Unicode filenames in pandas? I bet someone has already investigated this. Does the filename as written in the atrain_features.dat.csv file match the filename you see in your local folder? Maybe the C side has garbled the encoding?
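
One way to answer the matching question is to compare the names from the CSV against the names on disk, both exactly and after Unicode normalization. The sketch below is an illustration only. `compare_names` is a hypothetical helper, and using NFC normalization is an assumption about how accented names might differ. A name that matches only after normalization points to composed versus decomposed accents. A name that matches neither way suggests the bytes were garbled.

```python
import unicodedata


def compare_names(csv_names, disk_names):
    """Classify each CSV name as an exact match, a match after NFC normalization, or missing."""
    disk_exact = set(disk_names)
    disk_nfc = {unicodedata.normalize("NFC", name) for name in disk_names}
    report = {"exact": [], "normalized_only": [], "missing": []}
    for name in csv_names:
        if name in disk_exact:
            report["exact"].append(name)
        elif unicodedata.normalize("NFC", name) in disk_nfc:
            report["normalized_only"].append(name)
        else:
            report["missing"].append(name)
    return report


# Made-up names: a composed "é" on disk, and three variants in the CSV.
on_disk = ["caf\u00e9.jpg", "plain.jpg"]
in_csv = ["plain.jpg", "cafe\u0301.jpg", "caf\u00c3\u00a9.jpg"]
result = compare_names(in_csv, on_disk)
print(result)
```

The CSV names could be taken from the filename column of atrain_features.dat.csv, and the disk names from a directory listing of the images folder.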

dbickson commented 1 year ago

Hi @dnth, I have fixed this on the C side. Please take version 0.901 and let me know if this works on your filenames.

dnth commented 1 year ago

I confirm this is fixed in version 0.903. Thanks @dbickson!

visual-layer / fastdup

ParserError: Error tokenizing data. C error: Expected 3 fields in line 6, saw 4 #98