Closed dnth closed 1 year ago
Hi @dnth, can you send us the resulting work_dir/atrain_features.dat.csv? Something has gone wrong with the generated columns; maybe one of the filenames has a "," in it?
@dbickson there are a lot of weird characters in the filenames. The file is attached below.
Hi @dnth, the file you provided loads fine in pandas. The error comes from pandas failing to read one of the CSV files, but the error message is not clear enough to tell which file.
Can you locally change the code in your installed fastdup, under site-packages/fastdup/fastdup_controller.py around line 1017, to:
for src_file in fastdup_src_files:
    fname = work_dir / src_file
    if not fname.exists():
        continue
    try:
        df = pd.read_csv(fname)
    except EmptyDataError as e:
        continue
    except Exception as e:
        fastdup_capture_exception(f"fatdup_convert_to_relpath {fname}", e)
        raise RuntimeError(f"fatdup_convert_to_relpath {fname}")
    df['filename'] = df.filename.apply(remove_working_dir)
    df.to_csv(work_dir / src_file, index=False)
    assert os.path.exists(work_dir / src_file), f"Failed to overwrite file {src_file}"
Then rerun, note the filename printed in the error, and share that file.
P.S. The file could also be atrain_features.bad.csv, or atrain_crops.csv if you run with bounding_boxes. One of them has messed-up filenames.
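For anyone who would rather not edit site-packages, here is a minimal sketch of a standalone check. It is not part of fastdup: the helper names `find_malformed_rows` and `scan_work_dir` are made up for illustration, and it assumes the files are UTF-8 CSVs with a header row. It uses only the standard `csv` module and reports rows whose field count differs from the header, which is what an unquoted "," inside a filename would cause.

```python
import csv
from pathlib import Path

# File names mentioned in this thread; adjust to whatever your work_dir contains.
CANDIDATE_FILES = [
    "atrain_features.dat.csv",
    "atrain_features.bad.csv",
    "atrain_crops.csv",
]


def find_malformed_rows(csv_path):
    """Return (line_number, row) pairs whose field count differs from the header."""
    bad = []
    # errors="replace" keeps the scan going even if the encoding is damaged.
    with open(csv_path, newline="", encoding="utf-8", errors="replace") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None:  # empty file, nothing to check
            return bad
        for row in reader:
            if len(row) != len(header):
                bad.append((reader.line_num, row))
    return bad


def scan_work_dir(work_dir, names=CANDIDATE_FILES):
    """Map each existing candidate file name to its list of malformed rows."""
    results = {}
    for name in names:
        path = Path(work_dir) / name
        if path.exists():
            results[name] = find_malformed_rows(path)
    return results


# Usage (hypothetical path):
# for name, rows in scan_work_dir("work_dir").items():
#     print(name, "->", len(rows), "suspicious rows", rows[:5])
```

Any row this reports is a good candidate for the filename that breaks parsing.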
To anyone encountering this issue: I found that this error pops up if the file name has , in it.
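To illustrate why a "," in a file name can break CSV parsing, here is a small sketch using only Python's standard `csv` module. The file name and the two-column layout are invented for the example and are not fastdup's actual output format.

```python
import csv
import io


def parse_line(line):
    """Parse a single CSV line into a list of fields."""
    return next(csv.reader([line]))


# A hypothetical two-column row (filename, score) written WITHOUT quoting.
unquoted = "photos/paris, france.jpg,0.93"
# The comma inside the filename is read as a delimiter, so the row
# comes back with three fields instead of two.
print(len(parse_line(unquoted)))

# Writing the same row through csv.writer quotes the field that contains a comma.
buf = io.StringIO()
csv.writer(buf).writerow(["photos/paris, france.jpg", 0.93])
quoted = buf.getvalue().strip()
print(quoted)
# The quoted row parses back into the original two fields.
print(parse_line(quoted))
```

A reader that expects a fixed number of columns will fail or shift columns on the unquoted row, which matches the symptom described here.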
Hi @dnth, thanks for finding this! Can you look for a solution for reading French Unicode characters in pandas? I bet someone has already investigated this one. Does the filename as written in the atrain_features.dat.csv file match the filename you see in your local folder? Maybe the C side has garbled the encoding?
Hi @dnth, I have fixed this on the C side. Please take version 0.901 and let me know whether it works on your filenames.
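One way to check whether the names in the CSV match the names on disk is to compare them both exactly and after Unicode normalization, since an accented letter such as "é" can be stored either as one code point or as a base letter plus a combining accent. This is only a sketch: `classify_filenames` is a hypothetical helper, and it assumes you have already collected both lists of names (for example from the CSV's filename column and from `os.listdir`).

```python
import unicodedata


def classify_filenames(csv_names, disk_names):
    """Split csv_names into exact matches, matches only after NFC
    normalization, and names with no match on disk at all."""
    disk_exact = set(disk_names)
    disk_nfc = {unicodedata.normalize("NFC", n) for n in disk_names}
    exact, normalized_only, missing = [], [], []
    for name in csv_names:
        if name in disk_exact:
            exact.append(name)
        elif unicodedata.normalize("NFC", name) in disk_nfc:
            # Same visible text, different underlying code points.
            normalized_only.append(name)
        else:
            missing.append(name)
    return exact, normalized_only, missing


# Usage (hypothetical names): the same "café.jpg" in composed and decomposed form.
composed = "caf\u00e9.jpg"
decomposed = "cafe\u0301.jpg"
exact, normalized_only, missing = classify_filenames(
    [decomposed, "a.jpg", "x.jpg"], [composed, "a.jpg"]
)
print(exact, normalized_only, missing)
```

Names that land in the second list point to a normalization difference; names in the third list suggest the encoding was damaged somewhere along the way.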
I confirm this is fixed in version 0.903. Thanks @dbickson !
Python version - 3.10
fastdup version - 0.214
OS - Ubuntu 20.04
I tried to run fastdup on data scraped from Google. Here's how I ran it
Here's the error I got.
Did I miss anything?
Here's the notebook I ran - https://github.com/dnth/clean-up-digital-life-fastdup-blogpost/blob/update-v1/fastdup_analyze.ipynb