google-research / deduplicate-text-datasets

Apache License 2.0

called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" } #51

Open bingkunyao opened 2 months ago

bingkunyao commented 2 months ago

Following the "A full end-to-end single file deduplication example" section in the README, I tried to run `bash scripts/deduplicate_single_file.sh /home/user/deduplicate-text-datasets/test_reduce/testfile.csv /home/user/deduplicate-text-datasets/test_reduce/test_result 400 4` and encountered the error below:

```
called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }
```

I am sure that both the file and the path exist. I also ran `ulimit -Sn 100000`, but it did not help. Note that the CSV file is large (about 2.0 GB). Could anyone help me with this problem?
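For context, a quick illustration (not the repo's code) of why this error does not necessarily mean the input file is missing: os error code 2 (`ENOENT`), which Rust surfaces as `Os { code: 2, kind: NotFound }`, is also raised when a *parent directory* in an output path does not exist. The path below is hypothetical:

```python
import errno

# Opening a file for writing inside a directory that does not exist
# raises FileNotFoundError with errno 2 (ENOENT) -- the same os error
# code Rust reports as Os { code: 2, kind: NotFound }.
try:
    open("no_such_dir/out.table.bin", "wb")  # hypothetical missing dir
except FileNotFoundError as e:
    print(e.errno == errno.ENOENT)  # -> True
```

So the error can come from a missing output directory even when the input file and its path are fine.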

trebedea commented 1 month ago

Do you have a tmp subdirectory in the directory containing the dataset you are indexing / using as a reference when checking for duplicates?

If you look here in the script that creates the suffix array, there is a relative path for `tmp/out.table.bin`:

https://github.com/google-research/deduplicate-text-datasets/blob/4e9888ac3f95dc4f6169867a04c4c19df02dafe3/scripts/make_suffix_array.py#L91-L95

I guess it is a bug; it should probably have been /tmp, as in other places in the codebase. I also ran into this problem yesterday, so maybe this helps you or other people trying to use the tool.
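A workaround sketch, assuming the diagnosis above is right (the script writes to the relative path `tmp/out.table.bin`): create a `tmp/` subdirectory in the working directory from which you launch the script, before running it.

```python
import os

# Workaround sketch: if make_suffix_array.py writes its intermediate
# table to the relative path "tmp/out.table.bin", the "tmp" directory
# must exist under the current working directory. Create it up front.
os.makedirs("tmp", exist_ok=True)
```

Equivalently, a `mkdir -p tmp` in the shell before invoking `scripts/deduplicate_single_file.sh` should have the same effect.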