jvirkki / dupd

CLI utility to find duplicate files
http://www.virkki.com/dupd
GNU General Public License v3.0

setting of MAX_DUPLICATES should be dynamic #1

Closed drasill closed 9 years ago

drasill commented 9 years ago

Hi,

thank you for this wonderful tool, it's indeed fast!

I want to use it to scan an 8M-file directory tree of photos, where there may be a lot of duplicates.

Indeed, there are: I hit the default value of MAX_DUPLICATES (8000).

Why is there a limit? Is it only a warning? What are the risks of setting it to, for example, 16000?

Thanks,

jvirkki commented 9 years ago

Indeed, this should not be a hardcoded limit, as the comments note; it should really be dynamic.

There is no risk in setting the constant as high as you need, other than the extra memory allocated; as long as that is not a problem, it is fine.

Note that this is the number of duplicates of a single file, not overall. So you have more than 8000 duplicates of a single file!

During scan this is only a warning, and the results in the db are correct. You can also run report without issues, since this limit does not apply there. It will, however, prevent the ls/dups/uniques operations from working (although only when they hit one of the files in the large duplicate set).

You can identify the large set(s) by looking at the sqlite database directly:

% sqlite3 $HOME/.dupd_sqlite    (or the appropriate db file if you override the location)
sqlite> select * from duplicates where count > 8000;

Sometimes tiny files may end up having lots of duplicates. If the above shows that the files with more than 8000 duplicates are small and you don't mind ignoring them, one workaround is to specify --minsize during scan to skip them.

The largest set I have processed is a bit under 2M files, so if you run into any other issues handling 8M files let me know.

drasill commented 9 years ago

I'm not sure it is useful to make it dynamic, as long as it is only a warning. 8k is a lot, indeed.

The thing is, the script warns each time it finds a new duplicate above the limit (8001, 8002, 8003, etc.), which is scary to look at.

I set the limit to 16k and I don't get the warning anymore; now that I know it's only a warning, changing it dynamically isn't so important.

Thank you for your detailed answer!

jvirkki commented 9 years ago

Removed this hardcoded limit.