Open szabi opened 5 years ago
Thanks for mentioning it. The benchmarks need some updating anyway; I'll try to put it on my list for cloudy days.
I'm building a dockerized environment for benchmarking file duplicate finders, including rmlint, jdupes, dupd, and others.
Two questions:
Hello @maxsu,
The benchmark script is `tests/test_speed`. The script that generated the plots is this one here; the secret sauce is the Python library pygal.
Performance differs wildly depending on the ratio between the number of files and the average file size. Consider different use-cases to stay close to what users actually do, e.g.:

- `/usr` (very many files, almost all small)

Apart from that, there are other criteria that make it hard to compare:
- whether the page cache was dropped between runs (`sync; echo 3 > /proc/sys/vm/drop_caches`)

In summary, you should provide a few different data sets (or rather, scripts to generate them) that make it possible to tell the tools apart.
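The cold-vs-warm-cache criterion in particular is easy to control from a small wrapper. A minimal sketch (not part of the existing test suite; `drop_caches` needs root, and the command list is just a placeholder):

```python
import subprocess
import time

def drop_caches():
    """Flush dirty pages and evict the page cache (requires root)."""
    subprocess.run(["sync"], check=True)
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")

def bench(cmd, cold_cache=False):
    """Time one run of `cmd`; optionally start from a cold page cache."""
    if cold_cache:
        drop_caches()
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start
```

Something like `bench(["rmlint", "/usr"], cold_cache=True)` would then give comparable cold-cache numbers across tools.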
If I were to do it again, I would probably put together a script that takes a directory and copies its structure, but fills it with dummy data, driven by a number of options (e.g. the percentage of duplicates). This would allow creating test workloads that are close to real user data (e.g. by pointing the script to a backup folder).
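Such a script could look roughly like this sketch: it mirrors a directory's structure and file sizes but writes dummy bytes, and a `dup_fraction` option controls how many files share identical content. All names here are hypothetical, not an existing tool:

```python
import os
import random

def mirror_tree(src, dst, dup_fraction=0.3, seed=0):
    """Recreate the directory structure of `src` under `dst`,
    filling each file with dummy data of the original size.
    Roughly `dup_fraction` of the files get identical content,
    so duplicate finders have something to find."""
    rng = random.Random(seed)
    dup_payload = b"\x42" * 4096  # shared content for the "duplicate" files
    for root, dirs, files in os.walk(src):
        rel = os.path.relpath(root, src)
        target = os.path.join(dst, rel)
        os.makedirs(target, exist_ok=True)
        for name in files:
            size = os.path.getsize(os.path.join(root, name))
            with open(os.path.join(target, name), "wb") as f:
                if rng.random() < dup_fraction:
                    # repeat/truncate the shared payload to the original size
                    f.write((dup_payload * (size // 4096 + 1))[:size])
                else:
                    f.write(rng.randbytes(size))
```

Pointing it at a backup folder would give a workload with realistic file-size and directory-depth distributions, without shipping anyone's private data.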
I would obviously be interested in the results, especially if they would allow me to deprecate the benchmark page and link to yours instead. :smirk:
(Ping for @SeeSpotRun.)
The section about the number of files found reports 0/0 for one of the other tools, and the text following it, which claims a small difference between `rmlint` and `fdupes`, does not match the data shown (the numbers are equal).