sahib / rmlint

Extremely fast tool to remove duplicates and other lint from your filesystem
http://rmlint.rtfd.org
GNU General Public License v3.0

readthedocs benchmark page out of date/out of sync #324

Open szabi opened 5 years ago

szabi commented 5 years ago

The section about the number of files found reports 0/0 for one of the other tools, and the text that follows claims a small difference between rmlint and fdupes, which does not match the data shown (the numbers are equal).

sahib commented 5 years ago

Thanks for mentioning it. The benchmarks need some updating anyway; I'll try to put it on my list for cloudy days.

maxsu commented 4 years ago

I'm building a dockerized environment for benchmarking file duplicate finders, including rmlint, jdupes, dupd, and others.

Two questions:

  1. Can you share the code for producing the graphs in https://github.com/sahib/rmlint/blob/master/docs/benchmarks.rst?
  2. Do you have recommendations for generating the test workloads?
sahib commented 4 years ago

Hello @maxsu,

  1. The code is part of the repository and can be found in tests/test_speed. The script that generated the plots is this one here. The secret sauce is the Python library pygal.
  2. Performance differs wildly depending on the ratio between the number of files and the average file size. To stay close to what users do, consider different use cases such as:

    • Running on a music collection (mostly equally sized files, relatively small sizes and numbers)
    • Running on /usr (very many files, almost all small)
    • Running on a backup (huge, wildly different file sizes)

    Apart from that, there are other criteria that make it hard to compare:

    • Percentage of duplicates in data set.
    • Number of hardlinks / symbolic links / reflinks / sparse files.
    • Type of underlying disk (HDD, SSD, tmpfs...)
    • The kind of filesystem in use.
    • Hashing algorithm (although the rule should be "just use the application's default").
    • Caching (well, this can be tackled by `sync; echo 3 > /proc/sys/vm/drop_caches`)
    • In the case of HDDs, how the data is aligned on the disk (i.e. seek thrashing)
    • (...the list can probably be extended...)
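    The caching point can be wrapped into a small helper for cold-cache runs. A sketch (the function names are hypothetical; actually dropping the page cache requires root, so the helper only warns when it can't):

    ```shell
    #!/bin/bash
    # Hypothetical helper for cold-cache benchmark runs.

    drop_caches() {
        sync                                  # flush dirty pages first
        if [ -w /proc/sys/vm/drop_caches ]; then
            echo 3 > /proc/sys/vm/drop_caches # drop page cache, dentries and inodes
        else
            echo "warning: not root, page cache not dropped" >&2
        fi
    }

    # Time a tool invocation with (ideally) cold caches,
    # e.g.: cold_run rmlint /data
    cold_run() {
        drop_caches
        time "$@"
    }
    ```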

    In summary, you should provide a few different data sets (or rather, scripts to generate them) that make it possible to tell the tools apart.

    If I were to do it again, I would probably put together a script that takes a directory and copies its structure, but fills it with dummy data, driven by a number of options (e.g. the percentage of duplicates). This would allow creating test workloads that are close to real user data (e.g. by pointing the script at a backup folder).
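    The structure-mirroring idea could be sketched roughly like this (a hypothetical helper, not code from the repository; `clone_tree` and its parameters are made up for illustration):

    ```python
    import os
    import random

    def clone_tree(src, dst, dup_ratio=0.3, seed=0):
        """Mirror the directory structure of `src` into `dst`, replacing each
        file's contents with random dummy data of the same size. Roughly
        `dup_ratio` of the files reuse an earlier payload instead, so the clone
        contains a controllable share of duplicates (note: a duplicated file
        then has the size of that earlier file, not of the one it mirrors)."""
        rng = random.Random(seed)
        payloads = []  # payloads already written, candidates for duplication
        for root, dirs, files in os.walk(src):
            target = os.path.join(dst, os.path.relpath(root, src))
            os.makedirs(target, exist_ok=True)
            for name in files:
                size = os.path.getsize(os.path.join(root, name))
                if payloads and rng.random() < dup_ratio:
                    data = rng.choice(payloads)  # reuse -> duplicate
                else:
                    data = rng.randbytes(size)   # fresh dummy data, same size
                    payloads.append(data)
                with open(os.path.join(target, name), "wb") as f:
                    f.write(data)
    ```

    Pointing it at a real backup folder would then yield a synthetic data set with realistic file counts and size distributions, without copying the actual data.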
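    As for point 1, a minimal pygal sketch of a grouped bar chart like the ones on the benchmark page (the timings below are made-up placeholders, not real benchmark numbers):

    ```python
    import pygal

    # Made-up example timings in seconds; not real benchmark results.
    timings = {
        "rmlint": [4.2, 12.5],
        "fdupes": [9.8, 30.1],
    }

    chart = pygal.Bar()
    chart.title = "Time to scan data set (seconds, lower is better)"
    chart.x_labels = ["music collection", "/usr"]
    for tool, seconds in timings.items():
        chart.add(tool, seconds)

    svg = chart.render()                    # the SVG document as bytes
    # chart.render_to_file("benchmark.svg") # or write it straight to a file
    ```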

I would be interested in the results obviously. Especially if it would allow me to deprecate the benchmark page and link to your results. :smirk:

(Ping for @SeeSpotRun.)