sahib / rmlint

Extremely fast tool to remove duplicates and other lint from your filesystem
http://rmlint.rtfd.org
GNU General Public License v3.0

Improve ETA estimation to consider IOPS for small files #473

tcrossland closed 3 years ago

tcrossland commented 3 years ago

It seems that the ETA is currently calculated only from the number of bytes scanned vs. remaining, without considering the number of files or IOPS. With a high proportion of small files this isn't very meaningful: although large files account for 99% of the bytes scanned, smaller files represent a larger proportion of the IOPS. In these scenarios, we might get something like this (real-world example):

Matching (2880681 dupes of 54748 originals; 0,71 MB to scan in 12477677 files, ETA: 0s)

Assuming we can scan 2000 small files per second, we'd still need to wait more than an hour for this, whereas the ETA indicates we're finished.
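
The arithmetic behind that estimate can be checked directly. This is only a back-of-the-envelope sketch: the file count comes from the progress line above, and the 2000 files per second rate is the assumption stated in the comment, not a measured figure.

```python
# File count taken from the progress line; the scan rate is an assumption.
files_remaining = 12_477_677
files_per_second = 2000  # assumed small-file throughput

seconds = files_remaining / files_per_second
hours, rem = divmod(int(seconds), 3600)
minutes = rem // 60

# prints: 6239 s is about 1 h 43 min
print(f"{seconds:.0f} s is about {hours} h {minutes} min")
```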

SeeSpotRun commented 3 years ago

Fair point. Any suggestions on how much weight to give to a zero-size file? I'm thinking maybe 4k bytes per file.

tcrossland commented 3 years ago

I think a typical inode size is 256 bytes (depending on file system). Also, I guess files with size zero should be ignored completely for deduplication.

SeeSpotRun commented 3 years ago

I think it's 4096 on btrfs.

On spinning media it's going to be seek time that is the killer.

Anyway I'll try with 4096 and you can let me know if the ETA estimates improve.
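
A minimal sketch of the weighting idea discussed above, assuming each file is charged a fixed 4096 byte-equivalents on top of its real size. This is not rmlint's actual implementation (which is in C); the names `PER_FILE_OVERHEAD`, `weighted_work` and `eta_seconds` are hypothetical and exist only for illustration.

```python
PER_FILE_OVERHEAD = 4096  # bytes charged per file; the value proposed in the thread


def weighted_work(total_bytes, file_count):
    """Work in byte-equivalents: real bytes plus a fixed cost per file."""
    return total_bytes + PER_FILE_OVERHEAD * file_count


def eta_seconds(remaining_bytes, remaining_files, done_bytes, done_files, elapsed):
    """Estimate seconds left from the weighted rate observed so far."""
    done = weighted_work(done_bytes, done_files)
    if done == 0 or elapsed <= 0:
        return None  # no progress yet, so no basis for an estimate
    rate = done / elapsed  # byte-equivalents per second
    return weighted_work(remaining_bytes, remaining_files) / rate


# Hypothetical progress: 2000 empty files handled in the first second,
# with roughly 0.71 MB in 12477677 files still to go.
eta = eta_seconds(710_000, 12_477_677, 0, 2000, 1.0)
print(f"ETA: {eta:.0f} s")
```

With a per-file charge, a queue of millions of tiny files no longer collapses to an ETA of 0s, because the file count contributes to the remaining work even when the byte count is negligible.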

SeeSpotRun commented 3 years ago

https://github.com/SeeSpotRun/rmlint/tree/better_ETA if you want to give it a whirl

tcrossland commented 3 years ago

Hmmm, scons config worked fine but it's not linking. I'm sure it's something trivial, but I haven't figured it out yet...

Linking Program ==> rmlint
/usr/bin/ld: librmlint.a(utilities.o): in function `rm_mounts_create_tables':
utilities.c:(.text+0x152a): undefined reference to `minor'
/usr/bin/ld: utilities.c:(.text+0x153d): undefined reference to `major'
/usr/bin/ld: utilities.c:(.text+0x1627): undefined reference to `minor'
/usr/bin/ld: utilities.c:(.text+0x1637): undefined reference to `major'
/usr/bin/ld: utilities.c:(.text+0x1653): undefined reference to `minor'
/usr/bin/ld: utilities.c:(.text+0x1666): undefined reference to `major'
/usr/bin/ld: utilities.c:(.text+0x17b8): undefined reference to `makedev'

tcrossland commented 3 years ago

@SeeSpotRun Subjectively, I think the new ETA estimates work a lot better. If anything, they seem to be a bit conservative (the scan tends to finish sooner than the ETA suggests), but that's not necessarily a bad thing from the waiting user's point of view. The ETA still jumps around quite a lot during long scans (see #343) but seems to converge on a sensible value towards the end of the scan. Thanks!

SeeSpotRun commented 3 years ago

Yeah, I'll have a closer look at the averaging algorithm; it's pretty basic at the moment. But if you've ever worked with Windows progress bars, I think you'll agree it could be a lot worse.
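
The thread doesn't say what the averaging algorithm looks like. One common way to damp a jumpy rate estimate is an exponentially weighted moving average, sketched below purely as an illustration. The `SmoothedRate` class and the `alpha` value are hypothetical and are not taken from rmlint.

```python
class SmoothedRate:
    """Exponentially weighted moving average of a throughput reading."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha  # weight given to the newest sample (hypothetical default)
        self.value = None

    def update(self, sample):
        if self.value is None:
            self.value = sample  # first reading seeds the average
        else:
            self.value = self.alpha * sample + (1 - self.alpha) * self.value
        return self.value


rate = SmoothedRate(alpha=0.25)
for sample in [100.0, 100.0, 500.0, 100.0]:  # one spike among steady readings
    smoothed = rate.update(sample)

print(smoothed)  # prints: 175.0
```

A smaller `alpha` gives a steadier ETA at the cost of reacting more slowly when throughput really does change, for example when the scan moves from large files to many small ones.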

tcrossland commented 3 years ago

:) Absolutely... I wasn't criticizing, I think it's really useful (even more so with this change). Thanks for your work.

SeeSpotRun commented 3 years ago

Addressed by #481