Closed: tcrossland closed this issue 3 years ago
Fair point. Any suggestions on how much weight to give to a zero-size file? I'm thinking maybe 4k bytes per file.
I think a typical inode size is 256 bytes (depending on file system). Also, I guess files with size zero should be ignored completely for deduplication.
I think it's 4096 on btrfs.
On spinning media it's going to be seek time that is the killer.
Anyway I'll try with 4096 and you can let me know if the ETA estimates improve.
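A minimal sketch of the weighting idea discussed above, assuming each file is charged a fixed 4096 "equivalent bytes" on top of its real size. The helper names `weighted_work` and `eta_seconds` are hypothetical and are not taken from the rmlint source; the actual branch may compute this differently.

```c
#include <stdint.h>

/* Assumed per-file overhead in "equivalent bytes" (the 4096 proposed above). */
#define PER_FILE_OVERHEAD_BYTES 4096ULL

/* Hypothetical helper: work expressed as real bytes plus a fixed cost per file,
 * so that zero-size and tiny files still count towards the estimate. */
static uint64_t weighted_work(uint64_t bytes, uint64_t files) {
    return bytes + files * PER_FILE_OVERHEAD_BYTES;
}

/* Hypothetical helper: ETA in seconds from the weighted work done so far,
 * the elapsed time, and the weighted work remaining.
 * Returns -1.0 while no rate can be computed yet. */
static double eta_seconds(uint64_t done_work, double elapsed_s,
                          uint64_t remaining_work) {
    if (done_work == 0 || elapsed_s <= 0.0) {
        return -1.0;
    }
    double rate = (double)done_work / elapsed_s; /* weighted bytes per second */
    return (double)remaining_work / rate;
}
```

With illustrative numbers, 1000 empty files processed in 2 seconds and 3000 empty files remaining, `eta_seconds(weighted_work(0, 1000), 2.0, weighted_work(0, 3000))` gives 6 seconds, whereas a bytes-only estimate would report 0.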
https://github.com/SeeSpotRun/rmlint/tree/better_ETA if you want to give it a whirl
Hmmm, scons config worked fine but it's not linking. I'm sure it's something trivial but haven't figured it out yet...
Linking Program ==> rmlint
/usr/bin/ld: librmlint.a(utilities.o): in function `rm_mounts_create_tables':
utilities.c:(.text+0x152a): undefined reference to `minor'
/usr/bin/ld: utilities.c:(.text+0x153d): undefined reference to `major'
/usr/bin/ld: utilities.c:(.text+0x1627): undefined reference to `minor'
/usr/bin/ld: utilities.c:(.text+0x1637): undefined reference to `major'
/usr/bin/ld: utilities.c:(.text+0x1653): undefined reference to `minor'
/usr/bin/ld: utilities.c:(.text+0x1666): undefined reference to `major'
/usr/bin/ld: utilities.c:(.text+0x17b8): undefined reference to `makedev'
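Undefined references to minor, major and makedev like the ones above are a common symptom on Linux with recent glibc versions, where these macros are no longer provided by `<sys/types.h>` and have to come from `<sys/sysmacros.h>`. Whether that is the cause in this branch is an assumption, since the thread does not confirm it. A minimal sketch of the include that normally resolves it (`rebuild_dev` is a hypothetical helper, not part of rmlint):

```c
#include <sys/types.h>
#include <sys/sysmacros.h> /* provides major(), minor(), makedev() on glibc */

/* Hypothetical helper: split a device number into major/minor parts and
 * rebuild it, to show that the three macros resolve and round-trip. */
static dev_t rebuild_dev(dev_t dev) {
    unsigned int maj = major(dev);
    unsigned int min = minor(dev);
    return makedev(maj, min);
}
```

If this is the problem, adding the include to utilities.c (or to whichever header it pulls in) should be enough for the link step to succeed.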
@SeeSpotRun Subjectively, I think the new ETA estimates work a lot better. If anything, they seem to be a bit conservative (the scan tends to finish sooner than the ETA suggests), but that's not necessarily a bad thing from the waiting user's point of view. The ETA still jumps around quite a lot during long scans (see #343) but seems to converge on a sensible value towards the end of the scan. Thanks!
Yeah I'll have a closer look at the averaging algorithm, it's pretty basic at the moment. But if you've ever worked with Windows progress bars I think you'll agree it could be a lot worse.
:) Absolutely... I wasn't criticizing, I think it's really useful (even more so with this change). Thanks for your work
Addressed by #481
It seems that the ETA is currently calculated only from the number of bytes scanned vs. remaining, without considering the number of files or IOPS. With a large number of small files this isn't very meaningful: although large files account for 99% of the bytes scanned, the small files account for most of the IOPS. In these scenarios, we might get something like this (real-world example):
Matching (2880681 dupes of 54748 originals; 0,71 MB to scan in 12477677 files, ETA: 0s)
Assuming we can scan 2000 small files per second, the remaining 12477677 files would still take about 6200 seconds (more than an hour and a half), whereas the ETA indicates we're finished.