GHANONTEST closed this issue 10 years ago
I was actually planning to wait until I'd finished tuning it to be maximally efficient on non-SSD devices and then move it to GitHub but, if you think it's good enough now, that's fine with me. :)
(At the moment, the design and tunings in the released version aren't ideal for things with high seek time. It's still faster than the stuff I used to use, but I was planning to actually do some cold-cache benchmarking and redesign this summer.)
It's probably superior to mine all around, but I haven't had time to test. I do notice that the documentation doesn't explain how the paranoid-mode comparison works. It's possible that either my current or planned approach to byte-for-byte comparison could be more efficient.
Hi there,
@ssokolow: I tried to run fastdupes.py, but got the following Traceback:
chris@sloth /tmp » python2 fastdupes.py ./
Found (170, 170) files to be compared for duplication.
Found 14 sets of files with identical sizes. (151 files examined)
Found 13 sets of files with identical heads. (14 sets examined)
Scanning for real duplicates... 0 of 13 sets processed
Traceback (most recent call last):
File "fastdupes.py", line 467, in <module>
groups = subgroupByHashes(groups)
File "fastdupes.py", line 290, in subgroupByHashes
raise NotImplementedError("TODO: Finish implementing this")
NotImplementedError: TODO: Finish implementing this
Is there maybe a more recent version?
> It's probably superior to mine all around, but I haven't had time to test. I do notice that the documentation doesn't explain how the paranoid-mode comparison works. It's possible that either my current or planned approach to byte-for-byte comparison could be more efficient.
Could be. Currently for --paranoid the file is reopened (or mmap()'d for smaller files), and a block-wise read()/memcmp() is done. Responsible function: https://github.com/sahib/rmlint/blob/master/src/mode.c#L152
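For reference, the same block-wise approach can be sketched in Python (the language fastdupes.py is written in). This is a minimal illustration of the read()/compare loop, not rmlint's actual code; the block size and helper name are my own:

```python
import os

def files_identical(path_a, path_b, block_size=64 * 1024):
    """Block-wise byte-for-byte comparison, analogous to rmlint's
    read()/memcmp() loop. The size check up front mirrors the fact that
    candidates have already been grouped by size before this step."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        while True:
            block_a = fa.read(block_size)
            block_b = fb.read(block_size)
            if block_a != block_b:
                return False
            if not block_a:  # both files hit EOF at the same offset
                return True
```

A real implementation would additionally want to interleave reads across more than two files and tune `block_size` for the device, which is exactly the seek-time tuning discussed above.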
Not sure if rmlint is superior; the code is a bit, uh... I was young and didn't know better :smile: I've wanted to rewrite it for a while, but other projects and studies keep me occupied. Hopefully after this semester.
I notice that it doesn't seem to claim to support hardlinking as a resolution for duplicate matches
Not sure what exactly you mean; if you pass the same file/folder twice on the command line, files with duplicate inodes get filtered out, at least.
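That kind of filtering amounts to deduplicating paths on their (device, inode) pair. A minimal Python sketch of the idea (a hypothetical helper, not rmlint's actual code):

```python
import os

def unique_by_inode(paths):
    """Drop paths that refer to the same underlying file on disk,
    keyed on (st_dev, st_ino) so hardlinks and repeated arguments
    are only considered once."""
    seen = set()
    result = []
    for path in paths:
        st = os.stat(path)
        key = (st.st_dev, st.st_ino)
        if key not in seen:
            seen.add(key)
            result.append(path)
    return result
```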
@GHANONTEST: I will wait till @ssokolow has finished his tuning. Also note, the comparison is not really a good one.
Hardlinking is already possible with -c:
chris@chris /tmp/testdir » echo 'Hello World' > a
chris@chris /tmp/testdir » echo 'Hello World' > b
chris@chris /tmp/testdir » rmlint a b -m cmd -c 'rm <dupl> && ln <orig> <dupl>' -v0
chris@chris /tmp/testdir » ls -li a b # same inodes
10658700 -rw-r--r-- 2 chris users 2 28. Mai 17:15 a
10658700 -rw-r--r-- 2 chris users 2 28. Mai 17:15 b
Edit: You can always help documenting, and reporting - as you do now.
@GHANONTEST: Actually, symlinking is possible with -m link according to the docs. My main issue with rmlint is that I don't approve of having to use the generic command fallback for hardlinking, because it's a prime candidate for user error. I feel that, at least in that respect, it needs improvement. (I know I took forever to reliably learn the order of the arguments to ln.)
My script has hardlinking planned as an addition and, when I do add it, it'll follow the existing pattern of being extremely careful to prevent a confused user from being able to accidentally delete all copies of a file.
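One careful way to do that (a hypothetical sketch, not fastdupes' actual implementation) is to never unlink the duplicate before its replacement exists: link the original into a temporary name first, then atomically rename over the duplicate, so at no point is any copy of the data unreachable:

```python
import os

def hardlink_dupe(original, duplicate):
    """Replace `duplicate` with a hardlink to `original` without any
    window in which the data could be lost. Both paths must be on the
    same filesystem (a requirement of os.link). Hypothetical helper."""
    if os.path.samefile(original, duplicate):
        return  # already the same inode; nothing to do
    tmp = duplicate + '.hardlink-tmp'  # arbitrary temp suffix
    os.link(original, tmp)  # fails loudly if tmp already exists
    try:
        os.rename(tmp, duplicate)  # atomic replacement on POSIX
    except OSError:
        os.unlink(tmp)  # clean up the temp link on failure
        raise
```

Contrast this with `rm <dupl> && ln <orig> <dupl>`, where a crash (or swapped arguments) between the two commands leaves you a copy short.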
@sahib: That's definitely not right. I must've accidentally pushed the wrong copy to the server when I was fixing up a mess in ~/src about six months ago. (My motherboard gave up the ghost and I took it as an opportunity to fix things up using my slow, old backup PC.)
I've got an exam I'm cramming for right now, but I'll try to find the right copy in what time I can spare. If I can't find it, I'm done on June 1st and I'll rewrite it.
@GHANONTEST: Even if @sahib weren't implying that rmlint's code is ugly, I really detest working in C. (I can do it, but only under protest)
FastDupes, like everything else, is a hobby project. I do occasionally code in Vala (which compiles to C), but I'd probably be more willing to re-implement it from scratch in RPython and use the PyPy translator to convert it to optimized native code.
@sahib Sorry for the wait. I was unexpectedly busy over the last couple of weeks.
Somehow, the copy that was available was the form it briefly took where it defaulted to hash-based comparison but didn't have it implemented. (The one you tried should have worked if you had used --exact for seek-heavy, memory-efficient, SSD-optimized exact comparison.)
I couldn't find the finished copy I had at some point prior to hosting it on GitHub Pages, so I've committed a rewrite of it.
Closing; I'll do a benchmark of the develop branch against all current alternatives at some point.
Just as a follow-up: there are some benchmarks now: http://rmlint.readthedocs.org/en/latest/benchmarks.html
http://ssokolow.com/scripts/index.html#fastdupes.py
and add a comparison between the two....
@ssokolow :P