sahib / rmlint

Extremely fast tool to remove duplicates and other lint from your filesystem
http://rmlint.rtfd.org
GNU General Public License v3.0

add this to your alternatives/competitors list #14

Closed GHANONTEST closed 10 years ago

GHANONTEST commented 12 years ago

http://ssokolow.com/scripts/index.html#fastdupes.py

and add a comparison between the two....

@ssokolow :P

ssokolow commented 12 years ago

I was actually planning to wait until I'd finished tuning it to be maximally efficient on non-SSD devices and then move it to GitHub but, if you think it's good enough now, that's fine with me. :)

(At the moment, the design and tunings in the released version aren't ideal for things with high seek time. It's still faster than the stuff I used to use, but I was planning to actually do some cold-cache benchmarking and redesign this summer.)

ssokolow commented 12 years ago

It's probably superior to mine all around, but I haven't had time to test. I do notice that the documentation doesn't explain how the paranoid-mode comparison works. It's possible that either my current or planned approach to byte-for-byte comparison could be more efficient.

sahib commented 12 years ago

Hi there,

@ssokolow: I tried to run fastdupes.py, but got the following Traceback:

chris@sloth /tmp » python2 fastdupes.py ./
Found (170, 170) files to be compared for duplication.
Found 14 sets of files with identical sizes. (151 files examined)
Found 13 sets of files with identical heads. (14 sets examined)
Scanning for real duplicates... 0 of 13 sets processed
Traceback (most recent call last):
  File "fastdupes.py", line 467, in <module>
    groups = subgroupByHashes(groups)
  File "fastdupes.py", line 290, in subgroupByHashes
    raise NotImplementedError("TODO: Finish implementing this")
NotImplementedError: TODO: Finish implementing this

Is there maybe a more recent version?

> It's probably superior to mine all around, but I haven't had time to test. I do notice that the documentation doesn't explain how the paranoid-mode comparison works. It's possible that either my current or planned approach to byte-for-byte comparison could be more efficient.

Could be. Currently, with --paranoid, each file is reopened (or mmap()'d, for smaller files) and compared block-wise with read()/memcmp(). The responsible function: https://github.com/sahib/rmlint/blob/master/src/mode.c#L152
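For readers who want to see the idea without digging into mode.c, here is a minimal Python sketch of a block-wise byte-for-byte comparison. It is an illustration only, not rmlint's C code: the function name `files_identical` and the 64 KiB block size are assumptions made for this example, and the mmap() path for small files is left out.

```python
import os

BLOCK_SIZE = 64 * 1024  # hypothetical block size, not taken from rmlint


def files_identical(path_a, path_b, block_size=BLOCK_SIZE):
    """Return True if both files contain exactly the same bytes."""
    # Files of different sizes can never match, so skip reading them.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            block_a = file_a.read(block_size)
            block_b = file_b.read(block_size)
            if block_a != block_b:
                return False  # first differing block ends the comparison
            if not block_a:
                return True  # both files reached EOF without a mismatch
```

Stopping at the first differing block is what keeps this cheap for files that only share a size.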

Not sure if rmlint is superior; the code is a bit, uh... I was young and didn't know better :smile: I've been wanting to rewrite it, but other projects and my studies keep me occupied. Hopefully after this semester.

> I notice that it doesn't seem to claim to support hardlinking as a resolution for duplicate matches

Not sure what exactly you mean. If you pass the same file/folder twice on the command line, files with duplicate inodes get filtered out, at least.
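To illustrate the inode filtering mentioned above, here is a minimal Python sketch, assuming a POSIX-like filesystem where the (device, inode) pair identifies a file. The helper name `unique_by_inode` is hypothetical and this is not how rmlint implements it internally.

```python
import os


def unique_by_inode(paths):
    """Keep only the first path seen for each (device, inode) pair."""
    seen = set()
    unique = []
    for path in paths:
        st = os.stat(path)
        key = (st.st_dev, st.st_ino)  # the same key means the same file on disk
        if key not in seen:
            seen.add(key)
            unique.append(path)
    return unique
```

Passing the same path twice, or two hard links to one file, yields a single entry, so such files are never reported as duplicates of themselves.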

@GHANONTEST: I will wait until @ssokolow has finished his tuning. Also note that the comparison is not really a good one.

sahib commented 12 years ago

Hardlinking is already possible with -c:

chris@chris /tmp/testdir » echo 'Hello World' > a
chris@chris /tmp/testdir » echo 'Hello World' > b
chris@chris /tmp/testdir » rmlint a b -m cmd -c 'rm <dupl> && ln <orig> <dupl>' -v0
chris@chris /tmp/testdir » ls -li a b # same inodes
10658700 -rw-r--r-- 2 chris users 2 28. Mai 17:15 a
10658700 -rw-r--r-- 2 chris users 2 28. Mai 17:15 b

Edit: You can always help by documenting and reporting, as you are doing now.

ssokolow commented 12 years ago

@GHANONTEST: Actually, symlinking is possible with -m link according to the docs. My main issue with rmlint is that I don't approve of having to use the generic command fallback for hardlinking because it's a prime candidate for user error. I feel that, at least in that respect, it needs improvement. (I know I took forever to reliably learn the order of the arguments to ln)

My script has hardlinking planned as an addition and, when I do add it, it'll follow the existing pattern of being extremely careful to prevent a confused user from being able to accidentally delete all copies of a file.
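As an illustration of the kind of caution described here, this is a minimal Python 3 sketch of replacing a duplicate with a hard link so that the duplicate's name never stops existing. It is not the FastDupes or rmlint implementation: the function name `replace_with_hardlink` and the temporary file name are assumptions for this example, and the caller is assumed to have already verified that both files have identical content and live on the same filesystem.

```python
import os


def replace_with_hardlink(original, duplicate):
    """Replace `duplicate` with a hard link to `original`.

    Returns False if both paths already refer to the same file.
    """
    if os.path.samefile(original, duplicate):
        return False  # already hard-linked (or the same path): do nothing
    dup_dir = os.path.dirname(os.path.abspath(duplicate))
    # Hypothetical temporary name next to the duplicate.
    tmp_name = os.path.join(dup_dir, ".hardlink-tmp-%d" % os.getpid())
    # Create the new link first, so nothing is deleted if linking fails.
    os.link(original, tmp_name)
    try:
        # Rename over the duplicate in a single step.
        os.replace(tmp_name, duplicate)
    except OSError:
        os.unlink(tmp_name)
        raise
    return True
```

Because the argument order is fixed by the function signature and the link is created before anything is removed, mixing up the arguments to ln or losing the last copy is much harder than with a hand-written `rm ... && ln ...` command.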

@sahib: That's definitely not right. I must've accidentally pushed the wrong copy to the server when I was fixing up a mess in ~/src about six months ago. (My motherboard gave up the ghost and I took it as an opportunity to fix things up using my slow, old backup PC)

I've got an exam I'm cramming for right now, but I'll try to find the right copy in what time I can spare. If I can't find it, I'm done on June 1st and I'll rewrite it.

ssokolow commented 12 years ago

@GHANONTEST: Even if @sahib weren't implying that rmlint's code is ugly, I really detest working in C. (I can do it, but only under protest)

FastDupes, like everything else, is a hobby project. I do occasionally code in Vala (which compiles to C), but I'd probably be more willing to re-implement it from scratch in RPython and use the PyPy translator to convert it to optimized native code.

ssokolow commented 12 years ago

@sahib Sorry for the wait. I was unexpectedly busy over the last couple of weeks.

Somehow, the copy that was available was the form it briefly took where it defaulted to hash-based comparison but didn't have it implemented. (The one you tried should have worked if you used --exact for seek-heavy, memory-efficient, SSD-optimized exact comparison)
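For context, the staged narrowing visible in the output quoted earlier in this thread (identical sizes, then identical heads, then a full check) can be sketched in Python 3 as follows. This is a rough illustration of a hash-based pipeline, not the actual fastdupes.py code: the function names, the 16 KiB head size, and the choice of SHA-1 are all assumptions made for this example.

```python
import hashlib
import os
from collections import defaultdict

HEAD_BYTES = 16 * 1024  # hypothetical head size


def _group(paths, key_func):
    """Split paths by key_func and drop groups with a single member."""
    groups = defaultdict(list)
    for path in paths:
        groups[key_func(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]


def _head_hash(path):
    """Hash only the first HEAD_BYTES of the file."""
    with open(path, "rb") as f:
        return hashlib.sha1(f.read(HEAD_BYTES)).hexdigest()


def _full_hash(path):
    """Hash the whole file, reading it in 64 KiB blocks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def find_duplicate_groups(paths):
    """Narrow candidates by size, then head hash, then full hash."""
    groups = [list(paths)]
    for key_func in (os.path.getsize, _head_hash, _full_hash):
        groups = [sub for group in groups for sub in _group(group, key_func)]
    return groups
```

Each stage only reads files that survived the cheaper stage before it, which is where most of the speed comes from. A final hash match can still be confirmed with a byte-for-byte comparison when certainty matters.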

I couldn't find the finished copy I had at some point prior to hosting it on GitHub Pages, so I've committed a rewrite of it.

sahib commented 10 years ago

Closing. I will benchmark the develop branch against all current alternatives at some point.

sahib commented 8 years ago

Just as a follow-up: There are some benchmarks now: http://rmlint.readthedocs.org/en/latest/benchmarks.html