sahib / rmlint

Extremely fast tool to remove duplicates and other lint from your filesystem
http://rmlint.rtfd.org
GNU General Public License v3.0

Using --replay with --size #540

Open rrueger opened 2 years ago

rrueger commented 2 years ago

I recently ran rmlint -c sh:clone -o sh:rmlint.sh on a large subvolume.

Upon inspection, I saw that there are many very small files that would need to be cloned.

Using the --replay functionality, I would like to create a new rmlint-10M+.sh file that only clones files larger than 10M.

Since the manual states that --replay doesn't perform any disk I/O, I would expect that

rmlint -c sh:clone -o sh:rmlint-10M+.sh --size=10M --replay rmlint.json

is a valid command that would take the files from rmlint.json and create a script that only clones files with sizes greater than (or equal to) 10M.

Instead, rmlint produces an empty* rmlint-10M+.sh script.

It looks like I also need to specify the path along with the rmlint ... --replay call (example below). According to glances, doing this produces a non-trivial (~1 MB/s) amount of I/O long into the run, so it is likely not just reading the JSON file.

*Empty, as in, the autogenerated component is empty. All the handler definitions are still there.
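
For reference, the invocation with the path included looks roughly like the line below; /path/to/subvolume is just a placeholder for the subvolume the original run scanned:

rmlint -c sh:clone -o sh:rmlint-10M+.sh --size=10M --replay rmlint.json /path/to/subvolume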

rrueger commented 2 years ago

Edit: Accidentally pressed Enter not Shift+Enter for a new line, submitting the issue prematurely.

cebtenzzre commented 1 year ago

Replay mode has called stat() on paths since its inception in 2015, so the documentation is not accurate when it claims that replay "does no input/output". It does avoid recomputing file hashes, but it still reads file metadata from disk. And, quoting the docs:

Usage is simple: Just pass --replay on the second run, with other options changed to the new formatters or filters. Pass the .json files of the previous runs additionally to the paths you ran rmlint on.

Do you think the docs should be clarified? Would it be more convenient if --replay processed all files by default instead of none of them?
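
If you want to confirm the metadata reads yourself, one rough way is to count stat-family syscalls during a replay run, assuming your strace is recent enough to understand the %stat syscall class (strace 4.26+); /path/to/subvolume is again a placeholder:

strace -f -c -e trace=%stat rmlint --replay rmlint.json /path/to/subvolume

The -c summary should show stat-family calls for the replayed paths, which is consistent with the I/O that glances reports.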