sahib / rmlint

Extremely fast tool to remove duplicates and other lint from your filesystem
http://rmlint.rtfd.org
GNU General Public License v3.0

Different results printed with and without --merge-directories, but same (empty) autogen output scripts #489

Closed: alichaudry closed this issue 3 years ago

alichaudry commented 3 years ago

As the subject states, I'm getting different printed results with and without the --merge-directories option, but in both cases the generated rmlint.sh script has "nothing to do" when I execute it (that is, there is nothing to execute in the auto-generated portion of the script). For reference, here is the printed output from both runs (I had to delete the progress bars as they were messing up the formatting here):

$ cat dirs_list.txt | rmlint --merge-directories --progress -

Traversing (144328 usable files / 0 + 0 ignored files / folders)
Preprocessing (reduces files to 78207 / found 0 other lint)
Matching (1654 dupes of 1085 originals; 0 B to scan in 0 files, ETA: 10s)
Merging files into directories (stand by...)

==> In total 144328 files, whereof 1654 are duplicates in 1085 groups.
==> This equals 301.93 MB of duplicates which could be removed.
==> Scanning took in total  7m 35.171s.

Wrote a json file to: ~/tmp/rmlint.json
Wrote a sh file to: ~/tmp/rmlint.sh

$ cat dirs_list.txt | rmlint --progress -

Traversing (130127 usable files / 575 + 158 ignored files / folders)
Preprocessing (reduces files to 65356 / found 0 other lint)
Matching (0 dupes of 0 originals; 13.62 GB to scan
Matching (0 dupes of 0 originals; 0 B to scan in 0 files, ETA: 18m 37s)

==> In total 130127 files, whereof 0 are duplicates in 0 groups.
==> This equals 0 B of duplicates which could be removed.
==> Scanning took in total 11m 47.168s.

Wrote a json file to: ~/tmp/rmlint.json
Wrote a sh file to: ~/tmp/rmlint.sh

and my dirs_list.txt looks something like:

/backups/1
/backups/2
/backups/3
/backups/4
/backups/5
/backups/6
/backups/7
/backups/8
/backups/9

the full output of my rmlint.json (without merge-dirs):

[
{
  "description": "rmlint json-dump of lint files",
  "cwd": "~/tmp/",
  "args": "rmlint --progress -",
  "version": "2.9.0",
  "rev": "2",
  "progress": 0,
  "checksum_type": "blake2b"
}, {
  "aborted": false,
  "progress": 100,
  "total_files": 130127,
  "ignored_files": 575,
  "ignored_folders": 158,
  "duplicates": 0,
  "duplicate_sets": 0,
  "total_lint_size": 0
}]

and the full output of my rmlint.json (with merge-dirs):

[
{
  "description": "rmlint json-dump of lint files",
  "cwd": "~/tmp/",
  "args": "rmlint --merge-directories --progress -",
  "version": "2.9.0",
  "rev": "2",
  "progress": 0,
  "checksum_type": "blake2b"
}, {
  "aborted": false,
  "progress": 100,
  "total_files": 144328,
  "ignored_files": 0,
  "ignored_folders": 0,
  "duplicates": 1654,
  "duplicate_sets": 1085,
  "total_lint_size": 316593043
}]
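
For readers comparing the two dumps, here is a minimal Python sketch that diffs the summary footers shown above. The two dictionaries are copied from this thread; the helper names `load_footer` and `diff_footers` are made up for illustration and are not part of rmlint. It assumes, as in the dumps above, that the footer is the last object in the JSON array.

```python
import json

# Summary footers copied from the two rmlint.json dumps above.
PLAIN = {
    "aborted": False, "progress": 100, "total_files": 130127,
    "ignored_files": 575, "ignored_folders": 158,
    "duplicates": 0, "duplicate_sets": 0, "total_lint_size": 0,
}
MERGED = {
    "aborted": False, "progress": 100, "total_files": 144328,
    "ignored_files": 0, "ignored_folders": 0,
    "duplicates": 1654, "duplicate_sets": 1085, "total_lint_size": 316593043,
}


def load_footer(path):
    """Return the last object of an rmlint JSON dump (assumed to be the footer)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)[-1]


def diff_footers(a, b):
    """Map every key whose value differs to an (a_value, b_value) pair."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__":
    for key, (plain, merged) in diff_footers(PLAIN, MERGED).items():
        print(f"{key}: {plain} -> {merged}")
    # total_lint_size is in bytes; dividing by 2**20 gives the size in MiB.
    print(f"{MERGED['total_lint_size'] / 2**20:.2f} MiB")
```

Dividing `total_lint_size` by 2**20 reproduces the 301.93 figure printed by the --merge-directories run, and the diff makes it plain that the two runs did not even traverse the same number of files (130127 versus 144328).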

other info:

$ rmlint --version
version 2.9.0 compiled: Dec 31 2019 at [22:27:25] "Odd Olm" (rev 2)
compiled with: +mounts +nonstripped +fiemap +sha512 +bigfiles +intl +replay +xattr +btrfs-support

rmlint was written by Christopher <sahib> Pahl and Daniel <SeeSpotRun> Thomas.
The code at https://github.com/sahib/rmlint is licensed under the terms of the GPLv3.
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.2 LTS
Release:        20.04
Codename:       focal
$ uname -a
Linux delhi 5.4.0-56-generic #62-Ubuntu SMP Mon Nov 23 19:20:19 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ apt list --installed | grep -i [rR]mlint

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

rmlint-gui/focal,focal,now 2.9.0-2 all [installed,automatic]
rmlint/focal,now 2.9.0-2 amd64 [installed]

alichaudry commented 3 years ago

Going through the other issues, perhaps the one here is related? The poster says "empty" files, but it isn't clear whether they mean a zero-length file or no auto-gen output for the shell script to work on. I, too, am running rmlint on a zfs-backed FS -- but I'm not sure what has changed, as it has been working fine for a couple of weeks now.

SeeSpotRun commented 3 years ago

$ rmlint --version
version 2.9.0 compiled: Dec 31 2019 at [22:27:25] "Odd Olm" (rev 2)

Are you able to compile with the latest version? See https://rmlint.readthedocs.io/en/latest/install.html

SeeSpotRun commented 3 years ago

Also is it possible that there are no duplicates at all in the 130127 files in backups/[1..9] ?

alichaudry commented 3 years ago

Are you able to compile with the latest version? See https://rmlint.readthedocs.io/en/latest/install.html

Yes, I will try to compile the latest version and loop back on how that works out for the two commands I ran.

Also is it possible that there are no duplicates at all in the 130127 files in backups/[1..9] ?

Good question! I executed a couple of targeted runs of rmlint across different subsets of the data and removed many duplicates already. This last overarching run (covering the full data set) was intended to highlight any duplicates I may have missed in my targeted runs, so I was surprised to see it find so many duplicates when run with the --merge-directories arg. Would you happen to know if there are ways to programmatically check for duplicates without using rmlint?

SeeSpotRun commented 3 years ago

Would you happen to know if there are ways to programmatically check for duplicates without using rmlint?

Only with another dupe finder, maybe rdfind, see https://www.tecmint.com/find-and-delete-duplicate-files-in-linux/
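
As an independent cross-check, a small script can also do the job. The following is a minimal sketch, not how rmlint or rdfind work internally: it groups files by size, then by SHA-256 of their contents. The function names are made up for illustration. Note that `os.walk` visits hidden files and directories too, that hardlinks to the same inode are reported as duplicates, and that overlapping root directories would list the same path twice.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(roots):
    """Return a list of groups (sorted path lists) with identical content."""
    by_size = defaultdict(list)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                # Skip symlinks and anything that is not a regular file.
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for size, paths in by_size.items():
        # Only files sharing a non-zero size can be duplicates worth hashing.
        if size == 0 or len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups


if __name__ == "__main__":
    import sys
    for group in find_duplicates(sys.argv[1:]):
        print("\n".join(group), end="\n\n")
```

Saved under a name of your choice, it could be run with the directories from dirs_list.txt as arguments. Hashing every same-sized file is slow on large trees, so this is only meant as a sanity check against a dedicated tool's results.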

alichaudry commented 3 years ago

Okay, so I compiled the following version:

$ ./compiled/rmlint/rmlint --version
version 2.10.1 compiled: Apr 20 2021 at [22:08:44] "Ludicrous Lemur" (rev a8c39c8d)
compiled with: +mounts +nonstripped +fiemap +sha512 +bigfiles +intl +xattr +btrfs-support

rmlint was written by Christopher <sahib> Pahl and Daniel <SeeSpotRun> Thomas.
The code at https://github.com/sahib/rmlint is licensed under the terms of the GPLv3.

Both of my runs now (with and without --merge-dirs) show the same result, that is, no auto-gen output in the two .sh files. However -- this time the .json file from the --merge-dirs run has tons of data in it! Almost all of the entries are of type "part_of_directory", and skimming through a few, it seems like some are git refs, whereas some are ipynb and manifest files. Other than the git records, all of them are really small junk files.

I can try rdfind next.
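
To get an overview of such a dump without skimming it by hand, the entries can be tallied by their "type" field. This is a minimal sketch under assumptions: the sample below is hypothetical (the paths are invented, and "duplicate_file" is assumed to be another type value alongside the "part_of_directory" mentioned above), and `count_types` is an illustrative helper, not part of rmlint. Header and footer objects carry no "type" key in the dumps shown in this thread, so they are skipped.

```python
import json
from collections import Counter


def count_types(entries):
    """Count objects by their "type" field, ignoring objects without one."""
    return Counter(
        e["type"] for e in entries if isinstance(e, dict) and "type" in e
    )


# Hypothetical miniature dump: header, three lint entries, footer.
SAMPLE = json.loads("""
[
  {"description": "rmlint json-dump of lint files"},
  {"type": "part_of_directory", "path": "/backups/1/.git/HEAD"},
  {"type": "part_of_directory", "path": "/backups/2/.git/HEAD"},
  {"type": "duplicate_file", "path": "/backups/3/notes.txt"},
  {"aborted": false, "progress": 100}
]
""")

if __name__ == "__main__":
    for lint_type, count in count_types(SAMPLE).most_common():
        print(f"{lint_type}: {count}")
```

For a real dump, load the file with `json.load` and pass the resulting list to `count_types` instead of `SAMPLE`.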

SeeSpotRun commented 3 years ago

Almost all of them are of type "part_of_directory", and skimming through a few it seems like some are git refs

Ok, I think I get it now; --treemerge has somewhat confusing handling of hidden files (see #361, #383).

If you run rmlint --hidden (but without --merge-dirs), then I suspect you'll find (and generate a script to remove) all of the hidden dupes. Be careful deleting files in .git folders: that will break the git repo, so you may as well delete the whole .git if you need the space.
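
To see how much of a tree is hidden in the first place, one can count files that sit under a dot-prefixed name. This is a minimal sketch under a simplifying assumption: a file counts as hidden if any component of its path relative to the scanned root starts with a dot. rmlint's own rules may differ in detail, and the function names are made up for illustration.

```python
import os


def is_hidden(rel_path):
    """True if any component of a relative path starts with a dot."""
    parts = rel_path.split(os.sep)
    return any(p.startswith(".") for p in parts if p not in ("", ".", ".."))


def count_hidden(root):
    """Return (hidden, visible) file counts below root."""
    hidden = visible = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        rel_dir = os.path.relpath(dirpath, root)
        for name in filenames:
            rel = os.path.normpath(os.path.join(rel_dir, name))
            if is_hidden(rel):
                hidden += 1
            else:
                visible += 1
    return hidden, visible


if __name__ == "__main__":
    import sys
    for root in sys.argv[1:]:
        hidden, visible = count_hidden(root)
        print(f"{root}: {hidden} hidden, {visible} visible")
```

A large hidden count dominated by .git directories would be consistent with the explanation above, where most of the reported entries were git refs.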

alichaudry commented 3 years ago

Yes, that was it! Using the --hidden flag (without --merge-dirs) gave me similar results in the rmlint.json as with --merge-dirs on its own.

I believe we should be good now -- I'll close the issue. Thank you.