Closed alichaudry closed 3 years ago
Going through the other issues, perhaps the one here is related? The poster says "empty" files, but it isn't clear whether they mean a zero-length file or no auto-gen output for the shell script to work on. I, too, am running rmlint on a ZFS-backed FS, but I'm not sure what's changed, as it's been working fine for a couple of weeks now.
$ rmlint --version
version 2.9.0 compiled: Dec 31 2019 at [22:27:25] "Odd Olm" (rev 2)
Are you able to compile with the latest version? See https://rmlint.readthedocs.io/en/latest/install.html
Also is it possible that there are no duplicates at all in the 130127 files in backups/[1..9] ?
> Are you able to compile with the latest version? See https://rmlint.readthedocs.io/en/latest/install.html
Yes, I will try to compile the latest version and loop back on how that works out for the two commands I ran.
> Also is it possible that there are no duplicates at all in the 130127 files in backups/[1..9] ?
Good question! I executed a couple of targeted runs of `rmlint` across different subsets of the data and removed many duplicates already. This last overarching run (covering the full data set) was intended to highlight any duplicates I may have missed in my targeted runs, so I was surprised to see it find so many duplicates when run with the `--merge-directories` arg. Would you happen to know if there are ways to programmatically check for duplicates without using rmlint?
> Would you happen to know if there are ways to programmatically check for duplicates without using rmlint?
Only with another dupe finder, maybe `rdfind`; see https://www.tecmint.com/find-and-delete-duplicate-files-in-linux/
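For a quick cross-check that does not depend on any dupe finder, a small script can group files by size and then by content hash. The following is only a minimal sketch, not how rmlint or rdfind work internally: the function names `file_digest` and `find_duplicates` and the `include_hidden` parameter are made up for illustration, SHA-256 is an arbitrary choice of hash, and zero-length files are deliberately skipped.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root, include_hidden=True):
    """Return lists of paths under root whose size and content hash match."""
    by_size = defaultdict(list)
    for dirpath, dirnames, filenames in os.walk(root):
        if not include_hidden:
            # Prune dot-directories (e.g. .git) and skip dot-files.
            dirnames[:] = [d for d in dirnames if not d.startswith(".")]
            filenames = [f for f in filenames if not f.startswith(".")]
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Only regular files; symlinks are ignored.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            by_size[os.path.getsize(path)].append(path)

    groups = []
    for size, paths in by_size.items():
        # A unique size cannot have a duplicate; empty files are skipped.
        if size == 0 or len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for p in paths:
            by_hash[file_digest(p)].append(p)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Calling `find_duplicates("backups")` once with `include_hidden=True` and once with `include_hidden=False` gives a rough idea of how many duplicate groups exist only because of hidden files or directories.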
Okay, so I compiled the following version:
$ ./compiled/rmlint/rmlint --version
version 2.10.1 compiled: Apr 20 2021 at [22:08:44] "Ludicrous Lemur" (rev a8c39c8d)
compiled with: +mounts +nonstripped +fiemap +sha512 +bigfiles +intl +xattr +btrfs-support
rmlint was written by Christopher <sahib> Pahl and Daniel <SeeSpotRun> Thomas.
The code at https://github.com/sahib/rmlint is licensed under the terms of the GPLv3.
Both of my runs now (with and without `--merge-dirs`) show the same result, that is, no auto-gen output in the two `.sh` files. However, this time the `.json` file from the `--merge-dirs` run has tons of data in it! Almost all of the entries are of type "part_of_directory", and skimming through a few, it seems some are git refs, whereas others are ipynb and manifest files. Other than the git records, all of them are really small, junk files.
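Rather than skimming the JSON by eye, the entries can be tallied by type. This is a rough sketch under two assumptions: that `rmlint.json` is a JSON array of objects, and that each lint entry carries a `"type"` key (such as the "part_of_directory" value mentioned above) while any header or footer objects do not. The function name `count_lint_types` is hypothetical.

```python
import json
from collections import Counter


def count_lint_types(json_path):
    """Count entries per "type" value in an rmlint-style JSON array."""
    with open(json_path, encoding="utf-8") as f:
        entries = json.load(f)
    # Objects without a "type" key (assumed header/footer) are ignored.
    return Counter(
        e["type"] for e in entries if isinstance(e, dict) and "type" in e
    )
```

Running `count_lint_types("rmlint.json")` on the output of each run makes it easier to compare the two results side by side.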
I can try rdfind next.
> Almost all of them are of type "part_of_directory", and skimming through a few it seems like some are git refs
Ok, I think I get it now; `--treemerge` has somewhat confusing handling of hidden files (see #361, #383).
If you run `rmlint --hidden` (but without `--merge-dirs`) then I suspect you'll find (and generate a script to remove) all of the hidden dupes. Be careful deleting files in `.git` folders: that will break the git repo, so you may as well delete the whole `.git` if you need the space.
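Before running any removal script, it can help to separate the reported paths that live inside a `.git` directory from everything else, so the former can be reviewed (or left alone) as a group. A minimal sketch, assuming POSIX-style paths; `split_git_paths` is a hypothetical helper, not part of rmlint:

```python
from pathlib import PurePosixPath


def split_git_paths(paths):
    """Split paths into (inside a .git directory, everything else)."""
    inside, outside = [], []
    for p in paths:
        # A path counts as "inside" if any component is exactly ".git".
        if ".git" in PurePosixPath(p).parts:
            inside.append(p)
        else:
            outside.append(p)
    return inside, outside
```

Files such as `.gitignore` are not treated as being inside a repository's `.git` directory, since only an exact `.git` path component matches.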
Yes, that was it! Using the `--hidden` flag (without `--merge-dirs`) gave me similar results in the `rmlint.json` as with `--merge-dirs` on its own.
I believe we should be good now -- I'll close the issue. Thank you.
As the subject states, I'm getting different printed results with and without the `--merge-directories` option, but the `rmlint.sh` scripts have "nothing to do" when I execute them (that is, there is nothing to execute in the auto-generated portion of the script). For reference, here is the printed output from both runs (I had to delete the progress bars as they were messing up the formatting here):
and my `dirs_list.txt` looks something like:
the full output of my rmlint.json (without merge-dirs):
and the full output of my rmlint.json (with merge-dirs):
other info: