sahib / rmlint

Extremely fast tool to remove duplicates and other lint from your filesystem
http://rmlint.rtfd.org
GNU General Public License v3.0

Reported size of dupes found incorrect / missing dupes / cksum incorrect #436

Closed by james-cook 3 years ago

james-cook commented 4 years ago

Possibly related to https://github.com/sahib/rmlint/issues/430

Raspi4 Rmlint version 2.10.1

I'm rmlinting a 1.8TB NTFS drive "A" full of videos against a 4TB ext4 drive "B", which should contain all of the vids on "A" (running rmlint to make sure).

du -hs of "A" reports 1.8TB; a file count shows 80024 files on "A".

rmlint was run with -T "defaults,-emptyfiles", --keep-all-tagged, etc.

rmlint reports "195276 files, of which 77886 are dupes", which is in the right ballpark. But it then says "this equals 122.04GB of duplicates which could be removed", which looks wrong, or suggests there are roughly 2000 large videos being "missed" (80000 − 78000 approx.), e.g. 2000 files of 800MB each. Possible.

cat rmlint.sh | grep remove_cmd | wc -l returns 77887, which is correct :)

As a next step I'm thinking of extracting the dupes from rmlint.sh or rmlint.json and adding all their sizes together. This way I can confirm if rmlint is ok or not.

Or would it be possible to print out the non-dupes on drive "A"?

james-cook commented 4 years ago

Or would it be possible to print out the non-dupes on drive "A"?

I followed this line.

  1. I used regexes and sed to extract the names, from rmlint.sh, of all the duplicates found, sorted to a file "C"
  2. I used find to create a list of all files on "A", sorted to a file "D"
  3. used comm -23 "D" "C" to create a list of all "missing dupes" "E"
  4. this file contains 2138 lines
  5. looking at the files, it seems unlikely they account for the missing 1.6TB, but I need to check this...
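The comparison in steps 1–3 above can be sketched with stand-in data (all paths here are hypothetical; in the real run "C" comes from rmlint.sh and "D" from find /A -type f | sort):

```shell
# fresh scratch directory so the sketch is repeatable
rm -rf /tmp/comm-demo && mkdir /tmp/comm-demo && cd /tmp/comm-demo
# stand-ins for the two sorted lists:
#   C = duplicate names extracted from rmlint.sh
#   D = every file on drive "A"
printf '%s\n' /A/a.avi /A/b.avi | sort > C
printf '%s\n' /A/a.avi /A/b.avi /A/c.avi /A/d.avi | sort > D
# comm -23 keeps lines only in D, i.e. files rmlint did not list as dupes
comm -23 D C > E
wc -l < E    # prints 2
```

Note comm requires both inputs to be sorted with the same locale/collation, otherwise it silently produces garbage; sorting both lists with the same sort invocation avoids that.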

I took a random entry in this file "E" and compared the two copies of a video "M9-0223.avi" on "A" and "B" using cmp; echo $? shows 0, which should mean the files are byte-for-byte the same. Then, doing xattr -p user.rmlint.blake2.cksum on both files, the values are NOT the same (with xattr -p user.rmlint.blake2.mtime on both files, the values ARE the same).

Doing md5sum on both files returns the same value, as does b2sum(!). The b2sum is different from both user.rmlint.blake2.cksum values held in xattr for the files.
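The invariant being relied on here can be demonstrated with a tiny fixture (filenames are made up): byte-identical files must produce identical checksums under any hash, so a stored xattr that disagrees with a freshly computed checksum must be stale or wrong.

```shell
rm -rf /tmp/cksum-demo && mkdir /tmp/cksum-demo && cd /tmp/cksum-demo
# two stand-in copies of the same video, one per drive in the real case
printf 'same contents' > copyA.avi
printf 'same contents' > copyB.avi
cmp copyA.avi copyB.avi; echo $?    # prints 0: byte-for-byte identical
# identical bytes -> identical digests, whatever the hash function
md5sum copyA.avi copyB.avi
```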

AFAICT there are still 2 problems:

  1. hashes for dupes are not always the same
  2. the reported size of dupes found is incorrect

I am wondering if this is all to do with perhaps:

  1. running this on raspberry pi debian 32bit "in general"
  2. fact that drive "A" is ntfs and "B" is ext4

As reported in https://github.com/sahib/rmlint/issues/434 I do get "Added big file" warnings when running rmlint on the raspi4.

All suggestions welcome :)

james-cook commented 4 years ago

I ran du on the files "C" and "D" and got 1.5TB and 1.8TB respectively (i.e. 1.5TB is the value rmlint should have reported).

I ran du on "E" and got 361GB
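One way to get such totals from a list of paths (assuming GNU du; "C" holds one path per line) is sketched below with stand-in files:

```shell
rm -rf /tmp/size-demo && mkdir /tmp/size-demo && cd /tmp/size-demo
head -c 100 /dev/zero > f1          # 100-byte stand-in file
head -c 50  /dev/zero > f2          #  50-byte stand-in file
printf '%s\n' "$PWD/f1" "$PWD/f2" > C
# sum the apparent size of every file listed in C
# (-b = bytes/apparent size, -c = grand total; GNU du only)
tr '\n' '\0' < C | du -cb --files0-from=- | tail -n1    # "150  total"
```

Using NUL-separated input (--files0-from) keeps this safe for filenames containing spaces; only embedded newlines would need extra care.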

sahib commented 4 years ago

Or would it be possible to print out the non-dupes on drive "A"?

It's possible to output unique files. Also see the examples with jq there. That's way easier than grepping rmlint.sh.

Then doing: xattr -p user.rmlint.blake2.cksum on both files the values are NOT the same (xattr -p user.rmlint.blake2.mtime on both files the values ARE the same)

And you are sure they were considered by rmlint (i.e. not leftovers from an interrupted previous run), that the files were not written to in between, and that you ran with xattr saving? It is not on by default.

I am wondering if this is all to do with perhaps:

  1. running this on raspberry pi debian 32bit "in general"
  2. fact that drive "A" is ntfs and "B" is ext4
For 1: maybe, for the counters. For 2: probably not.

I ran du on the files "C" and "D" and got 1.5TB and 1.8TB respectively (i.e. 1.5TB is the value rmlint should have reported).

Have you checked if there are hardlinks between dupes? rmlint only reports the size you gain by removing them. If you remove a hardlinked dupe, the size gain is zero. You should check the inodes of the files.
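Hardlinks are easy to check for with stat: hardlinked names share an inode and show a link count above 1. A minimal illustration (hypothetical filenames):

```shell
rm -rf /tmp/hl-demo && mkdir /tmp/hl-demo && cd /tmp/hl-demo
printf 'video bytes' > orig.avi
ln orig.avi copy.avi                    # second name for the same inode
# %i = inode, %h = hardlink count, %n = name (GNU stat)
stat -c '%i %h %n' orig.avi copy.avi    # same inode, link count 2
# removing one of the two names frees no space, which is why
# a hardlinked dupe contributes 0 bytes of reclaimable size
```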


I will look into this more when I have some more time. Thanks for the issue report.

james-cook commented 4 years ago

Also see the examples with jq there. That's way easier than grepping rmlint.sh.

Good to know, but I had deleted the JSON by accident anyway. Sed is not so bad...

Xattrs from a previous run... Drive "A" (the NTFS one) was only made writable to the current system (raspi4) just before this run, BUT it may have been processed by rmlint in W10 WSL Ubuntu beforehand. Drive "B" was formatted a week ago, but I have run rmlint 4 or 5 times on all/parts of this drive. I always run rmlint with --xattrs; should I be flushing these attributes between runs?

Kind of strange that the user.rmlint.blake2b.mtime is identical for the file in question.

Hardlinks - I'd be very surprised to find any such links. The files originate from Windows 10/8, and I've never used links on Windows.

All I am doing in this run, after having copied the files via rsync from NTFS to ext4, is making extra sure the files are really all on ext4. Pseudo code: rmlint -g -T "defaults,-emptyfiles" --xattrs --keep-all-tagged ntfs-drive // ext4-drive

And you are sure they were considered by rmlint

Well, the files don't appear in rmlint.sh, so how can I best check this? Run with -vvvv? Check the "human" value of the mtime xattr...?

james-cook commented 4 years ago

Additional note: these are all ancient video files whose contents have not changed in at least 10 years. This is why it is so strange to find that the cksums are different.

I wonder if b2sum calculated under WSL W10 Ubuntu could possibly differ from b2sum under Raspbian... and whether rsync, when copying, possibly changes the xattr value (I used rsync -av). It seems far-fetched. The majority of dupes were correctly identified; just ca. 2000 were not...

AFAICT the only safe way to proceed would be to delete the user.... cksum xattr for all files on both sides. I did this for the single file mentioned above and rmlint worked correctly.

james-cook commented 4 years ago

I went back to W10 WSL Ubuntu and ran rmlint twice on a directory containing the above file, so a cksum was generated. This was the same as the cksum generated by rmlint on the Raspberry Pi 4. I don't have the resources to fully investigate the problem, unfortunately. So I ended up deleting all the xattrs on the Raspberry Pi (this takes ages!). My concern is not so much false mismatches as false matches(!).

Somewhere during the last weeks incorrect b2sums were generated.

Recap: I was initially using rmlint on W10 WSL Ubuntu and then moved these files to ext4 on a Raspberry Pi using rsync. Comparing the resulting copies (W10 to Raspberry Pi), some files did not match, although the majority did. Strangely, in these cases the cksums for the file on W10 and on the Raspberry Pi were BOTH incorrect. I thought mtime was the time of cksum generation, which made things seem even weirder, but mtime is the file's mtime at the time of cksum generation. All the files are old videos which haven't changed in years.

Long story short: without some really verbose logging this is hard/impossible to debug(!). I did at some point try the "match directories" option with rmlint... I'm not sure this should affect cksums in any way, though.

nijhawank commented 3 years ago

This seems to be related to issue #439 that also says that the checksum stored in the extended attribute is incorrect.

SeeSpotRun commented 3 years ago

I think @nijhawank is correct here. The behaviour of --write-unfinished was confusing, and it is now deprecated on the develop branch (https://github.com/sahib/rmlint/tree/develop).

New options --hash-unmatched and --hash-uniques provide a more robust alternative. Both only write completed checksums.

So now rmlint --xattr --hash-uniques will hash every (non-zero-size) file in the search path.

And rmlint --xattr --hash-unmatched will only hash files that might have had a duplicate. This generally means there was another file with the exact same size.

Closing issue. Feel free to re-open if you want to continue the conversation.

cebtenzzre commented 2 years ago

I'm going to assume there really were hardlinks in this case. Possibly on the NTFS volume in the form of DOS 8.3 short filenames; this is likely if it was/is the system disk of a Windows PC. When these are present, AFAIK ntfs-3g counts them in st_nlink even though unlinking either name removes both, and only the long one shows up in directory listings. This is an unfortunate corner case, but I don't think there is a simple fix that preserves the clear, useful output on native filesystems.