pixelb / fslint

Linux file system lint checker/cleaner

[Improvement] Feature to check for files [on an external disk] which are *not* present somewhere on the [backup] disk #162

Open Wikinaut opened 5 years ago

Wikinaut commented 5 years ago

I wish to have a feature which makes intelligent use of the checksums/hashes of the huge "backup" drive X, so that when I connect a smaller drive Z to my computer, I can quickly list all those files on Z which are *not* present anywhere on X.

This is a "one-way" check. I don't want to have the huge list of differences. I only want to know those files from Z which for one reason or another have not been copied (or later moved) to drive X, on any directory there. So basically, it's a checksum/hash issue.
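A minimal sketch of such a one-way check, assuming POSIX shell and GNU coreutils (`md5sum`); the function name and the use of a temporary hash index are illustrative assumptions, and filenames containing newlines are not handled:

```shell
# Sketch: list files under the small drive whose content (md5) does not
# occur anywhere under the backup drive, regardless of path or name.
not_backed_up() {   # not_backed_up SMALL_DRIVE BACKUP_DRIVE
    src=$1 dst=$2
    idx=$(mktemp)
    # Hash every file on the backup once; keep only the checksums.
    find "$dst" -type f -exec md5sum {} + | cut -d' ' -f1 | sort -u > "$idx"
    # Report every source file whose checksum is absent from the backup.
    find "$src" -type f -exec md5sum {} + |
    while read -r sum path; do
        grep -qxF "$sum" "$idx" || printf '%s\n' "$path"
    done
    rm -f "$idx"
}
```

Because the comparison is by checksum only, a file counts as backed up even if it lives under a different name or directory on the backup drive.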

Wikinaut commented 5 years ago

Hello, can we talk about such a new feature? If you wish, I can explain again why rsync is not a solution.

It's something like https://askubuntu.com/a/767988

> fdupes is an excellent program to find the duplicate files but it does not list the non-duplicate files, which is what you are looking for. However, we can list the files that are not in the fdupes output using a combination of find and grep.

pixelb commented 5 years ago

OK, an rsync solution should work if the structure in the dest were similar to that in the source, i.e. something like rsync -rl --dry-run --out-format="%f" --checksum Z/ X/

So I presume the structure of your source Z is different from that in dest X, i.e. you want to list files not backed up, no matter where they are in Z, so that you can copy them to the appropriate location in X etc.

So you want the equivalent of the following, but with more efficient handling of unique file sizes etc:

    $ SRC=Z/; DST=X/
    $ find "$SRC" "$DST" -type f -print0 | xargs -0 md5sum | sed "\|  $DST|p" |
      sort | uniq -w32 -u | cut -d' ' -f3
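To see why this works: the sed prints every $DST line a second time, so after sort any checksum that exists in $DST occurs at least twice and is dropped by uniq -w32 -u (unique on the 32-character md5 prefix); only checksums unique to $SRC survive. A self-contained toy run, assuming GNU coreutils and throwaway temporary paths:

```shell
# Toy run of the pipeline above on temporary directories.
tmp=$(mktemp -d)
SRC=$tmp/Z/ DST=$tmp/X/
mkdir -p "$SRC" "$DST"
printf 'backed up\n'     > "${SRC}a"
printf 'backed up\n'     > "${DST}copy"   # same content, different name
printf 'not backed up\n' > "${SRC}b"

# Duplicate the $DST lines, then keep only checksums seen exactly once.
find "$SRC" "$DST" -type f | xargs md5sum | sed "\|  $DST|p" |
  sort | uniq -w32 -u | cut -d' ' -f3     # prints only ${SRC}b
```

Note that cut -d' ' -f3 is needed (not -f2) because md5sum separates checksum and name with two spaces.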

One could avoid the overhead of scanning and checksumming $DST if it was not updated between fslint dedupe runs. In that case fslint could write an index of size,checksum,name which could be used directly in the process above.
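A hypothetical sketch of that index idea: one "size,checksum,name" line per backup file, written once and then queried without rescanning the backup. The CSV layout and the function names are assumptions, not actual fslint output; the size column also allows the cheap unique-size shortcut mentioned above, since a file whose size occurs nowhere in the backup cannot have a content match and need not be hashed at all:

```shell
# Write one "size,checksum,name" line per file under the backup dir.
write_index() {   # write_index BACKUP_DIR  (index lines on stdout)
    find "$1" -type f -exec sh -c '
        for f; do
            size=$(wc -c < "$f" | tr -d " ")
            sum=$(md5sum "$f" | cut -d" " -f1)
            printf "%s,%s,%s\n" "$size" "$sum" "$f"
        done' sh {} +
}

# List files under SRC_DIR whose content is absent from the index.
missing_from_index() {   # missing_from_index SRC_DIR INDEX_FILE
    find "$1" -type f | while read -r f; do
        size=$(wc -c < "$f" | tr -d ' ')
        # Size pre-check: no matching size in the backup means no possible
        # content match, so skip the (expensive) checksum entirely.
        if ! grep -q "^$size," "$2"; then
            printf '%s\n' "$f"
            continue
        fi
        sum=$(md5sum "$f" | cut -d' ' -f1)
        grep -q ",$sum," "$2" || printf '%s\n' "$f"
    done
}
```

Typical use would be `write_index X/ > index.csv` once after a dedupe run, then `missing_from_index Z/ index.csv` each time a small drive is attached.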

Wikinaut commented 5 years ago

Yes, the structure is different, or may be different, so we have to "search" for the file hash.

I also found this proposal for "fdupes" https://github.com/adrianlopezroche/fdupes/issues/19

It would be good to save the hash/analysis information of a specific fdupes run, in order to later compare this "virtual" file tree with a real file tree.


Currently I run the suggested sequence from https://askubuntu.com/a/767988 (see above) to list the files which are unique to backup (Z in my example), i.e. which are in backup but not in documents. [My use case is the reverse: to look for files which are not yet somewhere in the "backup".]

    $ fdupes -r backup/ documents/ > dup.txt
    $ find backup/ -type f | grep -Fxvf dup.txt