sahib / rmlint

Extremely fast tool to remove duplicates and other lint from your filesystem
http://rmlint.rtfd.org
GNU General Public License v3.0

Combining merge-directories with follow-symlinks fails #564

Open christianlupus opened 2 years ago

christianlupus commented 2 years ago

Hello. I wanted to use rmlint to cut down the manual work required to merge multiple directory trees.

I have some files on my hard drive and, in the past, made backups of them on an external HDD. In the meantime, I renamed some folders, moved them around, and eventually put them into a git-annex repository for management. Now I want to merge the files again. As there are quite a lot of them, I wanted to use rmlint to reduce the number of files as much as possible.

On the HDDs, there are collections of images, brushes, font files, etc., organized in folders/bundles. I want to keep the overall structure of the bundles so that I can see which bundle a given file belongs to. This is especially relevant for keeping track of the licenses of the files. Some bundles contain duplicates (e.g. a brush is in multiple bundles). Thus, I need to check whether a complete folder was copied over, not just some random duplicate files. Otherwise, my bundles might get broken apart and I would lose the license information.

Because the files live in a git-annex repository, they are replaced by symlinks in the working tree. See below for details and step-by-step instructions to reproduce the problem.


Short introduction to git-annex: the file contents are stored in a dedicated location outside the working tree (under .git/annex/objects), and the working tree consists only of symlinks pointing to them. One can then move the real files around while keeping the symlinks in place as cheap references.
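To make the symlink layout concrete, here is a minimal sketch that mimics it with plain symlinks, so git-annex itself is not needed. The `store` and `work` directories and the file names are made up for illustration; git-annex uses its own object paths, as the tree listing further down shows.

```sh
# Illustration only: mimic the git-annex layout with plain symlinks.
# All names here (store, work, object-1, brush.abr) are hypothetical.
set -eu
demo=$(mktemp -d)
mkdir -p "$demo/store" "$demo/work"

printf 'brush data\n' > "$demo/store/object-1"    # the "real" file
ln -s ../store/object-1 "$demo/work/brush.abr"    # the working-tree entry

# The working tree holds a symlink, not a regular file ...
link_target=$(readlink "$demo/work/brush.abr")
# ... but reading through the symlink yields the stored content.
content=$(cat "$demo/work/brush.abr")

echo "$link_target"
echo "$content"
```

A tool that does not follow symlinks only sees the link itself, which is why following symlinks matters for the duplicate search described here.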

Here is an example script as an MWE:

# Go to a sandbox folder
mkdir /tmp/test-rmlint
cd /tmp/test-rmlint
# Create basic structure
mkdir annex direct direct/plain
# Create some test files
for i in 1 2 3; do dd if=/dev/urandom of=direct/plain/file$i bs=1M count=1; done
cp direct/plain/* direct
# Init git annex folder
cd annex
git init .
git annex init
# Copy files to annex
cp -r ../direct/* .
git annex add .
git commit -m 'Initial commit'
# Leave annex folder
cd ..

In my case, I get the following file structure (the symlink targets will have different names due to the random content):

$ tree
.
├── annex
│   ├── file1 -> .git/annex/objects/0f/w0/SHA256E-s1048576--60b5d64d265abc0d335821cb41ffdf65b7876b19529c9ae643c52e1b4e5e4f42/SHA256E-s1048576--60b5d64d265abc0d335821cb41ffdf65b7876b19529c9ae643c52e1b4e5e4f42
│   ├── file2 -> .git/annex/objects/16/Kk/SHA256E-s1048576--f2c0e99f011039ae16380d40b44586161cc12cc79607cf41afe1d53af76a0260/SHA256E-s1048576--f2c0e99f011039ae16380d40b44586161cc12cc79607cf41afe1d53af76a0260
│   ├── file3 -> .git/annex/objects/jx/14/SHA256E-s1048576--2b658ba9c167d4c7067a7d097de14155ed3b37667f4c5f4dad74521105c6c90b/SHA256E-s1048576--2b658ba9c167d4c7067a7d097de14155ed3b37667f4c5f4dad74521105c6c90b
│   └── plain
│       ├── file1 -> ../.git/annex/objects/0f/w0/SHA256E-s1048576--60b5d64d265abc0d335821cb41ffdf65b7876b19529c9ae643c52e1b4e5e4f42/SHA256E-s1048576--60b5d64d265abc0d335821cb41ffdf65b7876b19529c9ae643c52e1b4e5e4f42
│       ├── file2 -> ../.git/annex/objects/16/Kk/SHA256E-s1048576--f2c0e99f011039ae16380d40b44586161cc12cc79607cf41afe1d53af76a0260/SHA256E-s1048576--f2c0e99f011039ae16380d40b44586161cc12cc79607cf41afe1d53af76a0260
│       └── file3 -> ../.git/annex/objects/jx/14/SHA256E-s1048576--2b658ba9c167d4c7067a7d097de14155ed3b37667f4c5f4dad74521105c6c90b/SHA256E-s1048576--2b658ba9c167d4c7067a7d097de14155ed3b37667f4c5f4dad74521105c6c90b
└── direct
    ├── file1
    ├── file2
    ├── file3
    └── plain
        ├── file1
        ├── file2
        └── file3

4 directories, 12 files

Now, you can issue

$ LANG=C rmlint -kmDRf -T minimaldirs direct // annex
==> Note: Please use the saved script below for removal, not the above output.
==> In total 73 files, whereof 3 are duplicates in 0 groups.
==> This equals 6.00 MB of duplicates which could be removed.
==> Scanning took in total 10.309s.

Wrote a sh file to: /tmp/annex/new/rmlint.sh
Wrote a json file to: /tmp/annex/new/rmlint.json

Regarding the flags:

The resulting `rmlint.json` is here as well.

```json
[
  {
    "description": "rmlint json-dump of lint files",
    "cwd": "/tmp/annex/new/",
    "args": "rmlint -kmDRf -T minimaldirs direct // annex",
    "version": "2.10.1",
    "rev": "unknown",
    "progress": 0,
    "checksum_type": "blake2b",
    "merge_directories": true
  },
  {
    "id": 1118996097,
    "type": "part_of_directory",
    "progress": 100,
    "checksum": "b852157b677b413966056e4b9650702789c9a0fd81138866d0e082097d1df43f170ccc4ef58d35f8e8c84cf006771692932b9b533d10ae92989d3c1b20c73a21",
    "path": "/tmp/annex/new/annex/.git/annex/objects/16/Kk/SHA256E-s1048576--f2c0e99f011039ae16380d40b44586161cc12cc79607cf41afe1d53af76a0260/SHA256E-s1048576--f2c0e99f011039ae16380d40b44586161cc12cc79607cf41afe1d53af76a0260",
    "size": 1048576,
    "depth": 7,
    "inode": 87000,
    "disk_id": 39,
    "is_original": true,
    "parent_path": "/tmp/annex/new/annex/.git/annex/objects/16/Kk/SHA256E-s1048576--f2c0e99f011039ae16380d40b44586161cc12cc79607cf41afe1d53af76a0260",
    "mtime": 1647582940.348223
  },
  {
    "id": 1686498469,
    "type": "part_of_directory",
    "progress": 100,
    "checksum": "b204885ea133974bf8d910a59d559f3c5bf62df34bf9b11475f92a150604171f7a2f765d54aead2b5ca88a8bdb6aea5caa2398df2f3394498a0a4c2ede75ebea",
    "path": "/tmp/annex/new/annex/.git/annex/objects/0f/w0/SHA256E-s1048576--60b5d64d265abc0d335821cb41ffdf65b7876b19529c9ae643c52e1b4e5e4f42/SHA256E-s1048576--60b5d64d265abc0d335821cb41ffdf65b7876b19529c9ae643c52e1b4e5e4f42",
    "size": 1048576,
    "depth": 7,
    "inode": 86999,
    "disk_id": 39,
    "is_original": true,
    "parent_path": "/tmp/annex/new/annex/.git/annex/objects/0f/w0/SHA256E-s1048576--60b5d64d265abc0d335821cb41ffdf65b7876b19529c9ae643c52e1b4e5e4f42",
    "mtime": 1647582940.3448896
  },
  {
    "id": 4152184763,
    "type": "part_of_directory",
    "progress": 100,
    "checksum": "b8f6a9e5307be15f998f898be7c6b1dceef1cb6ce0dc27e61175b6858997224b17df2e4130ccb32a4bd509ffa6079b489a9c6e48346772fd4becb59a0b18d1fb",
    "path": "/tmp/annex/new/annex/.git/annex/objects/jx/14/SHA256E-s1048576--2b658ba9c167d4c7067a7d097de14155ed3b37667f4c5f4dad74521105c6c90b/SHA256E-s1048576--2b658ba9c167d4c7067a7d097de14155ed3b37667f4c5f4dad74521105c6c90b",
    "size": 1048576,
    "depth": 7,
    "inode": 87001,
    "disk_id": 39,
    "is_original": true,
    "parent_path": "/tmp/annex/new/annex/.git/annex/objects/jx/14/SHA256E-s1048576--2b658ba9c167d4c7067a7d097de14155ed3b37667f4c5f4dad74521105c6c90b",
    "mtime": 1647582940.348223
  },
  {
    "aborted": false,
    "progress": 100,
    "total_files": 73,
    "ignored_files": 0,
    "ignored_folders": 0,
    "duplicates": 3,
    "duplicate_sets": 0,
    "total_lint_size": 6291456
  }
]
```

I expected rmlint to traverse the folders and detect that annex/plain and direct/plain each contain three entries, file1 through file3, with equal content after dereferencing the symlinks (-f). Thus, at least these two folders should be marked as duplicates. The top-level folders annex and direct should be considered duplicates as well, since all hidden files are excluded.
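The expectation can be checked independently of rmlint with a small sketch like the one below. `dir_digest` is a hypothetical helper, not part of rmlint, and it is not how rmlint matches directories: it simply hashes every visible file after following symlinks, together with its relative path, and condenses the result into one digest per directory. It assumes GNU findutils and coreutils (`find -L`, `sort -z`, `sha256sum`).

```sh
# Hypothetical helper: one digest over a directory's visible contents,
# with symlinks dereferenced. Not rmlint's algorithm, only a cross-check.
dir_digest() {
    (
        cd "$1" || exit 1
        # -L follows symlinks; hidden files and directories (such as .git)
        # are pruned, so they never contribute to the digest.
        find -L . -name '.?*' -prune -o -type f -print0 |
            sort -z |
            xargs -0 sha256sum |
            sha256sum |
            cut -d' ' -f1
    )
}
```

Run from the sandbox folder, `[ "$(dir_digest direct)" = "$(dir_digest annex)" ] && echo same` should print `same` if the two trees really hold the same visible content. Because relative file names are part of the digest, this check is stricter than a pure content comparison.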