sahib / rmlint

Extremely fast tool to remove duplicates and other lint from your filesystem
http://rmlint.rtfd.org
GNU General Public License v3.0
1.86k stars 128 forks

Finding duplicate symlinks not possible when in different paths #589

Open christianlupus opened 1 year ago

christianlupus commented 1 year ago

Currently, I am running into a problem with rmlint 2.10.1 when invoking it in a git annex folder.

The problem is that git annex (simply put) uses relative symlinks in the main folder that point into a storage folder. rmlint seems to have no option to handle these relative symlinks:

Let me give you an example. Here is a script that recreates a test directory in the current directory and builds a structure similar to git annex, but without any hidden-file magic.

#!/bin/bash

# Create empty test structure
rm -rf test
mkdir test
cd test
# Prepare storage folder
mkdir storage
for i in 1 2 3; do dd if=/dev/urandom of=storage/$i bs=1k count=1; done
# Prepare folder a and b
mkdir a b
ln -sr storage/* a
ln -sr storage/* b
# Prepare third folder
mkdir c c/d
ln -sr storage/* c/d
# Output structure
tree

# Test case 1
echo "Case 1"
rmlint -U -o pretty:stdout -o summary:stdout -o json:case1.json a b

# Test case 2
echo "Case 2"
rmlint -U -o pretty:stdout -o summary:stdout -o json:case2.json a c

Test case 1 works as expected. One can add further options to rmlint, but that is not the problem here.

Test case 2 fails. The symlinks sit at different depths in the folder structure, so one link contains ../storage and the other ../../storage as its relative target.

You can reproduce the same story with git annex, but I wanted a minimal example that needs no external tools to test. The crucial point is that the duplicate symlinks must sit at different depths.
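To make the mismatch concrete, here is a minimal sketch (a hypothetical layout under a temporary directory, mirroring the script above) showing that the raw link targets differ by depth while the resolved targets are identical:

```shell
#!/bin/bash
# Sketch: the raw targets (readlink) differ by depth,
# while the resolved targets (realpath) are identical.
set -e
tmp=$(mktemp -d)
cd "$tmp"
mkdir -p storage a c/d
echo data > storage/1
ln -sr storage/1 a/1      # raw target: ../storage/1
ln -sr storage/1 c/d/1    # raw target: ../../storage/1
readlink a/1 c/d/1        # prints the two different raw targets
realpath a/1 c/d/1        # both lines print the same resolved path
```

A tool comparing links by readlink sees two different strings here; one comparing by realpath sees the same file.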

cebtenzzre commented 1 year ago

I am familiar with git-annex. Could you explain your actual use case in a little more detail? Data stored in git-annex is already deduplicated, of course. I assume you're not concerned with disk usage, but with finding pointer files in the annex that point to the same location.

I did recently consider matching links by realpath (resolved absolute target) and not readlink (raw target), but the current -@/--see-symlinks behavior is intentional - mainly because it allows directories with internal symlinks (e.g. dir/link -> dir/foo) to match with -D. It exists because of -D (2364c7b24025b74f55c6d268b3c14179eb989ee6) so I'd rather not make that change.
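To illustrate why matching by raw target is useful for -D, here is a hypothetical minimal layout: two directories whose internal symlinks use the same relative target match each other by readlink, even though their resolved targets differ:

```shell
#!/bin/bash
# Sketch: dir1 and dir2 each contain an internal symlink link -> foo.
# Compared by readlink (raw target) the links are equal, so the two
# directories can still be treated as duplicates of each other;
# compared by realpath they differ and that match would break.
set -e
tmp=$(mktemp -d)
cd "$tmp"
mkdir dir1 dir2
echo same > dir1/foo
echo same > dir2/foo
ln -s foo dir1/link
ln -s foo dir2/link
readlink dir1/link dir2/link   # both print: foo
realpath dir1/link dir2/link   # .../dir1/foo vs .../dir2/foo
```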

--followlinks could be adapted to be useful in your situation, by referencing the path of the symlink instead of its target in the output. I implemented the basic idea on my fork two years ago, and I have considered something similar (more like the hardlink code) as a fix for #586. Currently --followlinks causes problems with -D, because -f is used to output paths it wouldn't have traversed otherwise, but treemerge doesn't expect to see them. This change would bring -f for individual files in line with how treemerge already uses it.

However, that change to --followlinks would cause symlinks to be dupes of their targets in the output, which would not be useful for freeing up disk space, so by default it would probably mark symlinks as originals (like --keep-hardlinked).

christianlupus commented 1 year ago

I see that changing the default behavior is not the best option. I am with you there.

To give you an idea of my use case: I have a (big) collection of personal images. It has grown over time and contains duplicates etc. I am now considering two things:

  1. Move to git annex to deduplicate for storage efficiency.
  2. Clean up the folder structure to be internally consistent. The latter comes from the fact that I copied images between different USB storage media and the main HDD. Thus, there are partially duplicate directories in that big archive, but with modified names, missing files, etc. This most probably needs manual cleanup. rmlint can help by finding folders that contain duplicates, which I can then inspect manually.

I can now start with either rmlint or git annex. The thing is: when starting with rmlint, I am working on the only copy of my data unless I make a complete backup first. That is manageable but requires some effort. I tried it once and had to start over, because I lost track of the state and was overrun by the vast number of duplicates.

I thought it might be favorable to be able to run rmlint even after the files have been annexed. For example, if I find a folder that seems to be a duplicate, I can rerun rmlint and inspect the results. I wanted to test this before annexing everything and then finding it impossible to clean up my mess.

Does this make it clear?

cebtenzzre commented 1 year ago

@christianlupus I finished implementing a new version of --followlinks that you could try. Would you like me to base the patches on master or develop? I'm working on top of a version of master that has other changes which I'm not ready to release.

christianlupus commented 1 year ago

Hello, @Cebtenzzre. This is great news for me. I am using Arch Linux as well, so I can build via the AUR without issue (hopefully). If you give me a branch name (and repo URL), I will build that version and test whether it works as expected for me. Something more or less stable would be nice though :wink:.

cebtenzzre commented 1 year ago

Sorry about the delay - this feature ended up being a little more work than I would have guessed, and it took some effort to get the commits to apply and the tests to pass on current master, because these changes were originally written on top of about 300 commits that I don't think are ready to publish yet. 17 of those commits are included out of convenience - I haven't taken the time to assess which parts of them really need to be on the feature branch. Check out the new-followlinks-rebase branch. I haven't done extensive testing, but at least the testsuite passes. It probably won't be merged into a main branch until I get some kind of bugfix release out for 2.10.

christianlupus commented 1 year ago

Hello once more. Sorry for the delay; I had some issues testing the branch. Now I am pretty sure that it does not work as expected (by me, at least). With the updated branch, I should be able to use the --followlinks flag, and rmlint should detect symlinks pointing at the same target as clones and remove all but one, am I right?

I tried it both with the MWE from above and with a freshly initialized (minimalistic) git annex repo. If I do not pass any symlink-related flags, it works as long as the symlinks have the same depth. If I add the --followlinks flag, no duplicates are found anymore.

Is there anything else I can test or help with to get this fixed? Sorry again for the late answer.

cebtenzzre commented 1 year ago

Sorry, I described this in the commit message but not in this thread - you need --no-keep-symlinks for it to remove symlinks and not just consider them as originals to match against.
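For reference, on the MWE from the first comment the invocation on the feature branch would then look something like this (an untested sketch; both flags are taken from this thread and only exist on the new-followlinks-rebase branch):

```shell
# Sketch, assuming a build of the new-followlinks-rebase branch:
# --followlinks matches symlinks by their resolved target, and
# --no-keep-symlinks allows rmlint to mark symlinks for removal
# instead of only treating them as originals to match against.
rmlint --followlinks --no-keep-symlinks -o pretty:stdout a c
```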

christianlupus commented 1 year ago

Ahh, sorry, I did not catch that in the commit messages. I tested again and now I think I got it working. In a short test it seemed to work, and I have not seen any issues so far. But, as I said, that was not thorough testing and no real-world example yet; that part is still outstanding. It will take some time, as I will have to reevaluate my git-annex repos and do some manual checking that it works as expected.