qarmin / czkawka

Multi functional app to find duplicates, empty folders, similar images etc.

[Feature Request] Identify duplicated directories #976

Open · GwenTheKween opened this issue 1 year ago

GwenTheKween commented 1 year ago

Imagine the following directory structure:

A:
  B:
    1; 2; 3;
  C:
    B:
      1; 2; 3;
    4; 5;
  D:
    B:
      1; 2; 3;
    4; 5;

Where letters represent directories and numbers represent files.

It would be nice if czkawka had an option to tell me that A/B and A/C/B are exactly the same directory. Identifying that A/C and A/D are also identical would be nice to have, but is not required. I think this would be best served as a separate search mode rather than as part of the duplicate-file search, because it could be quite costly and a bit confusing, but I leave that up to you to decide.

Some background: in the past I made multiple backups of my computer and ended up with duplicated directories (especially images I didn't know I had already backed up), and manually working out that I don't want all of the hundreds of pictures in a folder I accidentally backed up twice is tedious.

I am working around this using the terminal, but I imagine that regular users might have trouble with that workaround.
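One workaround of that sort, just as a sketch (not necessarily the exact commands used): a recursive diff between two suspect folders prints nothing when their contents match.

# empty output means A/B and A/C/B contain exactly the same files;
# otherwise diff lists files that differ or exist only on one side
diff -rq A/B A/C/B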

ilario commented 1 year ago

This would also be very useful for me. I often find many duplicated files and delete them one by one, when perhaps the whole directory is a duplicate... @billionai, can you share your shell solution?

I am using this comparison, which just matches folders containing the same filenames with the same sizes. It does not check whether the files' contents are the same. Obviously it is very inefficient (very, very: it calculates the size of the same files again and again and again), and it might not work on your system (the -s option of ls may not be supported everywhere...).

# avoid starting with a non-empty dirhash file
rm -f dirhash
# find lists all the directories and the while loop reads them one by one
# cd enters the folder
# ls lists the content: -1 one entry per line, -a including hidden files,
#   -s showing each file's size (check whether this works on your system),
#   -R recursing into every subdirectory
# xxhsum computes a hash of the output of ls; the XXH3 algorithm (-H3)
#   was the one suggested at https://xxhash.com/#benchmarks
# du -s computes the total size of the folder
while read -r d; do cd "$d"; hash=$(ls -1asR | xxhsum -H3); size=$(du -s); cd "$OLDPWD"; echo $hash $size $d >> dirhash; done <<< $(find ./ -type d)
# sort orders the file by directory size, then by hash
# uniq prints all the duplicates, comparing only the first 38 characters,
#   which cover the hash and the size from du
sort -k5n,5 -k4,4 dirhash | uniq --all-repeated=separate -w 38

Hoping for comments, I also posted this code here: https://unix.stackexchange.com/questions/757224/detect-duplicate-folders-with-identical-content/757225

GwenTheKween commented 1 year ago

I can't remember if I ever actually finished this or not, but I sketched out something here. I used a similar idea to yours, @ilario, but with a few important changes. Instead of xxhsum, I use sha256sum; I think this was the missing piece in your approach, since it hashes the contents of each file.

My naive approach correctly identifies A/B, A/C/B and A/D/B as the same directory, and also identifies A/C and A/D as being the same, but the implementation is very inefficient: it just recursively hashes directories, you have to compare the resulting hashes manually, and you need to call it separately for A, A/B, A/C, A/C/B and so on. I'm sure this can be done better, but I don't know enough bash to do it; I'd probably do it in Python so I could use a dictionary. Here's my dumb code:

➜  ~ cat repeat_dir.sh 
#! /bin/bash

# Recursively hash a directory: hash every regular file with sha256sum,
# recurse into subdirectories, and finally hash the concatenation of all
# those hashes to obtain a single hash for the whole directory.
function recursive_hash {
    local cur=$(pwd)
    cd "$1"
    local new_count=$(expr $2 + 1)
    local all_hashes=""
    for file in *; do
        if test -d "$file"; then
            local hash=$(recursive_hash "$file" $new_count)
        else
            local hash=$(sha256sum "$file" | cut -d' ' -f 1)
        fi
        all_hashes+=$hash
    done
    # go back to where we started ("cd .." is not enough when $1 has several path components)
    cd "$cur"
    final=$(echo $all_hashes | sha256sum | cut -d' ' -f 1)
    echo $final
}

recursive_hash "$1" 0
➜  ~ ./repeat_dir.sh Downloads
bc10aab7f6ff944647a7b1d82b4355752319d8d4d06d03192228605bb716b2ad
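For completeness, a rough shell sketch of that "group directories by hash" idea, using a bash associative array in place of a Python dictionary. It is only a sketch (not what czkawka does) and assumes bash 4+ and GNU coreutils:

#! /bin/bash
# Fingerprint every directory by hashing the sorted list of
# (file hash, relative path) pairs below it, then group directories
# whose fingerprints match. Because paths are taken relative to each
# directory, A/B, A/C/B and A/D/B all get the same fingerprint.
# Note: directories containing no files at all end up grouped together.
declare -A groups
while IFS= read -r d; do
    fp=$( (cd "$d" && find . -type f -exec sha256sum {} + | sort -k2) | sha256sum | cut -d' ' -f1)
    groups[$fp]+="$d"$'\n'
done < <(find . -type d)

# print every fingerprint shared by more than one directory
for fp in "${!groups[@]}"; do
    if [ "$(printf '%s' "${groups[$fp]}" | wc -l)" -gt 1 ]; then
        echo "duplicate directories (fingerprint $fp):"
        printf '%s' "${groups[$fp]}"
        echo
    fi
done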
ilario commented 1 year ago

Thanks for sharing the code! I also started writing one (in Bash) that hashes the contents of the folders; I will share it as soon as it is decent XD

In the meantime I found that rmlint has such a feature, in case a reference implementation is of use: https://rmlint.readthedocs.io/en/master/tutorial.html#finding-duplicate-directories. It basically groups duplicate files, and when the grouped duplicates cover a whole folder (ignoring empty files), it reports the folder as a duplicate.

This approach would fit into Czkawka's duplicate files tab. If a dedicated duplicate-folders tab were created, faster approaches are surely available (e.g. in my bash script I check the size of the folder with du before hashing its contents).
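As a sketch of that pre-filter (again assuming bash 4+, plus GNU du, whose -b option reports apparent size in bytes): group directories by total size first, and only bother hashing groups that contain more than one candidate.

# bucket directories by their total size; only sizes that occur
# more than once are worth the expensive content hashing
declare -A by_size
while IFS= read -r d; do
    s=$(du -sb "$d" | cut -f1)
    by_size[$s]+="$d"$'\n'
done < <(find . -type d)

for s in "${!by_size[@]}"; do
    [ "$(printf '%s' "${by_size[$s]}" | wc -l)" -gt 1 ] || continue
    echo "directories of $s bytes, candidates for hashing:"
    printf '%s' "${by_size[$s]}"
done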

ilario commented 11 months ago

As promised, here is my shell code for detecting the duplicate folders.

https://github.com/ilario/finddirdupes

Obviously, it is much slower than rmlint -D.

sdx23 commented 9 months ago

duplicate issue for reference: #676