qarmin / czkawka

Multi functional app to find duplicates, empty folders, similar images etc.

Find "Duplicate folders" with exact same file content and "Similar folders" with most (some %) of files being same #1182

Open avibathula opened 5 months ago

avibathula commented 5 months ago

Find duplicate folders

Many folks just make copies of their important folders such as Desktop, Downloads, Documents, Pictures, and Music, which means a good portion of the content remains unchanged.

Given that czkawka already finds duplicate files, I wonder if it could identify whether two folders contain exactly the same files, or at least some percentage of the same files. If the logic were applied recursively, it could identify the topmost parent folders of two identical or similar folder trees. Paired with the ability to sort by largest size, this would be immensely helpful for tackling duplicates that take up too much storage space.
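A minimal sketch of the exact-match half of this request, assuming a Merkle-style approach in which a folder's digest is derived from the digests of its children. This is not how czkawka works today; the function names (`file_digest`, `folder_digests`, `duplicate_folders`) are hypothetical, and file names are deliberately ignored so that only content counts.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's bytes, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def folder_digests(root):
    """Map every directory under root to (digest, total_size_in_bytes).

    A directory's digest is the hash of the sorted digests of its
    children, so two trees with the same content get the same digest.
    """
    result = {}
    # topdown=False visits subdirectories before their parents.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        child_digests, total = [], 0
        for name in filenames:
            path = os.path.join(dirpath, name)
            child_digests.append("f:" + file_digest(path))
            total += os.path.getsize(path)
        for name in dirnames:
            sub = os.path.join(dirpath, name)
            if sub in result:  # symlinked directories are not walked, so skip them
                digest, size = result[sub]
                child_digests.append("d:" + digest)
                total += size
        h = hashlib.sha256("\n".join(sorted(child_digests)).encode())
        result[dirpath] = (h.hexdigest(), total)
    return result


def duplicate_folders(root):
    """Return [(size, [paths...])] for identical non-empty folders, largest first."""
    groups = defaultdict(list)
    for path, (digest, size) in folder_digests(root).items():
        groups[(digest, size)].append(path)
    dups = [(size, sorted(paths))
            for (_, size), paths in groups.items()
            if len(paths) > 1 and size > 0]
    return sorted(dups, reverse=True)
```

Because the result is sorted by size, the outermost duplicate trees come first; their identical subfolders also appear further down the list and could be filtered out in a later pass.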

Krzysiu commented 5 months ago

I like that. I recently had to write a Python script to do that. Alas, I had no idea how to implement a "close, but not exact copies" hashing system.

avibathula commented 5 months ago

There are hashing techniques that can not only tell whether two pieces of data are identical, but also provide a measure of how much they differ from each other.

Examples: SimHash (similarity hashing), MinHash, and Jaccard similarity (a mathematical measure that quantifies the similarity between two sets or lists of elements).
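To make two of these concrete, here is a small sketch of exact Jaccard similarity and a MinHash estimate of it. It assumes the items are strings (for a folder, these could be per-file content hashes) and uses salted BLAKE2b from Python's `hashlib` as the family of hash functions; the function names and the choice of 128 hashes are illustrative assumptions, not anything czkawka provides.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard similarity: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over the set.

    items must be a non-empty collection of strings.
    """
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree; approximates Jaccard."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

The signature has a fixed length regardless of how many files a folder holds, so many folders can be compared pairwise without re-reading their contents. The estimate gets tighter as `num_hashes` grows.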

Krzysiu commented 5 months ago

Yeah, I know, thanks, but my problem was implementing it specifically for directories. I think I'd have to perceptually hash the contents of both directories and then compare them somehow, keeping in mind that there are elements that don't fit the set.
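One possible way to handle this for directories, sketched under the assumption that plain content hashes (not perceptual ones) are enough: index each tree as a multiset of `(sha256, size)` keys, score the overlap by bytes, and report the files that have no counterpart on the other side. The names `content_index` and `compare_folders` are hypothetical.

```python
import hashlib
import os
from collections import Counter


def content_index(root):
    """Return (Counter of (sha256, size) keys, {key: [relative paths]})."""
    counts = Counter()
    paths = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            with open(full, "rb") as f:
                # Whole-file read keeps the sketch short; chunk it for large files.
                digest = hashlib.sha256(f.read()).hexdigest()
            key = (digest, os.path.getsize(full))
            counts[key] += 1
            paths.setdefault(key, []).append(os.path.relpath(full, root))
    return counts, paths


def compare_folders(left, right):
    """Return (byte-weighted similarity in [0, 1], only_left, only_right)."""
    left_counts, left_paths = content_index(left)
    right_counts, right_paths = content_index(right)

    def weight(counter):
        return sum(size * n for (_digest, size), n in counter.items())

    shared = left_counts & right_counts  # multiset intersection (minimum counts)
    union = left_counts | right_counts   # multiset union (maximum counts)
    total = weight(union)
    similarity = weight(shared) / total if total else 1.0

    # Files whose content does not occur anywhere in the other tree.
    only_left = sorted(p for k in left_counts if k not in right_counts
                       for p in left_paths[k])
    only_right = sorted(p for k in right_counts if k not in left_counts
                        for p in right_paths[k])
    return similarity, only_left, only_right
```

Weighting by bytes means a renamed or moved file still counts as shared, and a few small stray files barely lower the score; weighting by file count instead would be an equally reasonable choice.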

wwcanoer commented 2 months ago

Your request is implemented in the "Similar Folders" tab of Jam Software's SpaceObServer. It lists pairs of folders and their "% similar". When you click a pair, the two directories are compared in the bottom pane, similar to Beyond Compare, with different colors depending on whether the files are identical or have the same name with a different size, date, or MD5.

Unfortunately, it is expensive (over $200), since it is designed for corporate servers, and it is excruciatingly slow. There's a 30-day trial, which I have installed, and I have now been waiting a couple of days for results on 1.7 TB of data that has a lot of duplicates.

Also, it can only look at one drive; it can't compare drives. When I tried it before, my duplicate files were spread across many backup drives, so it was not useful. Now I have consolidated all similar folders on one drive, and I'm waiting to see whether it will be worth the wait and let me deduplicate that drive faster than Duplicate Cleaner.

wwcanoer commented 2 months ago

Duplicate Cleaner has a good Duplicate Folders feature. It will identify duplicate folders that are several layers deep. A duplicate folder has a number of duplicate files but may have additional non-duplicate folders. When a folder is selected, the right pane shows its contents, but I think only the duplicate files, not the non-duplicates, so it is not ideal for choosing which to keep.

The default sorts the largest folder first, so if you select one of those for deletion, it will automatically select every instance of its subfolders in the long duplicate folders list.

It works great when there are only a few pages of duplicate folders, but when I get hundreds of pages, it is tough to decide which to delete. I need to use other programs, like Beyond Compare, to actually compare the file trees and see which one I want to keep or delete.

To compare folders of the same name, I search for a name (e.g. My Documents) or substring in Everything (the search tool), then right-click, copy that list, and paste it into Duplicate Cleaner, which then finds duplicates in all of those folders at once.

I have periodically searched for good duplicate or similar folders software but Duplicate Cleaner and SpaceObServer are the only two real options that I have found so far.

I use Everything, WinCatalog, Duplicate Cleaner, Beyond Compare, XYplorer, TreeSize Free, and periodically the SpaceObServer trial to dedup. I also use an Excel VBA program to find similar folders on diverse drives and move them to one drive. For example, it finds every "My Documents" folder and moves it to a folder whose name is the parent path concatenated into a single string, so that I know where it came from. I then have a list of folders like "Backup Drive 03 - Backup 2010-01-01 - Drive C - My Documents", all on one NVMe drive, so that I can dedup and consolidate them in one place with SpaceObServer (versus having many slow USB backup drives connected at the same time).
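The naming step can be sketched in a few lines of Python. This is not the VBA program itself; `flattened_name` and its parameters are hypothetical, and the sketch only builds the destination name without moving anything.

```python
from pathlib import PureWindowsPath


def flattened_name(drive_label, folder, sep=" - "):
    """Join a drive label and every path component into one folder name."""
    path = PureWindowsPath(folder)
    # Drop the anchor (e.g. "E:\\") so only real folder names remain.
    parts = path.parts[1:] if path.anchor else path.parts
    return sep.join([drive_label, *parts])


name = flattened_name("Backup Drive 03",
                      r"E:\Backup 2010-01-01\Drive C\My Documents")
print(name)  # Backup Drive 03 - Backup 2010-01-01 - Drive C - My Documents
```

The actual move could then be done with `shutil.move(source, os.path.join(target_root, name))`, ideally after checking that the destination does not already exist.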

m-stefanski commented 3 weeks ago

This would be extremely useful for my use case - several snapshots of huge directories that were later modified independently and now have to be deduplicated / reconciled.

Kindly consider adding this to the roadmap.

m-stefanski commented 3 weeks ago

> I like that. I recently had to write a Python script to do that. Alas, I had no idea how to implement a "close, but not exact copies" hashing system.

@Krzysiu would you mind sharing it? It would help me tremendously, as I would rather not reinvent the wheel. I would much rather ~~steal~~ fork it from you.