pkolaczk / fclones

Efficient Duplicate File Finder
MIT License

Duplicate search between, but not within, two distinct directories? #65

Closed: felciano closed this issue 2 years ago

felciano commented 3 years ago

Is it possible to use fclones to find duplicates between, but not within, two directory trees? Here's an example:

destination/
  2021/
    January/
      A.jpg

source/
  A1.jpg <-- copy of destination/2021/January/A.jpg (also same as A2.jpg)
  A2.jpg <-- copy of destination/2021/January/A.jpg (also same as A1.jpg)  
  B1.jpg <-- same as B2.jpg
  B2.jpg <-- same as B1.jpg

I want to identify A1.jpg and A2.jpg under source as duplicates of A.jpg in destination.

B1.jpg and B2.jpg are also duplicates, but only within source. They should be excluded from the match list because they don't match anything in destination.

FWIW, the use case is a source folder of images that have previously been processed by scripts to rename them and sort them into a destination directory structure (e.g. by year and month, or by other EXIF metadata). Then we come across a new folder of images, some of which may have been processed previously, and we want to know if we can safely delete them because we already have copies in the destination directory.

pkolaczk commented 3 years ago

Unfortunately it is not supported yet. But I think it would be useful.

cyounkins commented 2 years ago

I did something similar where I wanted to filter the groups based on whether they contained certain paths. I was able to use jq, and the following may be helpful. Note that JSON import was added since this work was done. The comment is accurate to the best of my current knowledge, but it's been a while. Hope it helps.

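# Scan all roots and write a duplicate report in JSON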
$ fclones group dir1 dir2 dir3 dir4 --format json > dupes.json

# Include entire groups where only 1 file path has 'dir1' in it
$ jq -s '.[0] | {header: .header, groups: [.groups[] | select([.files[] | contains("dir1") | select(.)] | length == 1)]}' < dupes.json > dupes_filtered.json

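# Re-emit the unfiltered report with the same jq formatting so it can be diffed against the filtered one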
$ jq -s '.[0]' < dupes.json > dupes_sorted.json

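# Review which groups the filter dropped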
$ diff dupes_sorted.json dupes_filtered.json | less

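# Collect the 'dir1' paths from the remaining groups, strip the JSON quotes, and build a deletion list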
$ jq '.groups[].files[]' < dupes_filtered.json | grep 'dir1' | sed 's/"//g' | sort > files_to_delete.txt

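# Delete the listed files (fish shell loop; in bash: while read -r f; do rm "$f"; done < files_to_delete.txt)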
$ while read -l f; echo $f; rm "$f"; end < files_to_delete.txt

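# Clean up directories left empty by the deletions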
$ find . -type d -empty -delete

pkolaczk commented 2 years ago

@felciano @cyounkins here is the PR that should solve it. My only concern is the name of the flag that enables this feature. Maybe you could come up with a better name? Currently it is named -X / --diff-roots.

felciano commented 2 years ago

Thanks @pkolaczk, happy to help. I suspect good naming is important for this one. Do you have a specific / technical write-up of the exact behavior, or a draft of the docs for this feature?

By roots, do you mean the directories passed to fclones itself? For example, in cyounkins' example above, would there be 4 roots (dir1, dir2, dir3, dir4)? If so, I suspect it will help avoid confusion to use the same term (e.g. "root dir") both for this flag and anywhere else in the docs where such directories are referenced.

Assuming you settle on root-dir as the term to use, you could use something like --allow-dups-within-same-root-dir. This is a bit long, but the verb and the descriptive text make it clear what the switch is enabling/disabling. (I'm not sure whether within, under or in is best to indicate the "sits in the directory tree under the root" relationship.)

cyounkins commented 2 years ago

@pkolaczk The commit and proposed flag just affect the group operation, right? Does it output one path, or multiple paths under the root with dupes?

It may be helpful to describe my use case. I have a server that I rsync my laptop to. When I get a new laptop, I make a new directory and am lazy about cleaning up the old rsync directories. I end up with laptop 2012, laptop 2016, laptop 2020, etc., with most of my photos and projects duplicated between them. I wanted to delete any files found in laptop 2012 or laptop 2016 where a copy existed in laptop 2020. I also did not want to dedupe within laptop 2020.

I think the proposed flag could find dupes that occur across laptop 2012 and laptop 2016 that are not in laptop 2020. I would not want to delete those in my example. --keep-path would not help here, because the group would not contain a path with laptop 2020.

bitclick commented 2 years ago

This use case is the main reason why I use these programs.

A few years ago I wrote a Python script to filter the output of fdupes for this use case. It worked very well for me, until I got a FreeNAS where I just dump everything. I have attached it here as an example; please excuse the bad code quality. What I would extend is the ability to collapse groups of files into whole subfolders where possible. That would speed up the decision making considerably.

Maybe I will try my script with fclones... :slightly_smiling_face:

The script filters an fdupes output file and looks for files within a specific directory that can be deleted safely, meaning they have dupes outside that directory. I could scan multiple roots with fdupes once and afterwards look for folders that can be deleted safely, without redoing the duplicate scan for every query. filterdupes.py.gz
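
For reference, the core of that approach can be sketched roughly like this (a sketch only, not the attached filterdupes.py; the function names and argument handling are assumptions):

#!/usr/bin/env python3
# Sketch only, not the attached script.
# fdupes prints each duplicate group as a block of paths separated by a blank line.
import sys

def parse_fdupes(lines):
    # Yield one list of paths per duplicate group.
    group = []
    for line in lines:
        line = line.rstrip("\n")
        if line:
            group.append(line)
        elif group:
            yield group
            group = []
    if group:
        yield group

def deletable(groups, target_dir):
    # Files under target_dir that also have a copy outside it.
    prefix = target_dir.rstrip("/") + "/"
    for group in groups:
        inside = [p for p in group if p.startswith(prefix)]
        outside = [p for p in group if not p.startswith(prefix)]
        if inside and outside:
            yield from inside

if __name__ == "__main__":
    # usage: python3 filter_sketch.py fdupes_output.txt target_dir
    with open(sys.argv[1]) as f:
        for path in deletable(parse_fdupes(f), sys.argv[2]):
            print(path)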

pkolaczk commented 2 years ago

The commit and proposed flag just affect the group operation, right?

Correct. It won't report a group if all of its files belong to the same root directory (a directory given as a single argument to group) - all files under the same root are counted as one.

Does it output one path or multiple paths under the root with dupes?

It still outputs multiple paths. I think it is better to output all paths and let the user decide what to do with them, instead of arbitrarily picking only a subset of the files. It also allows cleaning all copies in the same dir.

Consider this scenario:

dir1/
   a/ 
      foo.txt
   b/
      foo.txt

dir2/
   foo.txt

fclones group -X dir1 dir2 will return a single group of 3 files:

dir1/a/foo.txt
dir1/b/foo.txt
dir2/foo.txt

However, if dir2/foo.txt didn't exist, the result would be empty, because dir1/a/foo.txt and dir1/b/foo.txt would be counted as a single file and therefore not reported as duplicates.

I think the proposed flag could find dupes that occur across laptop 2012 and laptop 2016 that are not in laptop 2020

Correct. It seems you want support for arbitrary expressions to model the relationship between 3 directories (where 2 of them should be treated as a single group). I'm afraid that would make the command-line interface more complex. Internally it is trivial to do: we'd just need one more level of grouping of root directories, so you'd put laptop_2012 and laptop_2016 in a single logical root. But designing a CLI for this that isn't a bunch of special cases is quite challenging - I'm open to ideas.
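
As a rough illustration of that extra level of grouping (a conceptual sketch only, not fclones code; the LOGICAL_ROOTS mapping is an assumed user choice):

# Sketch of the idea, not fclones internals: each root maps to a logical root,
# and a group is reported only if its files span more than one logical root.
LOGICAL_ROOTS = {
    "laptop_2012": "old",
    "laptop_2016": "old",
    "laptop_2020": "new",
}

def logical_root(path):
    top = path.split("/", 1)[0]
    return LOGICAL_ROOTS.get(top, top)

def reportable(group):
    return len({logical_root(p) for p in group}) > 1

# A copy shared only by the two old backups would not be reported:
assert not reportable(["laptop_2012/a.jpg", "laptop_2016/a.jpg"])
# A copy that also exists in laptop_2020 would be:
assert reportable(["laptop_2012/b.jpg", "laptop_2020/b.jpg"])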

The feature implemented here can do it, but you have to invoke fclones twice. To clean the files from 2012 you'd do:

fclones group -X laptop_2012 laptop_2020 >dupes_2012.txt
fclones remove <dupes_2012.txt --path 'laptop_2012/**'

And then use a similar recipe to clean the files from 2016:

fclones group -X laptop_2016 laptop_2020 >dupes_2016.txt
fclones remove <dupes_2016.txt --path 'laptop_2016/**'

Is it ok for now?

cyounkins commented 2 years ago

@pkolaczk it was very kind of you to examine my use case. Your solution is a correct reduction of the problem and solves it well.

It seems you want support for arbitrary expressions

No, I agree such complexity is not necessary.

Overall this looks great, thanks!

felciano commented 11 months ago

@pkolaczk is this enhancement in production? I am running 0.32.1 and getting an "unexpected argument" error for the -X flag.

felciano commented 11 months ago

Never mind. I just found it in the 2021 releases. In case anyone else reads this, the flag is now -I instead of -X.