jvirkki / dupd

CLI utility to find duplicate files
http://www.virkki.com/dupd
GNU General Public License v3.0

Feature Request: Save hash to Report DB. #25

Open rosyth opened 4 years ago

rosyth commented 4 years ago

Since hashing is already being done, why not save the hash to the report database? This would let me merge, by hash, two separate dupd runs on different external drives.

I can import the sqlite databases into python/pandas (since I'm not familiar with SQL), merge them, and get a new list of possible duplicates, e.g.:

```python
import pandas as pd
import sqlite3

con = sqlite3.connect("dupd.db3")
dupx = pd.read_sql('SELECT * FROM duplicates WHERE each_size > 10000;', con)
```

I've tried to modify the code myself to add hashes, but not having used C for 20 years, it hasn't been very successful.

I suspect it would not be difficult, and possibly quite useful to other users too.
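
For illustration, the merge I have in mind might look roughly like this in pandas (only a sketch: it assumes the report database gains a hash column as requested, and driveA.db / driveB.db are placeholder names for the report databases from two separate dupd runs):

```python
# Rough sketch only: assumes the duplicates table has a 'hash' column, as
# requested above. driveA.db / driveB.db are placeholder file names.
import sqlite3

import pandas as pd


def load_report(db_path):
    # Read the duplicates table from one dupd report database.
    with sqlite3.connect(db_path) as con:
        return pd.read_sql("SELECT * FROM duplicates;", con)


a = load_report("driveA.db")
b = load_report("driveB.db")

# Rows whose hash appears in both reports are candidate cross-drive duplicates.
cross = a.merge(b, on="hash", suffixes=("_driveA", "_driveB"))
print(cross[["hash", "each_size_driveA", "paths_driveA", "paths_driveB"]])
```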

rosyth commented 4 years ago

Well, eventually I've done it with something of a hack job.
It would be nicer if it were done properly by someone who actually knows what they're doing. The diff below is based on the latest release (2.0-dev), where 'dupd_latest/dupd' is the release version and 'dupd' is the modified copy.

```diff
diff -rw dupd_latest/dupd dupd --exclude=tests --exclude=*.git
Only in dupd: build
Only in dupd: dupd
diff -rw '--exclude=tests' '--exclude=*.git' dupd_latest/dupd/src/dbops.c dupd/src/dbops.c
112c112
<                         "each_size INTEGER, paths TEXT)");
---
>                         "each_size INTEGER, paths TEXT, hash TEXT )");
420c420
< void duplicate_to_db(sqlite3 * dbh, int count, uint64_t size, char * paths)
---
> void duplicate_to_db(sqlite3 * dbh, int count, uint64_t size, char * paths, char * hash)
422c422,425
<   const char * sql = "INSERT INTO duplicates (count, each_size, paths) "
---
> 
>   const char * sqly = "INSERT INTO duplicates (count, each_size, paths, hash) "
>                      "VALUES(?, ?, ?, ?)";
>   const char * sqlx = "INSERT INTO duplicates (count, each_size, paths) "
424a428,429
>   int hash_len = strlen(hash);
>   const char * sql = ( hash == 0 ? sqlx : sqly );
440a446,451
> 
>   if( hash != 0 ) {
>     // printf("++++++++++++++ Hash %d -> %s\n", hash_len, hash);
>     rv = sqlite3_bind_text(stmt_duplicate_to_db, 4, hash, -1, SQLITE_STATIC);
>     rvchk(rv, SQLITE_OK, "Can't bind file hash: %s\n", dbh);
>   }
diff -rw '--exclude=tests' '--exclude=*.git' dupd_latest/dupd/src/dbops.h dupd/src/dbops.h
135c135
< void duplicate_to_db(sqlite3 * dbh, int count, uint64_t size, char * paths);
---
> void duplicate_to_db(sqlite3 * dbh, int count, uint64_t size, char * paths, char * hash);
diff -rw '--exclude=tests' '--exclude=*.git' dupd_latest/dupd/src/filecompare.c dupd/src/filecompare.c
76c76
<   duplicate_to_db(dbh, 2, size, paths);
---
>   duplicate_to_db(dbh, 2, size, paths, 0);
diff -rw '--exclude=tests' '--exclude=*.git' dupd_latest/dupd/src/hashlist.c dupd/src/hashlist.c
326a327,332
>   char hash_out[HASH_MAX_BUFSIZE];
>   char * strhash;
>   char * strp ;
>   char * hashp = hash_out;
>   int hsize = hash_get_bufsize(hash_function);
>   
372,373d377
<           int hsize = hash_get_bufsize(hash_function);
<           char hash_out[HASH_MAX_BUFSIZE];
382a387
>         strp = memstring("hash", p->hash, hsize);
389,390c394,395
<       duplicate_to_db(dbh, p->next_index, size, pbi->buf);
< 
---
>       duplicate_to_db(dbh, p->next_index, size, pbi->buf, strp);
>       free(strp);
diff -rw '--exclude=tests' '--exclude=*.git' dupd_latest/dupd/src/refresh.c dupd/src/refresh.c
132c132
<         duplicate_to_db(dbh, new_entry_count, entry_each_size, new_list);
---
>         duplicate_to_db(dbh, new_entry_count, entry_each_size, new_list, 0);
diff -rw '--exclude=tests' '--exclude=*.git' dupd_latest/dupd/src/utils.c dupd/src/utils.c
300a301
> 
307c308
<     printf("%s: ", text);
---
>     printf("%s: %d: ", text, bytes);
314a316,332
> }
> 
> char * memstring(char * text, char * ptr, int bytes)
> {
>   int i;
>   unsigned char * p = (unsigned char *)ptr;
>   int space = ( strlen(ptr)*3 + 2 );
>   char * optr = (char *) malloc((1024) * sizeof(char));
>   char * xptr = optr ;
> 
>   for (i=0; i<bytes; i++) {
>     xptr += sprintf(xptr, "%02x ", *p++);
>   }
>   //printf("\n-----------> memstring >> %s <-------------\n", optr);
>   //memdump(text, ptr, bytes);
>   //printf("~~~~~~~~~~~~\n");
>   return optr;
diff -rw '--exclude=tests' '--exclude=*.git' dupd_latest/dupd/src/utils.h dupd/src/utils.h
239a240,241
> char * memstring(char * text, char * ptr, int bytes);
>
```

So, not much has changed, but then I'm not sure about the hidden side effects (if any).

jvirkki commented 4 years ago

Thanks for using dupd!

Saving the hashes of duplicates is easy enough, but I'm not sure it is useful.

Hashes are computed only for files known to be duplicates (if they can be rejected earlier, the full file is not read so the hash isn't computed).

If you compare the known-duplicate hashes from two different systems, there is no guarantee that this will find any duplicates even if they exist. That's because files which are duplicates across the two systems won't have a hash present unless each of them also has duplicates on its own system. So comparing across systems that way will only match a somewhat random subset of files, if any.

(If the two external drives are mounted on the same system, run dupd with multiple -p options to point at both paths, which will solve that use case.)

In general, finding duplicates across separate systems requires computing hashes for all files. That's easy enough with just find & sha1sum, but it'll be very slow.
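
To make that concrete, the brute-force version is roughly the following Python sketch (the same idea as find & sha1sum; the root path is just a placeholder):

```python
# Rough sketch of the brute-force approach: hash every file under a root and
# print size:hash:path lines. The root path below is a placeholder.
import hashlib
import os

def sha1_of_file(path, chunk_size=1 << 20):
    # Read the file in chunks so large files don't blow up memory.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

root = "/mnt/external"  # placeholder: the tree to hash
for dirpath, _dirnames, filenames in os.walk(root):
    for name in filenames:
        full = os.path.join(dirpath, name)
        try:
            print(os.path.getsize(full), sha1_of_file(full), full, sep=":")
        except OSError:
            pass  # unreadable or vanished file; skip it
```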

priyadarshan commented 4 years ago

If I may butt in, one use case for storing hashes for all files: checking for duplicates on completely separate systems, especially with completely different paths, with the intent of keeping certain subsets on chosen machines (i.e., keeping some parts duplicated and others not).

Admittedly, this is an uncommon use case one would not expect dupd to solve. Still, it is a use case the non-profit I volunteer for has been facing for some time.

rosyth commented 4 years ago

Hi. Yes, of course you are correct that comparing individual dupd runs by hash will only catch duplicates present on both drives. However, I was considering creating a separate file list with xxhash output to compare against the original too (also to pump into pandas). As you say, something like `find . -type f -printf "%s:%p:" -exec xxh64sum {} \; > filelist.xxh64` will do the job; there's no need for cryptographic hashes for this kind of task. You are probably right that my use case is a bit muddled, but that comes from an accumulation of several backup drives and a few disk failures over the last five years that I did nothing about and now want to reorganise (lockdown :-)). Still, apart from a little extra filespace overhead, it doesn't do any harm to save the hash too.
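
Joining two such lists in pandas would go something like this (only a sketch; it assumes file paths contain no ':' characters, and the list file names are placeholders):

```python
# Sketch: join two "size:path:hash  path" lists produced by the find/xxh64sum
# command above. Assumes no ':' inside file paths; file names are placeholders.
import pandas as pd

def load_filelist(list_path):
    rows = []
    with open(list_path) as f:
        for line in f:
            # Each line looks like: SIZE:PATH:HASH  PATH
            size, file_path, rest = line.rstrip("\n").split(":", 2)
            rows.append({"size": int(size), "path": file_path,
                         "hash": rest.split()[0]})
    return pd.DataFrame(rows)

a = load_filelist("driveA.xxh64")
b = load_filelist("driveB.xxh64")

# Files whose content (by xxhash) appears on both drives.
both = a.merge(b, on="hash", suffixes=("_a", "_b"))
print(both[["hash", "size_a", "path_a", "path_b"]])
```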

And thanks for the clarification; it wasn't quite clear to me that the --path switch was cumulative.

jvirkki commented 4 years ago

Bit of trivia: dupd is named as a daemon (ends in 'd') even though it is not, because during initial implementation my plan was for it to be a daemon which coordinates duplicate finding across systems. That turned out to be too slow to be interesting so I focused on the local disk case but didn't change the name.

I'd still love to solve the multiple-systems problem if there is an efficient way that is much better than simply using find | sort | uniq.

@rosyth - dupd currently does save the hash of some files, but only large ones. You could get these from the .dupd_cache db with something like:

`select files.path,hash from files,hashes where hashes.id = files.id;`

There's a performance cost to saving these though, so they're only saved for large files.
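
In pandas terms that would be roughly the following (a sketch only; it assumes the cache database lives at ~/.dupd_cache, so adjust the path if yours is elsewhere):

```python
# Sketch: read the cached (large-file) hashes into pandas. Assumes the cache
# database lives at ~/.dupd_cache; adjust the path if yours is elsewhere.
import os
import sqlite3

import pandas as pd

cache_db = os.path.expanduser("~/.dupd_cache")
with sqlite3.connect(cache_db) as con:
    cached = pd.read_sql(
        "SELECT files.path, hash FROM files, hashes WHERE hashes.id = files.id;",
        con)
print(cached.head())
```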

jvirkki commented 4 years ago

> And thanks for the clarification; it wasn't quite clear to me that the --path switch was cumulative.

The manpage covers this, but if there's anything that the manpage doesn't make clear, please let me know so I can add more clarity.

priyadarshan commented 4 years ago

> I'd still love to solve the multiple-systems problem if there is an efficient way that is much better than simply using find | sort | uniq.

This is inspiring to hear, as it was the same direction I was heading.

Would it be fine to open a new ticket for your consideration, presenting our use case, or shall I clarify here?

jvirkki commented 4 years ago

Feel free to file another ticket with specific use case details.

I'm not entirely convinced it's possible though. Trying to coordinate partial file matches over the network (particularly if more than two systems are involved) would likely introduce so much delay that it's just faster to hash everything and compare later. At that point dupd doesn't add any value since it can be done in a trivial shell script. But I'd love to be proved wrong.

rosyth commented 4 years ago

> And thanks for the clarification; it wasn't quite clear to me that the --path switch was cumulative.

> The manpage covers this, but if there's anything that the manpage doesn't make clear, please let me know so I can add more clarity.

Yes, I see that now, thanks. RTFM always applies.