jvirkki / dupd

CLI utility to find duplicate files
http://www.virkki.com/dupd
GNU General Public License v3.0

Feature request: Incremental updates using a file watcher #40

Open williewillus opened 3 years ago

williewillus commented 3 years ago

It would be nice if dupd could run as some sort of daemon. You give it a DB name and some way of receiving live updates to changes happening in the root, and it automatically and incrementally updates the DB according to those changes. No need to read any data from disk.

This is useful when the dataset is extremely large. I have a 3.3T disk I'm trying to dedupe and running dupd refresh is pretty much as slow as a full rescan.

For receiving live updates, we can either use a premade project like watchman, or roll our own on each platform with inotify, kqueue, etc. (which is what watchman abstracts over).

A caveat is that Watchman only works on the three major OSes, though.
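For a concrete feel of the inotify side, here is a minimal sketch on Linux, using inotifywait from inotify-tools rather than the raw inotify API; /data is a hypothetical scan root and this is not dupd code:

# Illustration only: stream filesystem events from a hypothetical scan root.
inotifywait -m -r -e create,delete,modify,move /data |
while read -r dir events name; do
    # A daemon would translate each event into an incremental db update here.
    echo "change: $events $dir$name"
done

Note that recursive watches (-r) are themselves costly to set up on a huge tree, since a watch has to be added per directory, which is essentially the same cost Watchman pays when it first registers a root.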

jvirkki commented 3 years ago

running dupd refresh is pretty much as slow as a full rescan.

Did you mean dupd refresh or dupd validate?

The refresh operation should be quick; it doesn't do much (see the man page for details). All it does is remove entries from the dupd db for files previously marked as duplicates which have since been deleted from the filesystem. It does need to stat(2) every duplicate file. On a remote filesystem (e.g. NFS) this could be slow, but otherwise it should be way faster than a scan.
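To illustrate what that amounts to, here is a rough sketch of the behavior described above, not the actual dupd implementation; duplicate_paths.txt is a hypothetical stand-in for the duplicate paths recorded in the db:

# Illustration only: stat each previously-recorded duplicate and report the
# ones that no longer exist, i.e. what a refresh would drop from the db.
while read -r path; do
    if ! stat "$path" >/dev/null 2>&1; then
        echo "gone, would be removed from db: $path"
    fi
done < duplicate_paths.txt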

(As an example, on a ~300GB, 4M file data set I have here, where a scan takes ~10 minutes, refresh only takes a couple seconds.)

If you really did mean refresh, I'd be curious to know how long a scan takes vs. how long a refresh takes. Is your data set mostly very tiny files which are majority duplicates?

williewillus commented 3 years ago

Hmm, it might just be the scale of data I'm trying to deal with. I'm trying to clean up an old HDD full of junk. There's about 3T of data and I suspect almost a quarter of it is duplicate.

I didn't time my initial scan, but I had to leave it running overnight to complete. I tried doing a refresh yesterday, but killed it after about an hour.

williewillus commented 3 years ago

Hmm, this might just be hard in general. I tried setting up Watchman on its own and it takes forever just to register notifications for the massive drive.

jvirkki commented 3 years ago

The runtime of the refresh command is linear in the number of known duplicates in the db, as all it does is stat each file. How many duplicates does this set have? This will show the total:

echo "select sum(count) from duplicates" | sqlite3 ~/.dupd_sqlite