hydrusvideodeduplicator / hydrus-video-deduplicator

Video Deduplicator for the Hydrus Network
https://hydrusvideodeduplicator.github.io/hydrus-video-deduplicator/
MIT License

Support Deferred Sending of Duplicate Results #34

Open prof-m opened 1 year ago

prof-m commented 1 year ago

PR for Issue #33

I went for a clean but fairly MVP/"beta feature" approach here. Specifically, there aren't actually any new CLI arguments for using the potential dupes queue in this PR. It's all done through an environment variable, so your average user would have to look at the code to take advantage of it. I figure we can add CLI args later if there's sufficient interest, but for now it stays a secret menu option for some folks to test drive (I'll post about it on the Discord if/when we merge this PR).
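For anyone curious, the toggle works roughly like this sketch. The variable name `DEDUP_USE_PDQ` here is hypothetical; check the diff for the real one:

```python
import os

def should_use_potential_dupes_queue() -> bool:
    """Feature flag read from the environment; there's no CLI arg yet.
    'DEDUP_USE_PDQ' is an illustrative name, not necessarily the real one."""
    return os.environ.get("DEDUP_USE_PDQ", "false").lower() in ("1", "true", "yes")
```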

As we talked about a while ago, I implemented a separate 'duplicates queue' class that handles most of the work, with as few changes to the core code in dedup.py as possible. The dupe queue object stores potential duplicates in memory until the queue reaches a predetermined size. When it hits that size, the object flushes the queue to disk via a separate database connection. After all videos have been searched for duplicates, the queue tries sending its stored dupes to the hydrus client (in batches the same size as were flushed to the db, for simplicity). If that fails (most likely because the client is unavailable), it saves any unsent duplicates back to the database for retrieval on future runs.
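A minimal sketch of that lifecycle, assuming SQLite storage; the class and method names (`PotentialDupesQueue`, `set_potential_duplicates`) are placeholders for illustration, not the actual PR code or client API:

```python
import sqlite3

class PotentialDupesQueue:
    """Illustrative sketch: buffer pairs in memory, flush to disk at a
    threshold, then batch-send after the search finishes."""

    def __init__(self, db_path: str, flush_size: int = 256):
        self.flush_size = flush_size
        self.pending: list[tuple[str, str]] = []  # (hash_a, hash_b) pairs
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS potential_dupes (hash_a TEXT, hash_b TEXT)"
        )

    def enqueue(self, hash_a: str, hash_b: str) -> None:
        self.pending.append((hash_a, hash_b))
        if len(self.pending) >= self.flush_size:
            self.flush()

    def flush(self) -> None:
        """Persist the in-memory queue so a crash can't lose relationships."""
        self.db.executemany("INSERT INTO potential_dupes VALUES (?, ?)", self.pending)
        self.db.commit()
        self.pending.clear()

    def send_all(self, client) -> None:
        """Flush any stragglers, then send in flush-sized batches.
        Rows are deleted only after the client call succeeds, so anything
        unsent stays in the DB for a future run."""
        self.flush()
        while True:
            rows = self.db.execute(
                "SELECT rowid, hash_a, hash_b FROM potential_dupes LIMIT ?",
                (self.flush_size,),
            ).fetchall()
            if not rows:
                break
            try:
                client.set_potential_duplicates([(a, b) for _, a, b in rows])
            except ConnectionError:
                return  # client unavailable; unsent rows remain on disk
            self.db.executemany(
                "DELETE FROM potential_dupes WHERE rowid = ?",
                [(rowid,) for rowid, _, _ in rows],
            )
            self.db.commit()
```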

All of the interactions between the in-memory queue, the database, and the hydrus client are written to be as safe as possible, specifically to reduce the risk that potential duplicates are lost without being sent. The code errs on the side of keeping relationships until it's sure they've been sent successfully, taking advantage of the fact that Hydrus silently ignores API attempts to mark two files as potential duplicates if those files already have a recorded relationship.
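The key ordering is "send first, delete on success" — a failure means the same pairs get re-sent on a later run, which is harmless because Hydrus ignores resubmissions. A sketch of that invariant (table layout and the client method are illustrative, not the real code):

```python
def send_batch_safely(db, client, rows):
    """rows: (rowid, hash_a, hash_b) tuples already persisted to disk.
    If the send raises, nothing is deleted; the pairs survive for a
    retry, and Hydrus tolerates the eventual duplicate submission."""
    client.set_potential_duplicates([(a, b) for _, a, b in rows])  # may raise
    db.executemany(
        "DELETE FROM potential_dupes WHERE rowid = ?",
        [(rowid,) for rowid, _, _ in rows],
    )
    db.commit()
```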

Testing this code was pretty complicated because of the parallel processing we do for finding dupes. But what I ultimately came up with allows for entirely atomic collection and storage of potential relationships, so no thread will ever be waiting on another thread for reasons related to the potential dupes queue (PDQ).
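One way to get that atomicity (purely illustrative, not the PR's actual implementation): hold a lock only for the append-and-swap, and do the slow disk write outside it, so workers never wait on I/O:

```python
import threading

class AtomicQueue:
    """Sketch: collect-and-maybe-flush is one atomic step under a lock;
    the expensive flush itself happens outside the critical section."""

    def __init__(self, flush_size: int = 256):
        self._lock = threading.Lock()
        self._pending: list[tuple[str, str]] = []
        self.flush_size = flush_size
        self.flushed: list[list[tuple[str, str]]] = []  # stand-in for disk writes

    def enqueue(self, pair: tuple[str, str]) -> None:
        to_flush = None
        with self._lock:  # atomic: append + size check + buffer swap
            self._pending.append(pair)
            if len(self._pending) >= self.flush_size:
                to_flush, self._pending = self._pending, []
        if to_flush is not None:
            self.flushed.append(to_flush)  # "disk write" outside the lock
```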

Also, since we were passing around the relationship object a lot more, I formalized it into a proper Relationship class that acts just like a dict, but with typing and default values.
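Conceptually it's something like this dataclass sketch — the field names follow the Hydrus API's file-relationship payload, but the actual fields and defaults in the PR may differ:

```python
from dataclasses import dataclass, asdict

@dataclass
class Relationship:
    """Typed, defaulted stand-in for the old plain-dict relationship object.
    Field names are assumptions modeled on the Hydrus API payload."""
    hash_a: str = ""
    hash_b: str = ""
    relationship: int = 0  # relationship type code expected by the client API
    do_default_content_merge: bool = False

    # dict-style access so existing call sites keep working
    def __getitem__(self, key):
        return getattr(self, key)

    def __setitem__(self, key, value):
        setattr(self, key, value)

    def keys(self):
        return asdict(self).keys()
```

Because it exposes `keys()` and `__getitem__`, `dict(rel)` still works wherever the old dict was expected.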

prof-m commented 1 year ago

I should also go in and give proper docstrings to more of the new functions.