hydrusvideodeduplicator / hydrus-video-deduplicator

Video Deduplicator for the Hydrus Network
https://hydrusvideodeduplicator.github.io/hydrus-video-deduplicator/
MIT License

Discussion: using local filepaths instead of API-fetched files? #20

Closed prof-m closed 1 year ago

prof-m commented 1 year ago

Okay, first off, hot damn @appleappleapplenanner, this is really cool! Thank you for making and sharing this thing - I only got as far as designing architecture before I got distracted by other stuff, and to come back and find that someone else had both done the thing and shared the thing is incredible, I'm sincerely pumped and can't wait to try it out.

From what I understand so far, what your program does is fetch the entire file from the API, write it to a temp file, do stuff to it to generate the perceptual hash, store the perceptual hash, and then repeat ad infinitum until you've got a perceptual hash for every file.
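In sketch form, that loop looks roughly like this (a minimal illustration only: `fetch_file` and `perceptual_hash` are fabricated stand-ins, not the real Hydrus Client API calls or vpdq):

```python
import hashlib
import tempfile

def fetch_file(file_hash: str) -> bytes:
    # Hypothetical stand-in for a Hydrus Client API fetch of the full file.
    return b"fake video bytes for " + file_hash.encode()

def perceptual_hash(path: str) -> str:
    # Stand-in for the real vpdq hashing step, which reads from a file path.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def hash_all(file_hashes):
    """Fetch each file, write it to a temp file, hash it, repeat."""
    phashes = {}
    for h in file_hashes:
        data = fetch_file(h)
        # The fetched bytes hit disk here so the hasher can read them back.
        with tempfile.NamedTemporaryFile(suffix=".mp4") as tmp:
            tmp.write(data)
            tmp.flush()
            phashes[h] = perceptual_hash(tmp.name)
    return phashes
```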

When I was designing the same idea, I had decided to only use the API for querying and setting duplicates, but not for fetching the actual files. Instead, I'd planned to have the program take in the hydrus DB folder path as a command line argument, and just run ffmpeg on the original files. (Admittedly, I didn't really have to worry about transcoding - or at least, I didn't expect I would, I never actually tried)

There are definitely downsides to that approach - you'd have to run the program within the same filesystem as the hydrus DB folder, the program might stop working if hydev changes the way hydrus stores files, etc - but the upside would be that you could go from original file to individual frames for phashing entirely in memory, which seemed like it'd probably be more scalable for larger libraries.
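For illustration, the lookup I had in mind was something like the sketch below. The `client_files/fXX/<sha256>.<ext>` layout is an assumption about Hydrus internals (hence the fragility I mentioned), and the function name is made up:

```python
from pathlib import Path

def local_path_for(db_dir: str, sha256_hex: str, ext: str) -> Path:
    # Assumption: Hydrus keeps originals under client_files/fXX/<hash>.<ext>,
    # where XX is the first two hex characters of the file's SHA-256.
    # This is an internal detail and could change in a future Hydrus release.
    return (Path(db_dir) / "client_files"
            / f"f{sha256_hex[:2]}" / f"{sha256_hex}.{ext}")
```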

Do you think it might be worth adding a "local-only" option to your tool that would try to find the files in the local filesystem? Or were there other reasons you chose to go with a pure API approach for fetching files? Like, I can already see it being kind of a pain with WSL, because I know from experience that accessing Windows files from WSL and vice versa can often be a pain in the ass. But if you think it could be worth a shot and you're cool with it, I might see if I can implement a local-only option and submit a PR for it! (emphasis on the "might" - my excitement for the idea is high, but my time availability varies wildly 😅 )

appleappleapplenanner commented 1 year ago

Currently, I'm not considering the local-only option.

With the local-only option, you lose access to extremely valuable information such as queries. You also have to rescan every file/folder every time you run the program, and you become dependent on the filesystem and on how Hydrus stores files - a lot of tricky stuff I don't want to deal with. It would not feasibly work for large databases.

The limiting speed factor right now is hashing files, not IO. An SSD can write gigabytes per second, but vpdq can't hash that fast. I haven't even parallelized hashing multiple files at the same time yet, which would drastically improve speed. Besides, most modern OSes probably don't even write most videos to disk if they are deleted fast enough.
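As a rough sketch of what that parallelization could look like (the `phash` here is a cheap stand-in for vpdq, and for a hasher that truly holds the GIL a `ProcessPoolExecutor` would likely be the better fit):

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib

def phash(data: bytes) -> str:
    # Stand-in for vpdq; the real perceptual hash is far more expensive.
    return hashlib.sha256(data).hexdigest()

def hash_many(blobs, workers=4):
    # Hash several files concurrently instead of one at a time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(phash, blobs))
```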

The only advantage would be, like you said, not writing to a temp file. Writing every file is not ideal or even acceptable to me, so I'm working on a solution.

The code by Meta for vpdq, the perceptual hasher, is bad. It's really bad in about every way imaginable. I'm currently working on porting it to Python here, so that I can store the files as a SpooledTemporaryFile. It should be nearly the same speed as the C++ implementation, and it will move the project towards cross-platform compatibility.
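The SpooledTemporaryFile idea in a nutshell (a minimal sketch, not the actual port): the file stays in memory until it grows past `max_size`, and only then spills to disk, so most videos would never touch the SSD at all.

```python
import tempfile

# Keep up to 64 MiB in memory; only larger files roll over to disk.
with tempfile.SpooledTemporaryFile(max_size=64 * 1024 * 1024) as spool:
    spool.write(b"video bytes fetched from the Hydrus API")
    # _rolled is a CPython-internal flag: False means still purely in memory.
    assert not spool._rolled
    spool.seek(0)
    data = spool.read()  # hand these bytes (or the file object) to the hasher
```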

prof-m commented 1 year ago

To clarify, what I was proposing was an approach that would still use the API for most things - so you'd still get queries, etc. - it would just do the actual fetching of the file directly from the local filesystem. But you make some great points about why the limiting factor right now is file hashing and not I/O, and the solution you're working on to avoid the file write each time makes plenty of sense if you can re-implement what you need in pure Python. Thanks for taking the time to share your thoughts!