hydrusvideodeduplicator / hydrus-video-deduplicator

Video Deduplicator for the Hydrus Network
https://hydrusvideodeduplicator.github.io/hydrus-video-deduplicator/
MIT License

Phashing large files eats a lot of memory #62

Open KennethSamael opened 3 months ago

KennethSamael commented 3 months ago

Recently found myself running out of RAM during phashing, and I'm guessing it's because the entire binary content of each video file is loaded into memory before processing. Since video files can easily be several GB each, and since many files are processed in parallel, it's surprisingly easy to run out of memory. It doesn't help that queries are sorted by filesize, which guarantees the largest files in a query get processed at the same time.

For now I can avoid the issue by setting a lower job count, but it's not very user-friendly to require users to manually estimate how many jobs they can run based on the filesizes in their collection and their available memory. And it's not inconceivable for someone to have video files larger than their total available memory, which would make those files impossible to process.

I can think of two obvious solutions, but both have drawbacks:

  1. Pass the filepath to pyav so it can read the file directly from disk instead of loading it into memory first. Simple and straightforward, except that there's currently no way to determine a file's filepath through hydrus, and there might be people running hydrus on a remote device, where a local path wouldn't exist at all.
  2. Stream the file content instead of fetching it all at once. In theory this could be as simple as using video_response.raw instead of video_response.content (a rough sketch follows below), but I've experimented with this in the past, and I recall it's not always as straightforward as it should be. Something about some container formats not presenting their data in the order pyav wants to read it, I believe.
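
For reference, here is a minimal sketch of option 2 using requests and pyav. The Client API URL and parameters are placeholders, and it illustrates exactly the caveat above: `response.raw` supports `read()` but not `seek()`, so containers that keep their index at the end of the file (e.g. MP4 with a trailing moov atom) can't be demuxed this way.

```python
# Minimal sketch of streaming a file into pyav. file_url/params are
# placeholders for the Hydrus Client API request. Note: response.raw is
# not seekable, so containers whose index sits at the end of the file
# will fail to demux from a raw stream.
import av
import requests

def phash_from_stream(file_url: str, params: dict) -> None:
    with requests.get(file_url, params=params, stream=True) as response:
        response.raise_for_status()
        response.raw.decode_content = True  # transparently decompress if needed
        # av.open() accepts any file-like object with read()
        with av.open(response.raw) as container:
            for frame in container.decode(video=0):
                pass  # feed each decoded frame to the phash here
```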
ianwal commented 3 months ago

Adding a dynamic thread count that adjusts based on the ratio of available memory to file size is feasible.
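
As a rough illustration of what that could look like (psutil and the 80% budget are assumptions for the sketch, not anything in the codebase):

```python
# Sketch: cap the job count by available memory, assuming psutil is
# installed and that peak memory per job is roughly the size of the
# largest file in the batch (the whole file is held in RAM today).
import psutil

def effective_job_count(largest_file_bytes: int, requested_jobs: int) -> int:
    available = psutil.virtual_memory().available
    budget = int(available * 0.8)  # arbitrary safety margin for decode buffers
    jobs_that_fit = max(1, budget // max(1, largest_file_bytes))
    return min(requested_jobs, jobs_that_fit)
```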

Alternatively, we could hash the frames of one video concurrently, rather than hashing multiple videos concurrently. This is what the C++ implementation does.
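
A sketch of that frame-level approach, so only one file's frames are resident at a time. The sampling interval and the toy `hash_frame` are illustrative stand-ins, not the project's actual sampling or perceptual hash:

```python
# Hash the sampled frames of a single video in parallel. sample_every and
# hash_frame are illustrative stand-ins for the real sampling and phash.
from concurrent.futures import ThreadPoolExecutor

import av
import numpy as np

def hash_frame(gray: np.ndarray) -> int:
    # Toy 8x8 average hash, standing in for the real per-frame phash.
    h, w = gray.shape
    small = gray[:: max(1, h // 8), :: max(1, w // 8)][:8, :8]
    bits = (small > small.mean()).flatten()
    return int("".join("1" if b else "0" for b in bits), 2)

def phash_video(path: str, workers: int = 4, sample_every: int = 30) -> list:
    with av.open(path) as container:
        # Keep only every Nth frame so the sampled set stays small in memory.
        sampled = [
            frame.to_ndarray(format="gray")
            for i, frame in enumerate(container.decode(video=0))
            if i % sample_every == 0
        ]
    # Hash this one video's sampled frames concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(hash_frame, sampled))
```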

Streaming files is out of the question, since it would probably be unstable, slow, and complicated. Media containers are tricky, and I don't want to introduce any parsing issues there.