hydrusvideodeduplicator / hydrus-video-deduplicator

Video Deduplicator for the Hydrus Network
https://hydrusvideodeduplicator.github.io/hydrus-video-deduplicator/
MIT License

Phashing large files eats a lot of memory #62

Open KennethSamael opened 3 months ago

KennethSamael commented 3 months ago

Recently found myself running out of RAM during phashing, and I'm guessing it's because the entire binary content of each video file is loaded into memory before processing. Since video files can easily be several GB each, and since many files are processed in parallel, it's surprisingly easy to run out of memory. It doesn't help that queries are sorted by filesize, which guarantees the largest files in a query get processed at the same time.

For now I can avoid the issue by setting a lower job count, but it's not very user-friendly to require users to manually estimate how many jobs they can run based on the filesizes in their collection and their available memory. And it's not inconceivable for someone to have video files larger than their total available memory, which would make those files impossible to process.

I can think of two obvious solutions, but both have drawbacks:

  1. Pass the filepath to pyav so it can read the file directly from disk instead of loading it into memory first. Simple and straightforward, except that there's currently no way to determine a file's filepath through hydrus, and there might be people running hydrus on a remote device, where a local path wouldn't exist at all.
  2. Stream the file content instead of fetching it all at once. In theory this could be as simple as using video_response.raw instead of video_response.content (a rough sketch follows below), but I've experimented with this in the past, and I recall it's not always as straightforward as it should be. Something about some container formats not presenting their data in the order pyav wants to read it, I believe.
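
For reference, here is a minimal sketch of option 2 using requests and pyav. The Client API URL and parameters are placeholders, and it illustrates exactly the caveat above: `response.raw` supports `read()` but not `seek()`, so containers that keep their index at the end of the file (e.g. MP4 with a trailing moov atom) can't be demuxed this way.

```python
# Minimal sketch of streaming a file into pyav. file_url/params are
# placeholders for the Hydrus Client API request. Note: response.raw is
# not seekable, so containers whose index sits at the end of the file
# will fail to demux from a raw stream.
import av
import requests

def phash_from_stream(file_url: str, params: dict) -> None:
    with requests.get(file_url, params=params, stream=True) as response:
        response.raise_for_status()
        response.raw.decode_content = True  # transparently decompress if needed
        # av.open() accepts any file-like object with read()
        with av.open(response.raw) as container:
            for frame in container.decode(video=0):
                pass  # feed each decoded frame to the phash here
```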
ianwal commented 3 months ago

Adding a dynamic thread count that adjusts based on the ratio of available memory to file size is feasible.
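
As a rough illustration of what that could look like (psutil and the 80% budget are assumptions for the sketch, not anything in the codebase):

```python
# Sketch: cap the job count by available memory, assuming psutil is
# installed and that peak memory per job is roughly the size of the
# largest file in the batch (the whole file is held in RAM today).
import psutil

def effective_job_count(largest_file_bytes: int, requested_jobs: int) -> int:
    available = psutil.virtual_memory().available
    budget = int(available * 0.8)  # arbitrary safety margin for decode buffers
    jobs_that_fit = max(1, budget // max(1, largest_file_bytes))
    return min(requested_jobs, jobs_that_fit)
```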

Alternatively, we could hash the frames of one video concurrently, rather than hashing multiple videos concurrently. This is what the C++ implementation does.
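
A sketch of that frame-level approach, so only one file's frames are resident at a time. The sampling interval and the toy `hash_frame` are illustrative stand-ins, not the project's actual sampling or perceptual hash:

```python
# Hash the sampled frames of a single video in parallel. sample_every and
# hash_frame are illustrative stand-ins for the real sampling and phash.
from concurrent.futures import ThreadPoolExecutor

import av
import numpy as np

def hash_frame(gray: np.ndarray) -> int:
    # Toy 8x8 average hash, standing in for the real per-frame phash.
    h, w = gray.shape
    small = gray[:: max(1, h // 8), :: max(1, w // 8)][:8, :8]
    bits = (small > small.mean()).flatten()
    return int("".join("1" if b else "0" for b in bits), 2)

def phash_video(path: str, workers: int = 4, sample_every: int = 30) -> list:
    with av.open(path) as container:
        # Keep only every Nth frame so the sampled set stays small in memory.
        sampled = [
            frame.to_ndarray(format="gray")
            for i, frame in enumerate(container.decode(video=0))
            if i % sample_every == 0
        ]
    # Hash this one video's sampled frames concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(hash_frame, sampled))
```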

Streaming files is out of the question, since it would probably be unstable, slow, and complicated. Media containers are tricky, and I don't want to introduce any parsing issues there.