Open · KennethSamael opened this issue 3 months ago
Adding a dynamic thread count that adjusts based on the ratio of available memory to file size is feasible.

Alternatively, we could hash the frames of a single video concurrently rather than hashing multiple videos in parallel. This is what the C++ implementation does.

Streaming files is out of the question: it would probably be unstable, slow, and complicated. Media containers are tricky, and I don't want to introduce any parsing issues that way.
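The dynamic thread count idea could be sketched roughly like this (a minimal sketch; the function name and the largest-first sizing are my own assumptions, not from the codebase): pick the largest job count whose biggest concurrently running files still fit in available memory.

```python
def dynamic_job_count(available_bytes, file_sizes, max_jobs=8):
    """Largest job count n such that the n biggest files in the queue
    fit in memory together (queries are sorted by filesize, so the
    biggest files end up running at the same time)."""
    if not file_sizes:
        return 1
    biggest = sorted(file_sizes, reverse=True)
    jobs = 1  # always allow at least one job, even for oversized files
    for n in range(2, min(max_jobs, len(biggest)) + 1):
        if sum(biggest[:n]) > available_bytes:
            break
        jobs = n
    return jobs
```

For `available_bytes`, on Linux something like `os.sysconf('SC_AVPHYS_PAGES') * os.sysconf('SC_PAGE_SIZE')` would work; cross-platform, psutil's `virtual_memory().available` is the usual choice.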
Recently found myself running out of RAM during phashing, and I'm guessing it's because the entire binary content of each video file is loaded into memory before processing. Since video files can easily take up several GB each, and since many files are processed in parallel, it's surprisingly easy to run out of memory. It doesn't help that queries are sorted by filesize, ensuring that the largest files in a query are processed at the same time.

For now, I can avoid the issue by setting a lower job count, but it's not very user-friendly to require users to manually estimate how many jobs they can run based on the filesizes in their collection and their available memory. And it's not inconceivable for someone to have video files that are larger than their total available memory, which would make those files impossible to process.

I can think of two obvious solutions, but both have drawbacks:
- Use `video_response.raw` instead of `video_response.content`, but I've messed with this in the past, and I recall that it's not always as straightforward as it should be. Something about some video formats not presenting data in the same order pyav wants to read it, I believe.
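A possible middle ground between lowering the job count and streaming would be a shared memory budget: keep the jobs parallel, but make each one reserve its file's size in bytes before loading, blocking while the cap is exceeded. This is only a hedged sketch (the `MemoryBudget` class and its API are hypothetical, not part of the codebase); clamping a reservation to the capacity also lets a file larger than total memory still run, just by itself:

```python
import threading

class MemoryBudget:
    """Cap the total bytes held in memory by concurrent jobs."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self._cond = threading.Condition()

    def acquire(self, nbytes):
        # Clamp so a file bigger than the whole budget can still
        # run (alone) instead of deadlocking forever.
        nbytes = min(nbytes, self.capacity)
        with self._cond:
            while self.used + nbytes > self.capacity:
                self._cond.wait()
            self.used += nbytes
        return nbytes  # caller releases exactly this amount

    def release(self, nbytes):
        with self._cond:
            self.used -= nbytes
            self._cond.notify_all()
```

Each worker would then reserve `os.path.getsize(path)` before reading the file into memory and release it after hashing, so the job count can stay high without the sum of loaded files blowing past the cap.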