Significant performance difference between open_file and open_buffer

emield12 commented 2 years ago

I noticed that the speed of reading (and postprocessing) files with rawpy differs significantly depending on how you open the file. Passing a filename to rawpy.imread() is about 3 times slower than passing a file object; see benchmarks below. When a filename is passed, the open_file method is called, which uses libraw to open and read the file. When a file object is passed, the open_buffer method is used instead, which reads the whole file directly with .read(). I have not yet investigated why opening the file with libraw is significantly slower.
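A comparison of the two code paths can be timed along the following lines. This is only a minimal sketch, not the benchmark referred to above: the helper names (time_calls, read_by_name, read_by_file_object, compare) are made up for illustration, and it assumes rawpy is installed and that the path points to a raw file.

```python
import time


def time_calls(fn, repeats=3):
    """Call fn `repeats` times and return the wall-clock time of each call in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


def read_by_name(path):
    """Pass a filename, so rawpy goes through open_file (libraw reads the file)."""
    import rawpy  # imported lazily; only needed when this function is called
    with rawpy.imread(path) as raw:
        raw.postprocess()


def read_by_file_object(path):
    """Pass a file object, so rawpy goes through open_buffer (one .read() call)."""
    import rawpy
    with open(path, "rb") as f:
        with rawpy.imread(f) as raw:
            raw.postprocess()


def compare(path, repeats=3):
    """Return per-run timings for both ways of opening the same file."""
    return {
        "filename": time_calls(lambda: read_by_name(path), repeats),
        "file object": time_calls(lambda: read_by_file_object(path), repeats),
    }
```

All per-run timings are returned rather than a single number because the operating system may cache the file after the first read, which can make later runs look faster than the first.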

letmaik commented 2 years ago

Interesting, I can't reproduce that on Windows at least. I tested with RAW_CANON_5DMARK2_PREPROD.CR2 and iss030e122639.NEF from the test/ folder. Can you share more about your environment?

emield12 commented 2 years ago

I investigated further and I only observe this problem when the files are on network storage. When the files are stored locally, there is no performance difference. I am using a network-attached drive over a gigabit connection, and I can clearly see that with the open_file method the network bandwidth is not fully used (it averages around 11-12 MB/s), whereas with the open_buffer method it goes above 100 MB/s. The file I use for testing is 60 MB, so the difference in transfer time should be ~5 s, which is approximately the observed difference. As for my environment, I am using Arch Linux with rawpy==0.16.0. Is there anything else you want to know?
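The ~5 s estimate follows directly from the figures above. A small sketch of the arithmetic, assuming 11.5 MB/s as the midpoint of the observed 11-12 MB/s and 100 MB/s as the lower bound of the faster case:

```python
FILE_SIZE_MB = 60.0   # size of the test file
SLOW_MB_S = 11.5      # assumed midpoint of the 11-12 MB/s seen with open_file
FAST_MB_S = 100.0     # assumed lower bound of the rate seen with open_buffer


def transfer_seconds(size_mb, rate_mb_s):
    """Time needed to move size_mb megabytes at rate_mb_s megabytes per second."""
    return size_mb / rate_mb_s


slow = transfer_seconds(FILE_SIZE_MB, SLOW_MB_S)  # roughly 5.2 s
fast = transfer_seconds(FILE_SIZE_MB, FAST_MB_S)  # 0.6 s
difference = slow - fast                          # roughly 4.6 s

print(f"open_file: {slow:.1f} s, open_buffer: {fast:.1f} s, difference: {difference:.1f} s")
```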

letmaik commented 2 years ago

I think this behaviour is then expected to some degree. libraw doesn't load the full file into memory before processing it; instead it reads the parts it needs incrementally. This results in more random accesses than the single operation of loading the file fully, which is slower, by an amount that depends on the I/O buffering done by the operating system. The advantage is that it likely uses less memory during processing. It would be interesting to figure out what the difference is in practice. There may be scope for libraw to optimize data access further, but it will always be slower than loading the file fully first. I think both cases are useful to support, especially if your raw image is very large.

Realistically, libraw won't put much more effort into optimizing I/O, so the only options here are to add a note to rawpy's documentation or to add an optional flag to imread that buffers the file fully, as you do when using open. I'm not a fan of the latter, since it's an edge case and extending the API surface just for that seems suboptimal in terms of maintenance. A more extreme change would be to always buffer in imread, but I'm only comfortable doing that if memory usage doesn't increase by much on all supported platforms (macOS, Linux, Windows). I know that some users also build rawpy manually to run on a Raspberry Pi, which has less memory in general.
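Until something like that is documented, the buffering can be done on the caller's side without any change to rawpy. A minimal sketch, assuming the whole file fits comfortably in memory; read_into_buffer and imread_buffered are hypothetical helper names, not part of rawpy:

```python
import io
from pathlib import Path


def read_into_buffer(path):
    """Read the whole file in one pass and return it as a seekable in-memory file object."""
    return io.BytesIO(Path(path).read_bytes())


def imread_buffered(path):
    """Hypothetical wrapper: hand rawpy a file object so that it uses open_buffer."""
    import rawpy  # imported lazily so read_into_buffer works without rawpy installed
    return rawpy.imread(read_into_buffer(path))
```

The trade-off is the one described above: the file is held in memory in full for as long as it is being processed, which may matter for very large raw images or on devices with little RAM.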

letmaik commented 2 years ago

I'm closing this as it's working as expected and can likely be solved by documentation alone. Feel free to put in a PR for that if you like, though; it's not a bad idea.

letmaik / rawpy

Significant performance difference between open_file and open_buffer #177