rhpvorderman opened 1 year ago
To complement the rambling above, here is a systematic overview of why the BlocksOutputBuffer is inefficient.
The arrange_output_buffer function works by creating a single bytes object of 16 KiB and doubling it (32 KiB, 64 KiB, etc.) whenever it fills up. At the end of the function it resizes the bytes object down to the size of the contents.
The BlocksOutputBuffer works by creating a list and appending a bytes object (block) to it. When that block is full, a new block is appended. At the end the last block is resized to its filled length and a PyBytes_Join is performed on the list.
| size of output | BlocksOutputBuffer | arrange_output_buffer |
|---|---|---|
| <16k | 1x PyBytes_New; 1x _PyBytes_Resize; 1x PyList_New | 1x PyBytes_New; 1x _PyBytes_Resize |
| <32k | 1x PyBytes_New; 1x _PyBytes_Resize; 1x PyList_New | 1x PyBytes_New; 2x _PyBytes_Resize |
| <64k | 2x PyBytes_New; 1x PyList_Resize; 1x _PyBytes_Resize; 1x PyList_New; 1x PyBytes_Join | 1x PyBytes_New; 3x _PyBytes_Resize |
| <128k | 3x PyBytes_New; 2x PyList_Resize; 1x _PyBytes_Resize; 1x PyList_New; 1x PyBytes_Join | 1x PyBytes_New; 4x _PyBytes_Resize |
BlocksOutputBuffer does much more work initially. However, at much larger buffer sizes the cost of PyList_Resize decreases due to amortization. Conversely, arrange_output_buffer's calls to _PyBytes_Resize are initially very cheap, because the reallocation can often be done in place and the same pointer is returned. When the buffer gets larger, resizing almost always means allocating a new bytes object plus an additional memcpy to transfer the data. So it is clear why BlocksOutputBuffer performs better as the output gets larger, and also why arrange_output_buffer will always perform better at smaller output sizes.
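To make the call counts in the table above easy to verify, here is a small Python sketch of my own (an illustrative model, not code taken from CPython) that counts the C-API calls each strategy makes for a given output size, assuming the 16 KiB doubling start of arrange_output_buffer and the 32 KiB / 64 KiB / 256 KiB / ... block growth of BlocksOutputBuffer:

```python
DEF_BUF_SIZE = 16 * 1024  # initial output buffer size of the 3.9 strategy

def arrange_output_buffer_calls(output_size):
    """One PyBytes_New, one _PyBytes_Resize per doubling, plus the final
    shrink to the exact content size."""
    calls = {"PyBytes_New": 1, "_PyBytes_Resize": 1}
    size = DEF_BUF_SIZE
    while size < output_size:
        size *= 2
        calls["_PyBytes_Resize"] += 1
    return calls

def blocks_output_buffer_calls(output_size):
    """One PyList_New, one PyBytes_New per block, a list resize for every
    block after the first, one final resize of the last block, and a join
    when more than one block was needed."""
    block_sizes_kib = [32, 64, 256, 1024, 4096]  # growth steps of the blocks
    calls = {"PyList_New": 1, "PyBytes_New": 0, "PyList_Resize": 0,
             "_PyBytes_Resize": 1, "PyBytes_Join": 0}
    allocated, i = 0, 0
    while allocated < output_size:
        calls["PyBytes_New"] += 1
        if i > 0:
            calls["PyList_Resize"] += 1
            calls["PyBytes_Join"] = 1
        allocated += block_sizes_kib[min(i, len(block_sizes_kib) - 1)] * 1024
        i += 1
    return calls

for size in (10 * 1024, 60 * 1024, 120 * 1024):
    print(size, arrange_output_buffer_calls(size), blocks_output_buffer_calls(size))
```

Running it for 10K, 60K and 120K outputs reproduces the corresponding rows of the table.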
So how about memory usage? It turns out that for small data the BlocksOutputBuffer also allocates more aggressively:
```c
/* According to the block sizes defined by BUFFER_BLOCK_SIZE, the whole
   allocated size growth step is:
    1   32 KB       +32 KB
    2   96 KB       +64 KB
    3   352 KB      +256 KB
    4   1.34 MB     +1 MB
    5   5.34 MB     +4 MB
*/
```
So there is less overallocation when using arrange_output_buffer as well: for a 40 KB result, for example, the doubling strategy peaks at 64 KB allocated, while the block strategy has allocated 96 KB (a 32 KB plus a 64 KB block).
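As a rough illustration of that point (again my own sketch, using the growth steps quoted above and the 16 KiB doubling start), the total each strategy has allocated for a 40 KB result can be computed like this:

```python
def blocks_total_allocated(output_size):
    """Total bytes allocated by the block strategy (32, 64, 256, ... KiB
    blocks) once the output has reached output_size bytes."""
    block_sizes_kib = [32, 64, 256, 1024, 4096]
    total, i = 0, 0
    while total < output_size:
        total += block_sizes_kib[min(i, len(block_sizes_kib) - 1)] * 1024
        i += 1
    return total

def doubling_total_allocated(output_size, initial=16 * 1024):
    """Size of the single bytes object in the doubling strategy before the
    final shrink to the content size."""
    size = initial
    while size < output_size:
        size *= 2
    return size

print(blocks_total_allocated(40 * 1024))    # 98304  (one 32 KiB + one 64 KiB block)
print(doubling_total_allocated(40 * 1024))  # 65536  (16 -> 32 -> 64 KiB)
```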
ping
@Yhg1s, @gpshead (as zlib experts)
cc @animalize (as the author of gh-21740)
ping
This issue is not urgent (and, being based on microbenchmarks, it is not entirely clear there is an actual issue).
I recognize the added code complexity with the additional list-of-bytes buffer code in place almost everywhere now. But if you want to make an argument against it, a word of warning about how to communicate this: Tossing around words like "pathological" and "common case" with only a microbenchmark as data to show for the usage pattern you are biased towards comes off as aggressively antagonistic rather than persuasive.
So let's jump back to actual real-world, non-microbenchmark data: what does running the pyperformance suite show for main at your PR #101279 branch point vs. the PR? That represents practical applications.
Accepting PR #101279 would be flip-flopping back to another way of doing things with users who perceive a greater need for one implementation just pointing fingers at the others. So I've marked the PR as draft and do-not-merge for the time being just so it is clear to other committers who may see it that we shouldn't assume that PR is the direction we want to go without further discussion and non-arbitrary data on the issue.
When it comes to optimizations around buffering, one implementation may not fit all needs. Implementations seeking to eke out the last few percent of performance sometimes contain multiple strategies and dynamically pick which to use based on heuristics. That'd also be a possible viable outcome that could satisfy both performance desires here. But in the absence of relevant non-contrived microbenchmarks, making that judgement call isn't something we should just do on a whim. Different applications have different needs and API call patterns.
> But if you want to make an argument against it, a word of warning about how to communicate this: Tossing around words like "pathological" and "common case" with only a microbenchmark as data to show for the usage pattern you are biased towards comes off as aggressively antagonistic rather than persuasive.
I am so sorry. I did not intend to be so abrasive. @gpshead thank you for calling me out on this. @animalize, I hope you will accept my apology for my behavior. This was certainly not an acceptable way to communicate and I do respect the work you have done on this feature.
> When it comes to optimizations around buffering, one implementation may not fit all needs. [...] Different applications have different needs and API call patterns.
Yes, the current implementation has the advantage of not overallocating when very large compressed files are decompressed entirely in memory. The fact that I don't know of such use cases will not make them magically disappear.
Given that the gzip application already uses an optimized buffer-allocation strategy for that particular use case, this is indeed not urgent. I will benchmark it more thoroughly with real-world applications and come back with my findings.
Bug report
The _BlocksOutputBuffer was introduced for all compression modules (bz2, lzma and zlib) in Python 3.10: https://github.com/python/cpython/pull/21740 .
It delivers on its advertised quality: speedy and memory-efficient creation of very large in-memory buffers.
However, it came at a cost:
The common use case is small objects in memory. Having large in-memory buffers (multiple gigabytes) is an anti-pattern; streaming interfaces, which in turn use small buffers, should be used instead.
For a small buffer the BlocksOutputBuffer needs to create a list and a bytes object, while the 3.9 method only creates the bytes object. When the initial bytes object is too small, the BlocksOutputBuffer creates another bytes object and resizes the list, whereas the 3.9 method simply resizes the bytes object. The 3.9 method does not scale well to a very large number of resizes while the BlocksOutputBuffer does (thanks to amortization of the list resizes and the fact that it never resizes and memcpys existing blocks), but this only applies to very large buffers, and those should be rare.
This can be shown by removing the BlocksOutputBuffer and reverting to the 3.9 method of arranging the output buffer. I have made a branch here: https://github.com/rhpvorderman/cpython/tree/noblocks .
Microbenchmarks: taking the current README.rst (10044 bytes) and compressing with compression level 1. BlocksOutputBuffer:
arrange_output_buffer:
Also when taking a larger file, Lib/_pydecimal.py (229220 bytes uncompressed, 60974 bytes compressed), which certainly requires resizing the initial 16K buffer. BlocksOutputBuffer:
arrange_output_buffer:
Q.E.D. BlocksOutputBuffer always loses against simple bytes resizing for smaller buffers, and _pydecimal.py is more than 200K, so that is already quite a big input.
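The timeit output itself is not reproduced here, but this kind of measurement can be rerun with something along the following lines (my own sketch, not necessarily the exact invocation used above); run it once with an interpreter built from main and once with one built from the noblocks branch and compare:

```python
import timeit
import zlib

# README.rst or Lib/_pydecimal.py, as in the description above
with open("README.rst", "rb") as f:
    data = f.read()

timings = timeit.repeat(
    "zlib.compress(data, 1)",                 # compression level 1
    globals={"zlib": zlib, "data": data},
    repeat=5,
    number=1000,
)
print(min(timings) / 1000 * 1e6, "microseconds per call")
```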
Additionally:
And this is not counting the removal of the extra header file that provides the BlocksOutputBuffer.
The BlocksOutputBuffer is a nice piece of work when taken in isolation, but it optimizes for the pathological case rather than the common case. It is detrimental to the common case and it requires a lot of extra code that needs to be maintained.
If there are issues with slow performance on larger buffers with arrange_output_buffer, I think several optimizations can still be done that are not as invasive as the _BlocksOutputBuffer. For instance, when zlib.decompress is called on a 100 MB object, it makes sense to start the output buffer at 100 MB rather than at 16K (the default in 3.9). This severely limits the number of resizes required. The same goes for zlib.compress: if the input is 100 MB, the output is going to be at most roughly 100 MB plus header and trailer size. Reserving 100 MB first and then downscaling to the compressed size (say 20 MB) is much quicker than growing a 16K buffer to 20 MB by doublings alone.
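To put a number on how much starting at the input size would help, here is a small sketch (my own helper and example figures, not CPython code) counting the _PyBytes_Resize calls a doubling strategy needs depending on where it starts:

```python
DEF_BUF_SIZE = 16 * 1024  # the 3.9 default initial output buffer size

def resizes_needed(initial_size, final_size):
    """Number of _PyBytes_Resize calls needed to grow the output buffer from
    initial_size to at least final_size by doubling, plus the one final
    shrink to the exact content size."""
    n, size = 0, initial_size
    while size < final_size:
        size *= 2
        n += 1
    return n + 1

# Decompressing a 100 MB input that expands to, say, 300 MB:
print(resizes_needed(DEF_BUF_SIZE, 300 * 1024 * 1024))       # starting at 16 KiB
print(resizes_needed(100 * 1024 * 1024, 300 * 1024 * 1024))  # starting at the input size
```

Fewer resizes also means fewer of the large reallocation-plus-memcpy steps discussed in the cost analysis earlier in this issue.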