Open gpshead opened 1 month ago
I think this is a good idea, but it seems that relying too much on external C libraries may cause instability (such as many linking and compilation problems). Do you have any better solution to this problem?
We're not really concerned about that. We aleady link against a bunch of libraries and have vendored a few (as much as we try to avoid that), including some involving SIMD code, in the past.
For the performances themselves, they are worth the inclusion IMO (base64 is very common in web applications)
I don't think it is currently widely available
The default base64 lib comes with coreutils IIRC (at least on most Linux distributions). But for SIMD support, I don't know of any that is well-known (there is an Alpine Linux package for that though which directly uses the lib you linked but well.. that's the only one).
yuck, but ideally only temporary until distros pick it up as a package of its own
Yes, but that could give distros more confidence and reasons in pushing it into a proper libbase64-devel
package. We could also cut off some parts that are not needed and platforms that are not supported
bytes.translate, bytearray.translate, or str.translate might benefit from similar SIMD treatment
I think it's a good idea to make built-in functions faster when possible if the cost of introducing the overall machinary is not too much (but yes, let's open a separate issue if needs arise).
From a general approach, I am not sure it's worth maintaining SIMD-related logic in CPython. (I know that we already have such logic from some modules, but most of them are vendored libraries..) There have been several attempts to accelerate the JSON module by using SIMD too (and they are actively managed by 3rd parties). Are we going to do the same thing? or do we have any criteria for what to accelerate from CPython or what to accelerate from 3rd party side?
FWIW, we already have SIMD and/or accelerated instruction use in hashlib code (and I expect to add more, as HACL* can provide it to us - our goal is to eliminate the need to link hashlib against openssl).
base64
is a much simpler obvious use for SIMD as the algorithm is dead simple and clearly amenable to doing a lot of well patterned simple math at once. And is frequently done on bulk data due to how common standards embed binary in text using it.
It feels like the kind of thing that a C version of, if written in the appropriate manner, might get a compiler to auto-vectorize? But tricking compilers into doing that for you instead of explicitly just doing what you wanted to happen can be less maintainable.
In general though I wouldn't reach for SIMD on things like our main data types unless it demonstrated a meaningful measurable performance suite win. (bytes.translate is rather niche in that regard)
JSON is a similarly important widely used bulk data format. A lot more complicated to use SIMD on. There are a variety of libraries out there doing it. If anyone wants to consider a behavior compatible version doing that as the json
stdlib module, that belongs in its own issue.
A more interesting library to consider linking against than what I opened with may be https://github.com/simdutf/simdutf which covers UTF8 & friends as well as base64. (though it is C++ and I believe we've so far avoided requiring C++ linkage within CPython itself?)
Unrelated other than the SIMD theme - https://pypi.org/project/stringzilla/. It does things that even adopting simplest conservative versions of them for use by our CPython internals in a compatible manner might be enough to show up on the performance suite. Someone would need to take on trying it and testing to find out.
do we have any criteria for what to accelerate from CPython or what to accelerate from 3rd party side?
My #.1 criteria is maintainability. #.2 criteria is "is it widely used on enough data to matter?" with a #.3 feeding into that of "Would doing this reduce the need to train people to always use a special PyPI package instead of the stdlib for their common task that either can do because of a performance concern?"
Performance enhancement
Proposal:
https://pypi.org/project/pybase64/ aka https://github.com/mayeut/pybase64 (BSD licensed) exists. On top of some of its own SIMD code for base64 module extra features (character translation)^, it links against https://github.com/aklomp/base64, a BSD licensed C99 library with SIMD acceleration giving 5-20x performance on base64 encoding and decoding operations vs our existing generic byte based base64 C code.
We could adopt a bunch of the pybase64 code to make the default base64 module experience better - it is relatively straight forward extension module code (as one would expect). On the other hand, I expect pybase64 to still be where new development and further improvements in this space continue to happen as people who care strongly about performance need the latest and greatest from PyPI regardless of their current CPython version. (looping in @mayeut for thoughts on that)
Practicalities: Library availability? we'd vendor a libbase64 build for use on our binary distributions. I don't think it is currently widely available (? I only did a quick search on Ubuntu) as a package on Linux distributions though so we'd currently need to vendor our own copy in tree to be fair and match the good performance there (yuck, but ideally only temporary until distros pick it up as a package of its own, consider it similar to a Modules/_decimal/libmpdec/ situation - our configure.ac finds an installed one & distros link against that)
Risks: It is a new C library dependency. Security concerns within it thus become our own. As
base64
is frequently used to process untrusted input. But its surface of possible problems is limited (very simple data format). We should ensure the library gets proper oss-fuzz test coverage before adoption (@aklomp for visibility).^
bytes.translate
,bytearray.translate
, orstr.translate
might benefit from similar SIMD treatment - which would be better from a CPython perspective than only doing that within this module? If so, lets file a new issue just for that bit.Has this already been discussed elsewhere?
No response given
Links to previous discussion of this feature:
If we spawn Discuss threads around this, lets edit and drop links here.