vstinner opened 1 month ago
cc @serhiy-storchaka @encukou
cc @davidism
I believe we can do better. I'll write a longer reply this week.
> I believe we can do better.
My gut feeling as well. I don't think we want to encourage authors to specialise their own code when using the limited API. An efficient, abstraction-agnostic, multiple find-replace should be our job (if it's an important scenario, which I'm inclined to think it probably is).
> My gut feeling as well. I don't think we want to encourage authors to specialise their own code when using the limited API.
C extensions already specialize their code for the Unicode native format (ex: MarkupSafe). The problem is that they cannot update their code to the limited API because of missing features to export to/import from native formats.
> An efficient, abstraction-agnostic, multiple find-replace should be our job (if it's an important scenario, which I'm inclined to think it probably is).
That sounds like a very specific API solving only one use case.
> C extensions already specialize their code for the Unicode native format
Which means we forced them into it, not that they wanted to. Just because we did something not great for them in the past means we have to double-down on it into the future.
> That sounds like a very specific API solving only one use case.
The same could be said about your proposed API. Do you have more use cases?
I'm not going to argue heavily about this case until Petr provides his writeup. I haven't given it a huge amount of thought, just felt it was worth adding a +1 to "this doesn't feel great".
> The same could be said about your proposed API.
What do you mean? I don't force users to use this API. It's an API which can be used to specialize code for UCS1, UCS2 and UCS4: it's the only way to write the most efficient code for the current implementation of Python.
> Do you have more use cases?
Python itself has a wide library to specialize code for UCS1 / UCS2 / UCS4 in Objects/stringlib/:
Objects/stringlib/asciilib.h
Objects/stringlib/codecs.h
Objects/stringlib/count.h
Objects/stringlib/ctype.h
Objects/stringlib/eq.h
Objects/stringlib/fastsearch.h
Objects/stringlib/find.h
Objects/stringlib/find_max_char.h
Objects/stringlib/join.h
Objects/stringlib/localeutil.h
Objects/stringlib/partition.h
Objects/stringlib/replace.h
Objects/stringlib/split.h
Objects/stringlib/stringdefs.h
Objects/stringlib/transmogrify.h
Objects/stringlib/ucs1lib.h
Objects/stringlib/ucs2lib.h
Objects/stringlib/ucs4lib.h
Objects/stringlib/undef.h
Objects/stringlib/unicode_format.h
There are different operations which are already optimized there. The problem is that the PyUnicode_WRITE_CHAR() macro is implemented with 2 tests per character, which is inefficient. Well, in most cases, PyUnicode_WRITE_CHAR() is good enough. But not when you want the optimal code, such as MarkupSafe which has a single function optimized in C.
There are 27 projects using PyUnicode_FromKindAndData(), according to a code search in the PyPI top 7,500 projects (2024-03-16).
There are already other, less efficient, ways to export a string, depending on its maximum character. But some of these functions are excluded from the limited C API.
The proposed PyUnicode_AsNativeFormat() has a complexity of O(1), which is an important property, whereas these functions have at least a complexity of O(n), if not worse.
Note: I don't understand why there is no PyUnicode_FromUCS4() function, so it's not possible to re-import a UCS4 string with a dedicated function. Importing UCS4 can be done using PyUnicode_FromKindAndData() with PyUnicode_4BYTE_KIND.
> it's the only way to write the most efficient code for the current implementation of Python
"Most efficient code" is not the job of the limited API. At a certain level of exposure to internals, you need to stop using the limited API, or else accept that your performance will be reduced.
If the AsUTF8/FromUTF8 functions are not fast enough for markupsafe's multi-replace function, then as I said, I'd entertain a multi-replace function that doesn't leak implementation details, because that suits the limited API. I'd also consider some kind of builder API, which I believe you also have a proposal for.
But this proposal is literally about leaking implementation details. That is against the intent of the limited API, and so I am against the proposal.
Thanks for making me think about it, you've upgraded me from "unsure" to "sure" ;) Still looking forward to Petr's thoughts.
@davidhewitt: Would Rust benefit from such "native format" API? Or does Rust just prefer UTF-8 and then decode UTF-8 in Python?
@da-woods: Would Cython benefit from such "native format" API? I see that Cython uses the following code in __Pyx_PyUnicode_Substring():

```c
#if CYTHON_COMPILING_IN_LIMITED_API
    // PyUnicode_Substring() does not support negative indexing but is otherwise fine to use.
    return PyUnicode_Substring(text, start, stop);
#else
    return PyUnicode_FromKindAndData(PyUnicode_KIND(text),
        PyUnicode_1BYTE_DATA(text) + start*PyUnicode_KIND(text), stop-start);
#endif
```
At the very least, please rename it to "InternalFormat" instead of "NativeFormat".
At first I thought this was going to be a great API for following the conventions of the native machine, since the name misled me. It took me a few reads to figure out that it was telling me the format, not that I was getting to choose it.
I looked at how some of these projects use PyUnicode_FromKindAndData.

* Uses PyUnicode_Substring instead of PyUnicode_FromKindAndData in the limited API:
  https://github.com/cython/cython/blob/cf10ea12e3fc3637cf14e1bf962b295f53cc1a50/Cython/Utility/StringTools.c#L562-L568
* Uses PyUnicode_FromKindAndData with PyUnicode_4BYTE_KIND. It could use PyUnicode_DecodeUTF32:
  https://github.com/rapidfuzz/Levenshtein/blob/0c6d3dc4f3cdf80ce19392fd502733d80e6a5a5d/src/Levenshtein/levenshtein_cpp.pyx#L101
* Uses PyUnicode_FromKindAndData with PyUnicode_2BYTE_KIND. It could use PyUnicode_DecodeUTF16. How does Qt5 represent non-BMP characters? If it uses UTF-16, then PyUnicode_DecodeUTF16 is the only correct solution:
  https://github.com/baoboa/pyqt5/blob/11d5f43bc6f213d9d60272f3954a0048569cfc7c/designer/pluginloader.cpp#L174
  https://github.com/baoboa/pyqt5/blob/11d5f43bc6f213d9d60272f3954a0048569cfc7c/qmlscene/pluginloader.cpp#L336
* Uses PyUnicode_FromKindAndData with PyUnicode_4BYTE_KIND:
  https://github.com/MKuranowski/aiocsv/blob/0d8499a68cfe60b38d9f8fbab8c0b0c9f1b54c4a/aiocsv/_parser.c#L464
* Uses PyUnicode_FromKindAndData via Cython, either with PyUnicode_1BYTE_KIND for short ASCII-only UUID strings (PyUnicode_FromString/PyUnicode_DecodeASCII/PyUnicode_DecodeLatin1/PyUnicode_DecodeUTF8 can be used here) or with PyUnicode_4BYTE_KIND to create substrings from the buffer created by PyUnicode_AsUCS4Copy (PyUnicode_Substring can be a little more efficient here):
  https://github.com/lemoncode21/fastapi-reactjs-loginpage/blob/c156bf187d061030e221553b7bc94978839ec994/backend/venv/Lib/site-packages/asyncpg/pgproto/uuid.pyx#L168-L170
  https://github.com/MIoCJluTeJllo/betwatchBot/blob/0c634eab1c47d6c421de2ab074d047658d2e362e/venv/Lib/site-packages/asyncpg/protocol/codecs/array.pyx#L644-L647

They usually use PyUnicode_FromKindAndData in only one or two places. PyUnicode_FromKindAndData can always be replaced with the Latin1/UTF16/UTF32 decoder (with the "surrogatepass" error handler for UTF16/UTF32). I think this is why it was not included in the limited C API. The examined projects do not need to use PyUnicode_FromKindAndData; there is always an alternative in the limited C API.
I do not see how PyUnicode_AsNativeFormat and PyUnicode_FromNativeFormat could help in the examined projects. They do not need conversion to/from the internal format, they need conversion to/from a specific format. To get the benefit of working with the internal representation, they would need to reorganize their code in a stringlib-like way, with a lot of macros and preprocessor directives (is that possible in Cython?). It has a high implementation and maintenance cost. And it will likely not work with PyUnicode_NATIVE_UTF8. And in some cases you need the UCS4 output buffer, because the stringlib-like way is bad for two-parameter specialization.
The Cython usage discussed above is an attempt at micro-optimizing unicode slicing. I think the need to handle UTF-8 as a possibility would mean we couldn't usefully use this as a replacement here (because of the variable character length). In this case I think PyUnicode_Substring is exactly the right thing to use with the level of visibility we have in the Limited API.

There are maybe a few places we could use PyUnicode_AsNativeFormat - possibly in optimizing things like __Pyx_PyUnicode_Equals. I'm not sure that one case is a hugely compelling use-case - the current fallback isn't awful. I think for a lot of uses the possibility of having to handle UTF8 would limit their usefulness because quick indexing isn't an option (although we could always special-case that to a fallback).

PyUnicode_FromNativeFormat would probably be more useful (maybe not for us to use directly, but as an API that users receiving text data from C might want to use).
> I think for a lot of uses the possibility of having to handle UTF8 would limit their usefulness because quick indexing isn't an option (although we could always special-case that to a fallback).
PyUnicode_NATIVE_UTF8 is not used internally by Python, so PyUnicode_AsNativeFormat() doesn't use it. I added it for PyPy, if PyPy would like to adopt this C API. Maybe I can remove it to avoid confusion.
> I added it for PyPy, if PyPy would like to adopt this C API. Maybe I can remove it to avoid confusion.
Is it OK for an implementation to have a native format that's not in the fixed list? And if so then should they set an exception? That suggests that to be truly compatible extension authors should be prepared to have a fallback non-optimized implementation for valid unicode objects without a native format?
The result of PyUnicode_AsNativeBuffer is only valid as long as you have a reference to the unicode object. IMO, if we add the function as proposed, it should have Borrow in the name, even though the thing that's borrowed isn't a Python object.

But, as I said, I think we can do better.
I don't think the limited API should avoid optimizations. We just need to support the API & ABI. We shouldn't guarantee that a certain function remains fast in the future (for that, users do need to benchmark and sometimes re-build), but if something is fast now and still supportable tomorrow, IMO we can add it.
Anyway, the issues I have with the proposed API are:

* […] PyUnicode_AsUTF8?).
* […] PyUnicode_AsNativeFormat (an exception), while the situation should be handled the same way as an unknown result format -- i.e. with a fallback.

What should the fallback be? Generally, if we can't do a zero-copy export (as PyUnicode_AsNativeFormat does), there are two next-best options:

* […]
* […]

I'll skip exploring the former here. (I did think about that part of the design space but didn't find much of interest: it led me to an API much like PyLong_AsNativeBytes, with an ominously inviting rabbit hole of array-export features like chunking that we should probably design for lists first.)
For the latter, we have a rather nice mechanism: PyBuffer.

We could provide a function like this:
```c
PyAPI_FUNC(int) PyUnicode_GetBuffer(
    PyObject *unicode,
    Py_buffer *buf,
    int buffer_flags,  // as for PyObject_GetBuffer
    int32_t supported_formats,
    int32_t *result_format);
```
with supported_formats being flags indicating what the caller can take, OR-ed together:
```c
#define PyUnicode_BUFFER_FORMAT_ASCII 0x01
#define PyUnicode_BUFFER_FORMAT_UCS1  0x02
#define PyUnicode_BUFFER_FORMAT_UCS2  0x04
#define PyUnicode_BUFFER_FORMAT_UCS4  0x08
#define PyUnicode_BUFFER_FORMAT_UTF8  0x10
```
On success, *result_format would be set to the format of the result, i.e. just one of the flags. (We can't use Py_buffer.format since ASCII, UCS1 and UTF8 all deal with single bytes.)
The new idea here (AFAIK, at least for stdlib) is that PyUnicode_Type wouldn't provide a bf_getbuffer (i.e. it wouldn't support PyObject_GetBuffer), since it needs additional info for the export. It would, however, support bf_releasebuffer.
The beauty of this design is that it can also do the zero-copy export (like PyUnicode_AsNativeFormat):

* If the caller supports the internal format, a pointer to the internal data is exposed in the Py_buffer. Note that Py_buffer holds a reference to the exporting string, so we don't get "borrowing" issues.
* Otherwise, PyUnicode_GetBuffer allocates memory, does the conversion, and sets a flag in PyBuffer.internal telling bf_releasebuffer to free the memory.

In other words, if you guess the internals correctly (or are prepared to handle enough alternative formats), you get zero-copy export. If not, your code still works correctly (as long as the string is exportable to a format you can handle). I can't think of a use case where you'd need to know which of those happened, or where you'd need to check up-front.
PyUnicode_GetBuffer adds some overhead over PyUnicode_AsNativeFormat:

* Py_buffer is a bigger struct to fill. If that is a real problem, we should resurrect Victor's idea of a leaner buffer struct (but it'd still need a pointer, size, owner, and an internal field).
* The caller needs to call PyBuffer_Release. (If we want to avoid "borrowing", we'll always need some variation of a drop function.)

If either of those are a performance issue, then you probably shouldn't be using the stable ABI ;)
If we do add the PyUnicode_BUFFER_* flags, then we should add a constructor for symmetry:
```c
PyAPI_FUNC(int) PyUnicode_FromBuffer(
    // (edit: removed mistaken argument)
    void *data,
    Py_ssize_t nbytes,
    int32_t data_format);
```
(It's not necessary to require the full Py_buffer struct here; its buf and len are easy enough to get if you have one.)
We've been exploring this kind of API with @davidhewitt. A rough draft/exploration branch is at https://github.com/davidhewitt/cpython/commits/stringbuffer/; currently on hiatus for real-life reasons.
> We could provide a function like this: ... with supported_formats being flags indicating what the caller can take
Consider me a big fan (probably unsurprising given the similarity to my all-API forward compatibility proposal). An API that always works, is as fast as you let it be, and makes clear that you will need a fallback is about as ideal as I think things can get.
I like the supported_formats idea! It solves the problem of PyPy which uses UTF-8 and can surprise C extensions not designed/prepared for that.
> Py_buffer is a bigger struct to fill. If that is a real problem, we should resurrect Victor's idea of a leaner buffer struct (but it'd still need a pointer, size, owner, and an internal field).
I abandoned my draft PEP PyResource for 2 reasons:
I dislike reusing the PyBuffer API since it's heavy and complicated. I don't see the purpose of buffer_flags, for example. PyBuffer supports non-contiguous memory, it supports memory shapes, and other complicated stuff. All Python implementations use contiguous memory for strings. If a Python implementation uses non-contiguous memory, it should create contiguous memory on demand.
I would prefer a more specialized ("simpler") API:
```c
PyAPI_FUNC(int) PyUnicode_AsFormat(
    PyObject *unicode,
    int32_t supported_formats,
    const void *data,
    Py_ssize_t *size,
    int32_t *result_format);

PyAPI_FUNC(PyObject*) PyUnicode_FreeFormat(
    PyObject *unicode,  // <= the interesting part
    const void *data,
    Py_ssize_t size,
    int32_t format);

PyAPI_FUNC(PyObject*) PyUnicode_FromFormat(
    const void *data,
    Py_ssize_t size,
    int32_t format);
```
PyUnicode_AsFormat() would do nothing (just return a pointer) in most cases. In this case, PyUnicode_FreeFormat() does nothing.

PyUnicode_AsFormat() allocates memory if the requested format is not supported. In this case, PyUnicode_FreeFormat() releases the memory. Passing unicode allows knowing whether the format requires releasing memory or not.
> PyBuffer supports non-contiguous memory, it supports memory shapes, and other complicated stuff. All Python implementations use contiguous memory for strings.
It does, but it also allows the callers to specify which features they can handle. (Hmm, I think I've seen that before!)
That's what buffer_flags does: if you use PyBUF_SIMPLE, complex features won't be used.
I like reusing API we already have, but I won't die on that hill.
@zooba @serhiy-storchaka: Do you prefer PyBuffer or pointer+size for these specific APIs?
I'm fine with either. Maybe +0 towards PyBuffer just so that the API doesn't necessarily have to be allocation-free (that is, the object being freed is different from the original one - otherwise, we'd have to track all additional allocations against the original object).
But it does feel a bit heavy for a case where the point is to look inside our internal data structures. I'd prefer that kind of operation to feel as rough as possible, to discourage people from doing it 😉
Sorry to be slow to reply here, I started writing a reply describing the investigation I did with @encukou and then they managed to finish writing first 😁
I am happy with either the buffer or specialized form of AsFormat for consumption from PyO3. We can make either API safe on the Rust side.
You are welcome to take my branch commit which @encukou and I built if it's useful. Currently lacks tests IIRC 🫣
I updated the PR to a new API:
```c
#define PyUnicode_FORMAT_ASCII 0x01
#define PyUnicode_FORMAT_UCS1  0x02
#define PyUnicode_FORMAT_UCS2  0x04
#define PyUnicode_FORMAT_UCS4  0x08
#define PyUnicode_FORMAT_UTF8  0x10

// Get the content of a string in the requested format:
// - Return the content, set '*size' and '*format' on success.
// - Set an exception and return NULL on error.
//
// The export must be released by PyUnicode_ReleaseExport().
PyAPI_FUNC(const void*) PyUnicode_Export(
    PyObject *unicode,
    uint32_t supported_formats,
    Py_ssize_t *size,
    uint32_t *format);

// Release an export created by PyUnicode_Export().
PyAPI_FUNC(void) PyUnicode_ReleaseExport(
    PyObject *unicode,
    const void* data,
    uint32_t format);

// Create a string object from a string in the format 'format'.
// - Return a reference to a new string object on success.
// - Set an exception and return NULL on error.
PyAPI_FUNC(PyObject*) PyUnicode_Import(
    const void *data,
    Py_ssize_t size,
    uint32_t format);
```
PyUnicode_ReleaseExport() only has to free memory if the string kind is not UCS4 and the requested format is UCS4.

It's possible to export any string to UCS4 and UTF-8. In the UTF-8 case, the encoded UTF-8 string is cached in the Unicode object, similar to PyUnicode_AsUTF8().

Currently, I didn't implement exporting UCS1 to UCS2; only exporting to UCS4 or UTF-8 is supported. Well, obviously, if the requested formats include ASCII and UCS1, the native format is used (no copy or conversion needed).
EDIT: I changed the format type to uint32_t.
> PyUnicode_ReleaseExport() only has to free memory if ...
Today, sure, but it protects us for the future. And allows other implementations to handle more cases.
I find "release export" a slightly uncomfortable name - did we have any other ideas? The API reminds me of LockBuffer and UnlockBuffer style APIs (or LockBits/UnlockBits, but I don't remember where that's from... bitmaps, perhaps?).

But I do prefer this naming to "native format", so consider me in favour, but happy to see a name change if someone comes up with a brilliant idea.
> did we have any other ideas?
I started with PyUnicode_FreeExport().
> Today, sure, but it protects us for the future. And allows other implementations to handle more cases.
Oh sure, I just described my current implementation.
@encukou @serhiy-storchaka @da-woods: What do you think of the updated API https://github.com/python/cpython/issues/119609#issuecomment-2160673752?
* I used unsigned int for the format type. It was proposed to use uint32_t instead, I don't have a strong argument about it.
* […] PyBuffer_Release().
* […] the PyUnicode_ReleaseExport() function name.

I won't have time for a full review this week.
> I used unsigned int for the format type. It was proposed to use uint32_t instead, I don't have a strong argument about it.

Please use uint32_t then.

C only guarantees 16 bits in an int, but major platforms make it 32 bits. With int, 16 bits are "wasted" rather than "reserved for future formats".
I won't have time to review it this week.
> I used unsigned int for the format type. It was proposed to use uint32_t instead, I don't have a strong argument about it.

Please use uint32_t, then.

int is commonly 32 bits, but only 16 are guaranteed. I'd like to see the extra 16 "reserved for future formats" rather than "wasted". And there are generic arguments for fixed-width types; but that's all in another discussion.
> Please use uint32_t then.
Ok, done.
@encukou:
> I won't have time for a full review this week.
Ping. Do you have time for a review this week? :-)
Hi, I sent my suggestions in a PR: https://github.com/vstinner/cpython/pull/3 Let me know what you think :)
> Let me know what you think :)
There is an issue with your PR, it has something like 150 commits.
Yes, I merged in the main branch to fix the conflicts. If you fix conflicts in your PR, I can rebase and just leave the last 6.
And of course I'd still prefer exporting PyBuffer, rather than making PyBuffer_Release take three arguments.
@encukou: I integrated most of your suggestions in my PR. Please see the updated PR.
I created Add PyUnicode_Export() and PyUnicode_Import() to the limited C API issue in the C API WG Decisions project.
Feature or enhancement
PEP 393 – Flexible String Representation changed the Unicode implementation in Python 3.3 to use 3 string "kinds":

* PyUnicode_1BYTE_KIND (UCS-1): ASCII and Latin1, [U+0000; U+00ff] range.
* PyUnicode_2BYTE_KIND (UCS-2): BMP, [U+0000; U+ffff] range.
* PyUnicode_4BYTE_KIND (UCS-4): full Unicode Character Set, [U+0000; U+10ffff] range.

Strings must always use the optimal storage: an ASCII string must be stored as PyUnicode_1BYTE_KIND.

Strings have a flag indicating if the string only contains ASCII characters ([U+0000; U+007f] range). It's used by multiple internal optimizations.
This implementation is not leaked in the limited C API. For example, the PyUnicode_FromKindAndData() function is excluded from the stable ABI. Said differently, it's not possible to write efficient code for PEP 393 using the limited C API.

I propose adding two functions:

* PyUnicode_AsNativeFormat(): export to the native format.
* PyUnicode_FromNativeFormat(): import from the native format.

These functions are added to the limited C API version 3.14.
Native formats (new constants):

* PyUnicode_NATIVE_ASCII: ASCII string.
* PyUnicode_NATIVE_UCS1: UCS-1 string.
* PyUnicode_NATIVE_UCS2: UCS-2 string.
* PyUnicode_NATIVE_UCS4: UCS-4 string.
* PyUnicode_NATIVE_UTF8: UTF-8 string (CPython implementation detail: only supported for import, not used by export).

Differences with PyUnicode_FromKindAndData():

* The PyUnicode_NATIVE_ASCII format allows further optimizations.
* PyUnicode_NATIVE_UTF8 can be used by PyPy and other Python implementations using UTF-8 as the internal storage.
API:
See the attached pull request for more details.
This feature was requested for porting the MarkupSafe C extension to the limited C API. Currently, each release requires producing around 60 wheel files, which takes 20 minutes to build: https://pypi.org/project/MarkupSafe/#files
Using the stable ABI would reduce the number of wheel packages and so ease their release process.
See src/markupsafe/_speedups.c: string functions specialized for the 3 string kinds (UCS-1, UCS-2, UCS-4).
Linked PRs