python / cpython

The Python programming language
https://www.python.org/

[C API] Add PyUnicode_Export() and PyUnicode_Import() functions #119609

Open vstinner opened 1 month ago

vstinner commented 1 month ago

Feature or enhancement

PEP 393 – Flexible String Representation changed the Unicode implementation in Python 3.3 to use 3 string "kinds": 1 byte per character (ASCII/latin1), 2 bytes per character (UCS2), and 4 bytes per character (UCS4):

Strings must always use the optimal storage: an ASCII string must be stored with the 1-byte kind (PyUnicode_1BYTE_KIND).

Strings have a flag indicating whether the string only contains ASCII characters, i.e. the [U+0000; U+007F] range. It is used by multiple internal optimizations.

This implementation is not leaked in the limited C API. For example, the PyUnicode_FromKindAndData() function is excluded from the stable ABI. Said differently, it's not possible to write efficient code for PEP 393 using the limited C API.
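As a sketch of that rule (illustrative C only, not CPython code): the kind is derived from the widest code point the string contains, so the storage is always the narrowest that fits.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch, not CPython code: PEP 393 stores each string
 * with the narrowest element width that fits its widest code point. */
static int bytes_per_char(uint32_t max_char)
{
    if (max_char < 0x100)
        return 1;   /* PyUnicode_1BYTE_KIND: ASCII and latin1 */
    if (max_char < 0x10000)
        return 2;   /* PyUnicode_2BYTE_KIND: UCS2 */
    return 4;       /* PyUnicode_4BYTE_KIND: UCS4 */
}
```

Because the width depends on the data, C code that wants to touch the raw buffer has to be prepared for all three layouts.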


I propose adding two functions:

These functions are added to the limited C API version 3.14.

Native formats (new constants):

Differences with PyUnicode_FromKindAndData():

PyUnicode_NATIVE_ASCII format allows further optimizations.

PyUnicode_NATIVE_UTF8 can be used by PyPy and other Python implementations that use UTF-8 as their internal storage.


API:

#define PyUnicode_NATIVE_ASCII 1
#define PyUnicode_NATIVE_UCS1 2
#define PyUnicode_NATIVE_UCS2 3
#define PyUnicode_NATIVE_UCS4 4
#define PyUnicode_NATIVE_UTF8 5

// Get the content of a string in its native format.
// - Return the content, set '*size' and '*native_format' on success.
// - Set an exception and return NULL on error.
PyAPI_FUNC(const void*) PyUnicode_AsNativeFormat(
    PyObject *unicode,
    Py_ssize_t *size,
    int *native_format);

// Create a string object from a native format string.
// - Return a reference to a new string object on success.
// - Set an exception and return NULL on error.
PyAPI_FUNC(PyObject*) PyUnicode_FromNativeFormat(
    const void *data,
    Py_ssize_t size,
    int native_format);

See the attached pull request for more details.


This feature was requested for porting the MarkupSafe C extension to the limited C API. Currently, each release requires producing around 60 wheel files, which take 20 minutes to build: https://pypi.org/project/MarkupSafe/#files

Using the stable ABI would reduce the number of wheel packages and so ease their release process.

See src/markupsafe/_speedups.c: string functions specialized for the 3 string kinds (UCS-1, UCS-2, UCS-4).
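For a feel of what that specialization looks like, here is a simplified sketch of the UCS1 case (illustrative only, not the actual _speedups.c code); the real extension repeats such a loop for each of the three kinds:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified sketch in the spirit of MarkupSafe's UCS1-specialized
 * escape loop (not the actual code): scan 1-byte characters directly
 * and expand the five HTML-special characters. */
static char *escape_ucs1(const unsigned char *s, size_t n)
{
    char *out = malloc(5 * n + 1);  /* worst case: 5 bytes per char */
    if (out == NULL)
        return NULL;
    char *p = out;
    for (size_t i = 0; i < n; i++) {
        switch (s[i]) {
        case '&':  memcpy(p, "&amp;", 5); p += 5; break;
        case '<':  memcpy(p, "&lt;", 4);  p += 4; break;
        case '>':  memcpy(p, "&gt;", 4);  p += 4; break;
        case '"':  memcpy(p, "&#34;", 5); p += 5; break;
        case '\'': memcpy(p, "&#39;", 5); p += 5; break;
        default:   *p++ = (char)s[i];
        }
    }
    *p = '\0';
    return out;
}
```

The whole point of the specialization is that this loop reads the buffer directly, with no per-character API call.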

Linked PRs

vstinner commented 1 month ago

cc @serhiy-storchaka @encukou

vstinner commented 1 month ago

cc @davidism

encukou commented 1 month ago

I believe we can do better. I'll write a longer reply this week.

zooba commented 1 month ago

I believe we can do better.

My gut feeling as well. I don't think we want to encourage authors to specialise their own code when using the limited API. An efficient, abstraction-agnostic, multiple find-replace should be our job (if it's an important scenario, which I'm inclined to think it probably is).

vstinner commented 1 month ago

My gut feeling as well. I don't think we want to encourage authors to specialise their own code when using the limited API.

C extensions already specialize their code for the Unicode native format (ex: MarkupSafe). The problem is that they cannot update their code to the limited API because of missing features to export to/import from native formats.

An efficient, abstraction-agnostic, multiple find-replace should be our job (if it's an important scenario, which I'm inclined to think it probably is).

That sounds like a very specific API solving only one use case.

zooba commented 1 month ago

C extensions already specialize their code for the Unicode native format

Which means we forced them into it, not that they wanted to. Just because we did something not great for them in the past doesn't mean we have to double down on it in the future.

That sounds like a very specific API solving only one use case.

The same could be said about your proposed API. Do you have more use cases?

I'm not going to argue heavily about this case until Petr provides his writeup. I haven't given it a huge amount of thought, just felt it was worth adding a +1 to "this doesn't feel great".

vstinner commented 1 month ago

The same could be said about your proposed API.

What do you mean? I don't force users to use this API. It's an API which can be used to specialize code for UCS1, UCS2 and UCS4: it's the only way to write the most efficient code for the current implementation of Python.

Do you have more use cases?

Python itself has a wide library to specialize code for UCS1 / UCS2 / UCS4 in Objects/stringlib/:

Objects/stringlib/asciilib.h
Objects/stringlib/codecs.h
Objects/stringlib/count.h
Objects/stringlib/ctype.h
Objects/stringlib/eq.h
Objects/stringlib/fastsearch.h
Objects/stringlib/find.h
Objects/stringlib/find_max_char.h
Objects/stringlib/join.h
Objects/stringlib/localeutil.h
Objects/stringlib/partition.h
Objects/stringlib/replace.h
Objects/stringlib/split.h
Objects/stringlib/stringdefs.h
Objects/stringlib/transmogrify.h
Objects/stringlib/ucs1lib.h
Objects/stringlib/ucs2lib.h
Objects/stringlib/ucs4lib.h
Objects/stringlib/undef.h
Objects/stringlib/unicode_format.h

There are different operations which are already optimized. The problem is that the PyUnicode_WRITE_CHAR() macro is implemented with two tests, which is inefficient. In most cases, PyUnicode_WRITE_CHAR() is good enough, but not when you want the optimal code, as in MarkupSafe, which has a single function optimized in C.
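The overhead in question can be sketched as follows (hypothetical helper names, not the real macro): the generic writer dispatches on the element width for every character, while a kind-specialized loop pays no per-character dispatch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the cost being discussed, not the real
 * macro: the generic writer branches on 'kind' for every character. */
static void write_char_generic(void *data, int kind, size_t i, uint32_t ch)
{
    switch (kind) {
    case 1:  ((uint8_t *)data)[i]  = (uint8_t)ch;  break;
    case 2:  ((uint16_t *)data)[i] = (uint16_t)ch; break;
    default: ((uint32_t *)data)[i] = ch;           break;
    }
}

/* A UCS1-specialized loop does the same work with no dispatch. */
static void fill_ucs1(uint8_t *data, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        data[i] = (uint8_t)src[i];
}
```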

There are 27 projects using PyUnicode_FromKindAndData() according to a code search in PyPI top 7,500 projects (2024-03-16):


There are already other, less efficient, ways to export a string, depending on its maximum character. But some of these functions are excluded from the limited C API.

The proposed PyUnicode_AsNativeFormat() has O(1) complexity, which is an important property, whereas these functions have at least O(n) complexity, if not worse.

Note: I don't understand why there is no PyUnicode_FromUCS4() function, so it's not possible to re-import a UCS4 string with a dedicated function. Importing UCS4 can be done using PyUnicode_FromKindAndData().

zooba commented 1 month ago

it's the only way to write the most efficient code for the current implementation of Python

"Most efficient code" is not the job of the limited API. At a certain level of exposure to internals, you need to stop using the limited API, or else accept that your performance will be reduced.

If the AsUTF8/FromUTF8 functions are not fast enough for markupsafe's multi-replace function, then as I said, I'd entertain a multi-replace function that doesn't leak implementation details, because that suits the limited API. I'd also consider some kind of builder API, which I believe you also have a proposal for.

But this proposal is literally about leaking implementation details. That is against the intent of the limited API, and so I am against the proposal.

Thanks for making me think about it, you've upgraded me from "unsure" to "sure" ;) Still looking forward to Petr's thoughts.

vstinner commented 1 month ago

@davidhewitt: Would Rust benefit from such "native format" API? Or does Rust just prefer UTF-8 and then decode UTF-8 in Python?

@da-woods: Would Cython benefit from such "native format" API? I see that Cython uses the following code in __Pyx_PyUnicode_Substring():

#if CYTHON_COMPILING_IN_LIMITED_API
    // PyUnicode_Substring() does not support negative indexing but is otherwise fine to use.
    return PyUnicode_Substring(text, start, stop);
#else
    return PyUnicode_FromKindAndData(PyUnicode_KIND(text),
        PyUnicode_1BYTE_DATA(text) + start*PyUnicode_KIND(text), stop-start);
#endif
zooba commented 1 month ago

At the very least, please rename it to "InternalFormat" instead of "NativeFormat".

At first I thought this was going to be a great API for following the conventions of the native machine, since the name led me incorrectly. It took me a few reads to figure out that it was telling me the format, not that I was getting to choose it.

serhiy-storchaka commented 1 month ago

I looked how some of these projects use PyUnicode_FromKindAndData.

They usually use PyUnicode_FromKindAndData in only one or two places. PyUnicode_FromKindAndData can always be replaced with a Latin1/UTF16/UTF32 decoder (with the "surrogatepass" error handler for UTF16/UTF32). I think this is why it was not included in the limited C API. The examined projects do not need to use PyUnicode_FromKindAndData; there is always an alternative in the limited C API.

I do not see how PyUnicode_AsNativeFormat and PyUnicode_FromNativeFormat could help in the examined projects. They do not need conversion to/from the internal format, they need conversion to/from a specific format. To get the benefit of working with the internal representation, they would need to reorganize their code in a stringlib-like way, with a lot of macros and preprocessor directives (is that possible in Cython?). That has a high implementation and maintenance cost, it likely will not work with PyUnicode_NATIVE_UTF8, and in some cases you need a UCS4 output buffer, because the stringlib-like approach is bad for two-parameter specialization.

da-woods commented 1 month ago

The Cython usage discussed above is an attempt at micro-optimizing slicing unicode. I think the need to handle UTF-8 as a possibility would mean we couldn't usefully use this as a replacement here (because of the variable character length). In this case I think PyUnicode_Substring is exactly the right thing to use with the level of visibility we have in the Limited API.
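The indexing problem can be made concrete: with a fixed-width kind the i-th character is an array index, but with UTF-8 the byte offset of the i-th code point requires a linear scan (illustrative C, hypothetical helper name):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch: finding the i-th code point in UTF-8 is O(n),
 * because continuation bytes (0b10xxxxxx) must be skipped; with
 * UCS1/UCS2/UCS4 it is a constant-time array index. */
static size_t utf8_offset_of_codepoint(const uint8_t *s, size_t nbytes,
                                       size_t target)
{
    size_t cp = 0;
    for (size_t i = 0; i < nbytes; i++) {
        if ((s[i] & 0xC0) != 0x80) {  /* lead byte: new code point */
            if (cp == target)
                return i;             /* byte offset of code point */
            cp++;
        }
    }
    return nbytes;                    /* past the end */
}
```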

There's maybe a few places we could use PyUnicode_AsNativeFormat - possibly in optimizing things like __Pyx_PyUnicode_Equals. I'm not sure that one case is a hugely compelling use-case - the current fallback isn't awful. I think for a lot of uses the possibility of having to handle UTF8 would limit their usefulness because quick indexing isn't an option (although we could always special-case that to a fallback).

PyUnicode_FromNativeFormat would probably be more useful (maybe not for us to use directly, but as an API that users receiving text data from C might want to use).

vstinner commented 1 month ago

I think for a lot of uses the possibility of having to handle UTF8 would limit their usefulness because quick indexing isn't an option (although we could always special-case that to a fallback).

PyUnicode_NATIVE_UTF8 is not used internally by Python, so PyUnicode_AsNativeFormat() doesn't use it. I added it for PyPy, if PyPy would like to adopt this C API. Maybe I can remove it to avoid confusion.

da-woods commented 1 month ago

I added it for PyPy, if PyPy would like to adopt this C API. Maybe I can remove it to avoid confusion.

Is it OK for an implementation to have a native format that's not in the fixed list? And if so, should it set an exception? That suggests that, to be truly compatible, extension authors should be prepared to have a fallback non-optimized implementation for valid unicode objects without a recognized native format.

encukou commented 1 month ago

The result of PyUnicode_AsNativeFormat is only valid as long as you have a reference to the unicode object. IMO, if we add the function as proposed, it should have Borrow in the name, even though the thing that's borrowed isn't a Python object. But, as I said, I think we can do better.


I don't think the limited API should avoid optimizations. We just need to support the API & ABI. We shouldn't guarantee that a certain function remains fast in the future (for that, users do need to benchmark and sometimes re-build), but if something is fast now and still supportable tomorrow, IMO we can add it.

Anyway, the issue I have with the proposed API is:

What should the fallback be? Generally, if we can't do a zero-copy export (as PyUnicode_AsNativeFormat does), there are two next-best options:

I'll skip exploring the former here. (I did think about that part of the design space but didn't find much of interest: it led me to an API much like PyLong_AsNativeBytes, with an ominously inviting rabbit hole of array-export features like chunking that we should probably design for lists first.)

For the latter, we have a rather nice mechanism: PyBuffer. We could provide a function like this:

PyAPI_FUNC(int) PyUnicode_GetBuffer(
    PyObject *unicode,
    Py_buffer *buf,
    int buffer_flags,  // as for PyObject_GetBuffer
    int32_t supported_formats,
    int32_t *result_format);

with supported_formats being flags indicating what the caller can take, OR-ed together:

#define PyUnicode_BUFFER_FORMAT_ASCII 0x01
#define PyUnicode_BUFFER_FORMAT_UCS1 0x02
#define PyUnicode_BUFFER_FORMAT_UCS2 0x04
#define PyUnicode_BUFFER_FORMAT_UCS4 0x08
#define PyUnicode_BUFFER_FORMAT_UTF8 0x10

On success, *result_format would be set to the format of the result, i.e. just one of the flags. (We can't use Py_buffer.format since ASCII, UCS1 and UTF8 all deal with single bytes.)

The new idea here (AFAIK, at least for the stdlib) is that PyUnicode_Type wouldn't provide a bf_getbuffer (i.e. it wouldn't support PyObject_GetBuffer), since it needs additional info for the export. It would, however, support bf_releasebuffer.

The beauty of this design is that it can also do the zero-copy export (like PyUnicode_AsNativeFormat):

In other words, if you guess the internals correctly (or are prepared to handle enough alternative formats), you get a zero-copy export. If not, your code still works correctly (as long as the string is exportable to a format you can handle). I can't think of a use case where you'd need to know which of those happened, or where you'd need to check up-front.
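The negotiation can be sketched as a pure function over the flag masks (constants from the proposal; the selection order and the simplifications are my assumptions, e.g. a real implementation would also hand out ASCII data zero-copy to a caller that accepts UCS1 or UTF8):

```c
#include <assert.h>
#include <stdint.h>

#define PyUnicode_BUFFER_FORMAT_ASCII 0x01
#define PyUnicode_BUFFER_FORMAT_UCS1  0x02
#define PyUnicode_BUFFER_FORMAT_UCS2  0x04
#define PyUnicode_BUFFER_FORMAT_UCS4  0x08
#define PyUnicode_BUFFER_FORMAT_UTF8  0x10

/* Simplified sketch of the negotiation (selection order is an
 * assumption): prefer a zero-copy export of the internal format,
 * otherwise fall back to a conversion the caller accepts.
 * Returns 0 when no acceptable format exists. */
static int32_t pick_export_format(int32_t internal, int32_t supported)
{
    if (internal & supported)
        return internal;                      /* zero-copy */
    if (supported & PyUnicode_BUFFER_FORMAT_UCS4)
        return PyUnicode_BUFFER_FORMAT_UCS4;  /* widening always works */
    if (supported & PyUnicode_BUFFER_FORMAT_UTF8)
        return PyUnicode_BUFFER_FORMAT_UTF8;  /* encoding always works */
    return 0;                                 /* caller must raise */
}
```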


PyUnicode_GetBuffer adds some overhead over PyUnicode_AsNativeFormat:

If either of those are a performance issue, then you probably shouldn't be using the stable ABI ;)


If we do add the PyUnicode_BUFFER_* flags, then we should add a constructor for symmetry:

PyAPI_FUNC(PyObject*) PyUnicode_FromBuffer(
                       // (edit: removed mistaken argument)
    void *data,
    Py_ssize_t nbytes,
    int32_t data_format);

(It's not necessary to require the full Py_buffer struct here; its buf and len are easy enough to get if you have one.)


We've been exploring this kind of API with @davidhewitt. A rough draft/exploration branch is at https://github.com/davidhewitt/cpython/commits/stringbuffer/; it's currently on hiatus for real-life reasons.

zooba commented 1 month ago

We could provide a function like this: ... with supported_formats being flags indicating what the caller can take

Consider me a big fan (probably unsurprising given the similarity to my all-API forward compatibility proposal). An API that always works, is as fast as you let it be, and makes clear that you will need a fallback is about as ideal as I think things can get.

vstinner commented 1 month ago

I like the supported_formats idea! It solves the problem of PyPy, which uses UTF-8 internally and can surprise C extensions not designed/prepared for that.

Py_buffer is a bigger struct to fill. If that is a real problem, we should resurrect Victor's idea of a leaner buffer struct (but it'd still need a pointer, size, owner, and an internal field).

I abandoned my draft PEP PyResource for 2 reasons:

I dislike reusing the PyBuffer API since it's heavy and complicated. I don't see the purpose of buffer_flags, for example. PyBuffer supports non-contiguous memory, memory shapes, and other complicated features. All Python implementations use contiguous memory for strings; if a Python implementation uses non-contiguous memory, it should create a contiguous copy on demand.

I would prefer a more specialized ("simpler") API:

PyAPI_FUNC(const void*) PyUnicode_AsFormat(
    PyObject *unicode,
    int32_t supported_formats,
    Py_ssize_t *size,
    int32_t *result_format);

PyAPI_FUNC(void) PyUnicode_FreeFormat(
    PyObject *unicode, // <= the interesting part
    const void *data,
    Py_ssize_t size,
    int32_t format);

PyAPI_FUNC(PyObject*) PyUnicode_FromFormat(
    const void *data,
    Py_ssize_t size,
    int32_t format);

PyUnicode_AsFormat() would do nothing (just return a pointer) in most cases. In that case, PyUnicode_FreeFormat() does nothing.

PyUnicode_AsFormat() allocates memory if the native format is not among the supported formats. In that case, PyUnicode_FreeFormat() releases the memory. Passing unicode makes it possible to know whether the format requires releasing memory.
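One way to picture that contract (a hypothetical struct of mine, not part of the proposal, which instead infers ownership from the unicode object and the format): the export records whether a conversion allocated memory, so releasing a zero-copy export is a no-op.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sketch, not the proposed API: record at export time
 * whether memory was allocated, so that "free" is a no-op for a
 * zero-copy export.  The proposal infers this from the unicode
 * object and the format instead of storing a flag. */
typedef struct {
    const void *data;
    int owned;   /* 1 if 'data' was allocated for a conversion */
} export_view;

static void release_export(export_view *v)
{
    if (v->owned)
        free((void *)v->data);
    v->data = NULL;
    v->owned = 0;
}
```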

encukou commented 1 month ago

PyBuffer supports non-contiguous memory, it supports memory shapes, and other complicated stuff. All Python implementations use contiguous memory for strings.

It does, but it also allows the callers to specify which features they can handle. (Hmm, I think I've seen that before!) That's what buffer_flags does: if you use PyBUF_SIMPLE, complex features won't be used.

I like reusing API we already have, but I won't die on that hill.

vstinner commented 1 month ago

@zooba @serhiy-storchaka: Do you prefer PyBuffer or pointer+size for these specific APIs?

zooba commented 1 month ago

I'm fine with either. Maybe +0 towards PyBuffer just so that the API doesn't necessarily have to be allocation-free (that is, the object being freed is different from the original one - otherwise, we'd have to track all additional allocations against the original object).

But it does feel a bit heavy for a case where the point is to look inside our internal data structures. I'd prefer that kind of operation to feel as rough as possible, to discourage people from doing it 😉

davidhewitt commented 1 month ago

Sorry to be slow to reply here, I started writing a reply describing the investigation I did with @encukou and then they managed to finish writing first 😁

I am happy with either the buffer or specialized form of AsFormat for consumption from PyO3. We can make either API safe on the Rust side.

You are welcome to take my branch commit which @encukou and I built if it's useful. Currently lacks tests IIRC 🫣

vstinner commented 4 weeks ago

I updated the PR to a new API:

#define PyUnicode_FORMAT_ASCII 0x01
#define PyUnicode_FORMAT_UCS1 0x02
#define PyUnicode_FORMAT_UCS2 0x04
#define PyUnicode_FORMAT_UCS4 0x08
#define PyUnicode_FORMAT_UTF8 0x10

// Get the content of a string in the requested format:
// - Return the content, set '*size' and '*format' on success.
// - Set an exception and return NULL on error.
//
// The export must be released by PyUnicode_ReleaseExport().
PyAPI_FUNC(const void*) PyUnicode_Export(
    PyObject *unicode,
    uint32_t supported_formats,
    Py_ssize_t *size,
    uint32_t *format);

// Release an export created by PyUnicode_Export().
PyAPI_FUNC(void) PyUnicode_ReleaseExport(
    PyObject *unicode,
    const void* data,
    uint32_t format);

// Create a string object from a string in the format 'format'.
// - Return a reference to a new string object on success.
// - Set an exception and return NULL on error.
PyAPI_FUNC(PyObject*) PyUnicode_Import(
    const void *data,
    Py_ssize_t size,
    uint32_t format);

PyUnicode_ReleaseExport() only has to free memory if the string kind is not UCS4 and the requested format is UCS4.

It's possible to export any string to UCS4 and UTF-8. In the UTF-8 case, the encoded UTF-8 string is cached in the Unicode object, similar to PyUnicode_AsUTF8().

Currently, I have not implemented exporting UCS1 as UCS2; only conversion to UCS4 or UTF-8 is supported. Obviously, if the requested formats include ASCII and UCS1, the native format is used directly (no copy or conversion needed).

EDIT: I changed the format type to uint32_t.
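The UCS4 fallback amounts to a widening copy; a minimal sketch for the UCS1 case (illustrative, hypothetical helper name):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative sketch of why export to UCS4 always succeeds: any
 * narrower kind can be widened with a plain copy (UCS1 case shown).
 * This allocated buffer is what the release function must free. */
static uint32_t *widen_ucs1_to_ucs4(const uint8_t *s, size_t n)
{
    uint32_t *out = malloc(n * sizeof(uint32_t));
    if (out == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        out[i] = s[i];   /* each code point fits unchanged */
    return out;
}
```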

zooba commented 4 weeks ago

PyUnicode_ReleaseExport() only has to free memory if ...

Today, sure, but it protects us for the future. And allows other implementations to handle more cases.

I find "release export" a slightly uncomfortable name - did we have any other ideas? The API reminds me of LockBuffer and UnlockBuffer style APIs (or LockBits/UnlockBits, but I don't remember where that's from... bitmaps, perhaps?).

But I do prefer this naming to "native format", so consider me in favour, but happy to see a name change if someone comes up with a brilliant idea.

vstinner commented 3 weeks ago

did we have any other ideas?

I started with PyUnicode_FreeExport().

vstinner commented 3 weeks ago

Today, sure, but it protects us for the future. And allows other implementations to handle more cases.

Oh sure, I just described my current implementation.

vstinner commented 3 weeks ago

@encukou @serhiy-storchaka @da-woods: What do you think of the updated API https://github.com/python/cpython/issues/119609#issuecomment-2160673752?

encukou commented 3 weeks ago

I won't have time for a full review this week.

I used unsigned int for the format type. It was proposed to use uint32_t instead, I don't have a strong argument about it.

Please use uint32_t then. C only guarantees 16 bits in an int, but major platforms make it 32 bits. With int, 16 bits are “wasted” rather than “reserved for future formats”.

vstinner commented 3 weeks ago

Please use uint32_t then.

Ok, done.

vstinner commented 2 weeks ago

@encukou:

I won't have time for a full review this week.

Ping. Do you have time for a review this week? :-)

encukou commented 2 weeks ago

Hi, I sent my suggestions in a PR: https://github.com/vstinner/cpython/pull/3 Let me know what you think :)

vstinner commented 2 weeks ago

Let me know what you think :)

There is an issue with your PR: it has something like 150 commits.

encukou commented 2 weeks ago

Yes, I merged in the main branch to fix the conflicts. If you fix the conflicts in your PR, I can rebase and just leave the last 6 commits.

And of course I'd still prefer exporting PyBuffer, rather than making PyBuffer_Release take three arguments.

vstinner commented 2 weeks ago

@encukou: I integrated most of your suggestions in my PR. Please see the updated PR.

vstinner commented 2 weeks ago

I created Add PyUnicode_Export() and PyUnicode_Import() to the limited C API issue in the C API WG Decisions project.