python / cpython

The Python programming language
https://www.python.org/

[C API] Add PyUnicode_Export() and PyUnicode_Import() functions #119609

Open vstinner opened 1 month ago

vstinner commented 1 month ago

Feature or enhancement

PEP 393 – Flexible String Representation changed the Unicode implementation in Python 3.3 to use 3 string "kinds": 1 byte per character (ASCII/latin1), 2 bytes per character (UCS2), and 4 bytes per character (UCS4):

Strings must always use the optimal storage: an ASCII string must be stored with the 1-byte kind (PyUnicode_1BYTE_KIND).

Strings have a flag indicating whether the string only contains ASCII characters, i.e. the [U+0000; U+007F] range. It is used by multiple internal optimizations.

This implementation is not leaked in the limited C API. For example, the PyUnicode_FromKindAndData() function is excluded from the stable ABI. Said differently, it's not possible to write efficient code for PEP 393 using the limited C API.
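As a sketch of that rule (illustrative C only, not CPython code): the kind is derived from the widest code point the string contains, so the storage is always the narrowest that fits.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch, not CPython code: PEP 393 stores each string
 * with the narrowest element width that fits its widest code point. */
static int bytes_per_char(uint32_t max_char)
{
    if (max_char < 0x100)
        return 1;   /* PyUnicode_1BYTE_KIND: ASCII and latin1 */
    if (max_char < 0x10000)
        return 2;   /* PyUnicode_2BYTE_KIND: UCS2 */
    return 4;       /* PyUnicode_4BYTE_KIND: UCS4 */
}
```

Because the width depends on the data, C code that wants to touch the raw buffer has to be prepared for all three layouts.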


I propose adding two functions:

These functions are added to the limited C API version 3.14.

Native formats (new constants):

Differences with PyUnicode_FromKindAndData():

PyUnicode_NATIVE_ASCII format allows further optimizations.

PyUnicode_NATIVE_UTF8 can be used by PyPy and other Python implementations that use UTF-8 as their internal storage.


API:

#define PyUnicode_NATIVE_ASCII 1
#define PyUnicode_NATIVE_UCS1 2
#define PyUnicode_NATIVE_UCS2 3
#define PyUnicode_NATIVE_UCS4 4
#define PyUnicode_NATIVE_UTF8 5

// Get the content of a string in its native format.
// - Return the content, set '*size' and '*native_format' on success.
// - Set an exception and return NULL on error.
PyAPI_FUNC(const void*) PyUnicode_AsNativeFormat(
    PyObject *unicode,
    Py_ssize_t *size,
    int *native_format);

// Create a string object from a native format string.
// - Return a reference to a new string object on success.
// - Set an exception and return NULL on error.
PyAPI_FUNC(PyObject*) PyUnicode_FromNativeFormat(
    const void *data,
    Py_ssize_t size,
    int native_format);

See the attached pull request for more details.


This feature was requested for porting the MarkupSafe C extension to the limited C API. Currently, each release requires producing around 60 wheel files, which take 20 minutes to build: https://pypi.org/project/MarkupSafe/#files

Using the stable ABI would reduce the number of wheel packages and so ease their release process.

See src/markupsafe/_speedups.c: string functions specialized for the 3 string kinds (UCS-1, UCS-2, UCS-4).
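For a feel of what that specialization looks like, here is a simplified sketch of the UCS1 case (illustrative only, not the actual _speedups.c code); the real extension repeats such a loop for each of the three kinds:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified sketch in the spirit of MarkupSafe's UCS1-specialized
 * escape loop (not the actual code): scan 1-byte characters directly
 * and expand the five HTML-special characters. */
static char *escape_ucs1(const unsigned char *s, size_t n)
{
    char *out = malloc(5 * n + 1);  /* worst case: 5 bytes per char */
    if (out == NULL)
        return NULL;
    char *p = out;
    for (size_t i = 0; i < n; i++) {
        switch (s[i]) {
        case '&':  memcpy(p, "&amp;", 5); p += 5; break;
        case '<':  memcpy(p, "&lt;", 4);  p += 4; break;
        case '>':  memcpy(p, "&gt;", 4);  p += 4; break;
        case '"':  memcpy(p, "&#34;", 5); p += 5; break;
        case '\'': memcpy(p, "&#39;", 5); p += 5; break;
        default:   *p++ = (char)s[i];
        }
    }
    *p = '\0';
    return out;
}
```

The whole point of the specialization is that this loop reads the buffer directly, with no per-character API call.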

Linked PRs

vstinner commented 1 month ago

cc @serhiy-storchaka @encukou

vstinner commented 1 month ago

cc @davidism

encukou commented 1 month ago

I believe we can do better. I'll write a longer reply this week.

zooba commented 1 month ago

I believe we can do better.

My gut feeling as well. I don't think we want to encourage authors to specialise their own code when using the limited API. An efficient, abstraction-agnostic, multiple find-replace should be our job (if it's an important scenario, which I'm inclined to think it probably is).

vstinner commented 1 month ago

My gut feeling as well. I don't think we want to encourage authors to specialise their own code when using the limited API.

C extensions already specialize their code for the Unicode native format (ex: MarkupSafe). The problem is that they cannot update their code to the limited API because of missing features to export to/import from native formats.

An efficient, abstraction-agnostic, multiple find-replace should be our job (if it's an important scenario, which I'm inclined to think it probably is).

That sounds like a very specific API solving only one use case.

zooba commented 1 month ago

C extensions already specialize their code for the Unicode native format

Which means we forced them into it, not that they wanted to. Just because we did something not great for them in the past doesn't mean we have to double down on it in the future.

That sounds like a very specific API solving only one use case.

The same could be said about your proposed API. Do you have more use cases?

I'm not going to argue heavily about this case until Petr provides his writeup. I haven't given it a huge amount of thought, just felt it was worth adding a +1 to "this doesn't feel great".

vstinner commented 1 month ago

The same could be said about your proposed API.

What do you mean? I don't force users to use this API. It's an API which can be used to specialize code for UCS1, UCS2 and UCS4: it's the only way to write the most efficient code for the current implementation of Python.

Do you have more use cases?

Python itself has a wide library to specialize code for UCS1 / UCS2 / UCS4 in Objects/stringlib/:

Objects/stringlib/asciilib.h
Objects/stringlib/codecs.h
Objects/stringlib/count.h
Objects/stringlib/ctype.h
Objects/stringlib/eq.h
Objects/stringlib/fastsearch.h
Objects/stringlib/find.h
Objects/stringlib/find_max_char.h
Objects/stringlib/join.h
Objects/stringlib/localeutil.h
Objects/stringlib/partition.h
Objects/stringlib/replace.h
Objects/stringlib/split.h
Objects/stringlib/stringdefs.h
Objects/stringlib/transmogrify.h
Objects/stringlib/ucs1lib.h
Objects/stringlib/ucs2lib.h
Objects/stringlib/ucs4lib.h
Objects/stringlib/undef.h
Objects/stringlib/unicode_format.h

There are different operations which are already optimized. The problem is that the PyUnicode_WRITE_CHAR() macro is implemented with two tests, which is inefficient. In most cases, PyUnicode_WRITE_CHAR() is good enough, but not when you want the optimal code, as in MarkupSafe, which has a single function optimized in C.
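The overhead in question can be sketched as follows (hypothetical helper names, not the real macro): the generic writer dispatches on the element width for every character, while a kind-specialized loop pays no per-character dispatch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the cost being discussed, not the real
 * macro: the generic writer branches on 'kind' for every character. */
static void write_char_generic(void *data, int kind, size_t i, uint32_t ch)
{
    switch (kind) {
    case 1:  ((uint8_t *)data)[i]  = (uint8_t)ch;  break;
    case 2:  ((uint16_t *)data)[i] = (uint16_t)ch; break;
    default: ((uint32_t *)data)[i] = ch;           break;
    }
}

/* A UCS1-specialized loop does the same work with no dispatch. */
static void fill_ucs1(uint8_t *data, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        data[i] = (uint8_t)src[i];
}
```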

There are 27 projects using PyUnicode_FromKindAndData() according to a code search in PyPI top 7,500 projects (2024-03-16):


There are already other, less efficient, ways to export a string, depending on its maximum character. But some of these functions are excluded from the limited C API.

The proposed PyUnicode_AsNativeFormat() has O(1) complexity, which is an important property, whereas these functions have at least O(n) complexity, if not worse.

Note: I don't understand why there is no PyUnicode_FromUCS4() function, so it's not possible to re-import a UCS4 string with a dedicated function. Importing UCS4 can be done using PyUnicode_FromKindAndData().

zooba commented 1 month ago

it's the only way to write the most efficient code for the current implementation of Python

"Most efficient code" is not the job of the limited API. At a certain level of exposure to internals, you need to stop using the limited API, or else accept that your performance will be reduced.

If the AsUTF8/FromUTF8 functions are not fast enough for markupsafe's multi-replace function, then as I said, I'd entertain a multi-replace function that doesn't leak implementation details, because that suits the limited API. I'd also consider some kind of builder API, which I believe you also have a proposal for.

But this proposal is literally about leaking implementation details. That is against the intent of the limited API, and so I am against the proposal.

Thanks for making me think about it, you've upgraded me from "unsure" to "sure" ;) Still looking forward to Petr's thoughts.

vstinner commented 1 month ago

@davidhewitt: Would Rust benefit from such "native format" API? Or does Rust just prefer UTF-8 and then decode UTF-8 in Python?

@da-woods: Would Cython benefit from such "native format" API? I see that Cython uses the following code in __Pyx_PyUnicode_Substring():

#if CYTHON_COMPILING_IN_LIMITED_API
    // PyUnicode_Substring() does not support negative indexing but is otherwise fine to use.
    return PyUnicode_Substring(text, start, stop);
#else
    return PyUnicode_FromKindAndData(PyUnicode_KIND(text),
        PyUnicode_1BYTE_DATA(text) + start*PyUnicode_KIND(text), stop-start);
#endif
zooba commented 1 month ago

At the very least, please rename it to "InternalFormat" instead of "NativeFormat".

At first I thought this was going to be a great API for following the conventions of the native machine, since the name led me incorrectly. It took me a few reads to figure out that it was telling me the format, not that I was getting to choose it.

serhiy-storchaka commented 1 month ago

I looked how some of these projects use PyUnicode_FromKindAndData.

They usually use PyUnicode_FromKindAndData in only one or two places. PyUnicode_FromKindAndData can always be replaced with a Latin1/UTF16/UTF32 decoder (with the "surrogatepass" error handler for UTF16/UTF32). I think this is why it was not included in the limited C API. The examined projects do not need to use PyUnicode_FromKindAndData; there is always an alternative in the limited C API.

I do not see how PyUnicode_AsNativeFormat and PyUnicode_FromNativeFormat could help in the examined projects. They do not need conversion to/from the internal format, they need conversion to/from a specific format. To get the benefit of working with the internal representation, they would need to reorganize their code in a stringlib-like way, with a lot of macros and preprocessor directives (is that possible in Cython?). That has a high implementation and maintenance cost, it likely will not work with PyUnicode_NATIVE_UTF8, and in some cases you need a UCS4 output buffer, because the stringlib-like approach is bad for two-parameter specialization.

da-woods commented 1 month ago

The Cython usage discussed above is an attempt at micro-optimizing slicing unicode. I think the need to handle UTF-8 as a possibility would mean we couldn't usefully use this as a replacement here (because of the variable character length). In this case I think PyUnicode_Substring is exactly the right thing to use with the level of visibility we have in the Limited API.
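The indexing problem can be made concrete: with a fixed-width kind the i-th character is an array index, but with UTF-8 the byte offset of the i-th code point requires a linear scan (illustrative C, hypothetical helper name):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch: finding the i-th code point in UTF-8 is O(n),
 * because continuation bytes (0b10xxxxxx) must be skipped; with
 * UCS1/UCS2/UCS4 it is a constant-time array index. */
static size_t utf8_offset_of_codepoint(const uint8_t *s, size_t nbytes,
                                       size_t target)
{
    size_t cp = 0;
    for (size_t i = 0; i < nbytes; i++) {
        if ((s[i] & 0xC0) != 0x80) {  /* lead byte: new code point */
            if (cp == target)
                return i;             /* byte offset of code point */
            cp++;
        }
    }
    return nbytes;                    /* past the end */
}
```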

There's maybe a few places we could use PyUnicode_AsNativeFormat - possibly in optimizing things like __Pyx_PyUnicode_Equals. I'm not sure that one case is a hugely compelling use-case - the current fallback isn't awful. I think for a lot of uses the possibility of having to handle UTF8 would limit their usefulness because quick indexing isn't an option (although we could always special-case that to a fallback).

PyUnicode_FromNativeFormat would probably be more useful (maybe not for us to use directly, but as an API that users receiving text data from C might want to use).

vstinner commented 1 month ago

I think for a lot of uses the possibility of having to handle UTF8 would limit their usefulness because quick indexing isn't an option (although we could always special-case that to a fallback).

PyUnicode_NATIVE_UTF8 is not used internally by Python, so PyUnicode_AsNativeFormat() doesn't use it. I added it for PyPy, if PyPy would like to adopt this C API. Maybe I can remove it to avoid confusion.

da-woods commented 1 month ago

I added it for PyPy, if PyPy would like to adopt this C API. Maybe I can remove it to avoid confusion.

Is it OK for an implementation to have a native format that's not in the fixed list? And if so, should it set an exception? That suggests that, to be truly compatible, extension authors should be prepared to have a fallback non-optimized implementation for valid unicode objects without a recognized native format.

encukou commented 1 month ago

The result of PyUnicode_AsNativeFormat is only valid as long as you have a reference to the unicode object. IMO, if we add the function as proposed, it should have Borrow in the name, even though the thing that's borrowed isn't a Python object. But, as I said, I think we can do better.


I don't think the limited API should avoid optimizations. We just need to support the API & ABI. We shouldn't guarantee that a certain function remains fast in the future (for that, users do need to benchmark and sometimes re-build), but if something is fast now and still supportable tomorrow, IMO we can add it.

Anyway, the issue I have with the proposed API is:

What should the fallback be? Generally, if we can't do a zero-copy export (as PyUnicode_AsNativeFormat does), there are two next-best options:

I'll skip exploring the former here. (I did think about that part of the design space but didn't find much of interest: it led me to an API much like PyLong_AsNativeBytes, with an ominously inviting rabbit hole of array-export features like chunking that we should probably design for lists first.)

For the latter, we have a rather nice mechanism: PyBuffer. We could provide a function like this:

PyAPI_FUNC(int) PyUnicode_GetBuffer(
    PyObject *unicode,
    Py_buffer *buf,
    int buffer_flags,  // as for PyObject_GetBuffer
    int32_t supported_formats,
    int32_t *result_format);

with supported_formats being flags indicating what the caller can take, OR-ed together:

#define PyUnicode_BUFFER_FORMAT_ASCII 0x01
#define PyUnicode_BUFFER_FORMAT_UCS1 0x02
#define PyUnicode_BUFFER_FORMAT_UCS2 0x04
#define PyUnicode_BUFFER_FORMAT_UCS4 0x08
#define PyUnicode_BUFFER_FORMAT_UTF8 0x10

On success, *result_format would be set to the format of the result, i.e. just one of the flags. (We can't use Py_buffer.format since ASCII, UCS1 and UTF8 all deal with single bytes.)

The new idea here (AFAIK, at least for the stdlib) is that PyUnicode_Type wouldn't provide a bf_getbuffer (i.e. it wouldn't support PyObject_GetBuffer), since it needs additional info for the export. It would, however, support bf_releasebuffer.

The beauty of this design is that it can also do the zero-copy export (like PyUnicode_AsNativeFormat):

In other words, if you guess the internals correctly (or are prepared to handle enough alternative formats), you get a zero-copy export. If not, your code still works correctly (as long as the string is exportable to a format you can handle). I can't think of a use case where you'd need to know which of those happened, or where you'd need to check up-front.
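The negotiation can be sketched as a pure function over the flag masks (constants from the proposal; the selection order and the simplifications are my assumptions, e.g. a real implementation would also hand out ASCII data zero-copy to a caller that accepts UCS1 or UTF8):

```c
#include <assert.h>
#include <stdint.h>

#define PyUnicode_BUFFER_FORMAT_ASCII 0x01
#define PyUnicode_BUFFER_FORMAT_UCS1  0x02
#define PyUnicode_BUFFER_FORMAT_UCS2  0x04
#define PyUnicode_BUFFER_FORMAT_UCS4  0x08
#define PyUnicode_BUFFER_FORMAT_UTF8  0x10

/* Simplified sketch of the negotiation (selection order is an
 * assumption): prefer a zero-copy export of the internal format,
 * otherwise fall back to a conversion the caller accepts.
 * Returns 0 when no acceptable format exists. */
static int32_t pick_export_format(int32_t internal, int32_t supported)
{
    if (internal & supported)
        return internal;                      /* zero-copy */
    if (supported & PyUnicode_BUFFER_FORMAT_UCS4)
        return PyUnicode_BUFFER_FORMAT_UCS4;  /* widening always works */
    if (supported & PyUnicode_BUFFER_FORMAT_UTF8)
        return PyUnicode_BUFFER_FORMAT_UTF8;  /* encoding always works */
    return 0;                                 /* caller must raise */
}
```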


PyUnicode_GetBuffer adds some overhead over PyUnicode_AsNativeFormat:

If either of those are a performance issue, then you probably shouldn't be using the stable ABI ;)


If we do add the PyUnicode_BUFFER_* flags, then we should add a constructor for symmetry:

PyAPI_FUNC(PyObject*) PyUnicode_FromBuffer(
                       // (edit: removed mistaken argument)
    void *data,
    Py_ssize_t nbytes,
    int32_t data_format);

(It's not necessary to require the full Py_buffer struct here; its buf and len are easy enough to get if you have one.)


We've been exploring this kind of API with @davidhewitt. A rough draft/exploration branch is at https://github.com/davidhewitt/cpython/commits/stringbuffer/; it's currently on hiatus for real-life reasons.

zooba commented 1 month ago

We could provide a function like this: ... with supported_formats being flags indicating what the caller can take

Consider me a big fan (probably unsurprising given the similarity to my all-API forward compatibility proposal). An API that always works, is as fast as you let it be, and makes clear that you will need a fallback is about as ideal as I think things can get.

vstinner commented 1 month ago

I like the supported_formats idea! It solves the problem of PyPy, which uses UTF-8 internally and can surprise C extensions not designed/prepared for that.

Py_buffer is a bigger struct to fill. If that is a real problem, we should resurrect Victor's idea of a leaner buffer struct (but it'd still need a pointer, size, owner, and an internal field).

I abandoned my draft PEP PyResource for 2 reasons:

I dislike reusing the PyBuffer API since it's heavy and complicated. I don't see the purpose of buffer_flags, for example. PyBuffer supports non-contiguous memory, memory shapes, and other complicated features. All Python implementations use contiguous memory for strings; if a Python implementation uses non-contiguous memory, it should create a contiguous copy on demand.

I would prefer a more specialized ("simpler") API:

PyAPI_FUNC(const void*) PyUnicode_AsFormat(
    PyObject *unicode,
    int32_t supported_formats,
    Py_ssize_t *size,
    int32_t *result_format);

PyAPI_FUNC(void) PyUnicode_FreeFormat(
    PyObject *unicode, // <= the interesting part
    const void *data,
    Py_ssize_t size,
    int32_t format);

PyAPI_FUNC(PyObject*) PyUnicode_FromFormat(
    const void *data,
    Py_ssize_t size,
    int32_t format);

PyUnicode_AsFormat() would do nothing (just return a pointer) in most cases. In that case, PyUnicode_FreeFormat() does nothing.

PyUnicode_AsFormat() allocates memory if the native format is not among the supported formats. In that case, PyUnicode_FreeFormat() releases the memory. Passing unicode makes it possible to know whether the format requires releasing memory.
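One way to picture that contract (a hypothetical struct of mine, not part of the proposal, which instead infers ownership from the unicode object and the format): the export records whether a conversion allocated memory, so releasing a zero-copy export is a no-op.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sketch, not the proposed API: record at export time
 * whether memory was allocated, so that "free" is a no-op for a
 * zero-copy export.  The proposal infers this from the unicode
 * object and the format instead of storing a flag. */
typedef struct {
    const void *data;
    int owned;   /* 1 if 'data' was allocated for a conversion */
} export_view;

static void release_export(export_view *v)
{
    if (v->owned)
        free((void *)v->data);
    v->data = NULL;
    v->owned = 0;
}
```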

encukou commented 1 month ago

PyBuffer supports non-contiguous memory, it supports memory shapes, and other complicated stuff. All Python implementations use contiguous memory for strings.

It does, but it also allows the callers to specify which features they can handle. (Hmm, I think I've seen that before!) That's what buffer_flags does: if you use PyBUF_SIMPLE, complex features won't be used.

I like reusing API we already have, but I won't die on that hill.

vstinner commented 1 month ago

@zooba @serhiy-storchaka: Do you prefer PyBuffer or pointer+size for these specific APIs?

zooba commented 1 month ago

I'm fine with either. Maybe +0 towards PyBuffer just so that the API doesn't necessarily have to be allocation-free (that is, the object being freed is different from the original one - otherwise, we'd have to track all additional allocations against the original object).

But it does feel a bit heavy for a case where the point is to look inside our internal data structures. I'd prefer that kind of operation to feel as rough as possible, to discourage people from doing it 😉

davidhewitt commented 1 month ago

Sorry to be slow to reply here, I started writing a reply describing the investigation I did with @encukou and then they managed to finish writing first 😁

I am happy with either the buffer or specialized form of AsFormat for consumption from PyO3. We can make either API safe on the Rust side.

You are welcome to take my branch commit which @encukou and I built if it's useful. Currently lacks tests IIRC 🫣

vstinner commented 4 weeks ago

I updated the PR to a new API:

#define PyUnicode_FORMAT_ASCII 0x01
#define PyUnicode_FORMAT_UCS1 0x02
#define PyUnicode_FORMAT_UCS2 0x04
#define PyUnicode_FORMAT_UCS4 0x08
#define PyUnicode_FORMAT_UTF8 0x10

// Get the content of a string in the requested format:
// - Return the content, set '*size' and '*format' on success.
// - Set an exception and return NULL on error.
//
// The export must be released by PyUnicode_ReleaseExport().
PyAPI_FUNC(const void*) PyUnicode_Export(
    PyObject *unicode,
    uint32_t supported_formats,
    Py_ssize_t *size,
    uint32_t *format);

// Release an export created by PyUnicode_Export().
PyAPI_FUNC(void) PyUnicode_ReleaseExport(
    PyObject *unicode,
    const void* data,
    uint32_t format);

// Create a string object from a string in the format 'format'.
// - Return a reference to a new string object on success.
// - Set an exception and return NULL on error.
PyAPI_FUNC(PyObject*) PyUnicode_Import(
    const void *data,
    Py_ssize_t size,
    uint32_t format);

PyUnicode_ReleaseExport() only has to free memory if the string kind is not UCS4 and the requested format is UCS4.

It's possible to export any string to UCS4 and UTF-8. In the UTF-8 case, the encoded UTF-8 string is cached in the Unicode object, similar to PyUnicode_AsUTF8().

Currently, I have not implemented exporting UCS1 as UCS2; only conversion to UCS4 or UTF-8 is supported. Obviously, if the requested formats include ASCII and UCS1, the native format is used directly (no copy or conversion needed).

EDIT: I changed the format type to uint32_t.
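The UCS4 fallback amounts to a widening copy; a minimal sketch for the UCS1 case (illustrative, hypothetical helper name):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative sketch of why export to UCS4 always succeeds: any
 * narrower kind can be widened with a plain copy (UCS1 case shown).
 * This allocated buffer is what the release function must free. */
static uint32_t *widen_ucs1_to_ucs4(const uint8_t *s, size_t n)
{
    uint32_t *out = malloc(n * sizeof(uint32_t));
    if (out == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        out[i] = s[i];   /* each code point fits unchanged */
    return out;
}
```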

zooba commented 4 weeks ago

PyUnicode_ReleaseExport() only has to free memory if ...

Today, sure, but it protects us for the future. And allows other implementations to handle more cases.

I find "release export" a slightly uncomfortable name - did we have any other ideas? The API reminds me of LockBuffer and UnlockBuffer style APIs (or LockBits/UnlockBits, but I don't remember where that's from... bitmaps, perhaps?).

But I do prefer this naming to "native format", so consider me in favour, but happy to see a name change if someone comes up with a brilliant idea.

vstinner commented 3 weeks ago

did we have any other ideas?

I started with PyUnicode_FreeExport().

vstinner commented 3 weeks ago

Today, sure, but it protects us for the future. And allows other implementations to handle more cases.

Oh sure, I just described my current implementation.

vstinner commented 3 weeks ago

@encukou @serhiy-storchaka @da-woods: What do you think of the updated API https://github.com/python/cpython/issues/119609#issuecomment-2160673752?

encukou commented 3 weeks ago

I won't have time for a full review this week.

I used unsigned int for the format type. It was proposed to use uint32_t instead, I don't have a strong argument about it.

Please use uint32_t then. C only guarantees 16 bits in an int, but major platforms make it 32 bits. With int, 16 bits are “wasted” rather than “reserved for future formats”.

vstinner commented 3 weeks ago

Please use uint32_t then.

Ok, done.

vstinner commented 2 weeks ago

@encukou:

I won't have time for a full review this week.

Ping. Do you have time for a review this week? :-)

encukou commented 2 weeks ago

Hi, I sent my suggestions in a PR: https://github.com/vstinner/cpython/pull/3 Let me know what you think :)

vstinner commented 2 weeks ago

Let me know what you think :)

There is an issue with your PR: it has something like 150 commits.

encukou commented 2 weeks ago

Yes, I merged in the main branch to fix the conflicts. If you fix the conflicts in your PR, I can rebase and just leave the last 6 commits.

And of course I'd still prefer exporting PyBuffer, rather than making PyBuffer_Release take three arguments.

vstinner commented 2 weeks ago

@encukou: I integrated most of your suggestions in my PR. Please see the updated PR.

vstinner commented 2 weeks ago

I created Add PyUnicode_Export() and PyUnicode_Import() to the limited C API issue in the C API WG Decisions project.