python / cpython

The Python programming language
https://www.python.org

[C API] Add an efficient public PyUnicodeWriter API #119182

Closed vstinner closed 4 months ago

vstinner commented 5 months ago

Feature or enhancement

Creating a Python string object in an efficient way is complicated. Python has a private _PyUnicodeWriter API. It's being used by these projects:

Affected projects (5):

I propose making the API public to promote it and help C extension maintainers write more efficient code to create Python string objects.

API:

typedef struct PyUnicodeWriter PyUnicodeWriter;

PyAPI_FUNC(PyUnicodeWriter*) PyUnicodeWriter_Create(void);
PyAPI_FUNC(void) PyUnicodeWriter_Discard(PyUnicodeWriter *writer);
PyAPI_FUNC(PyObject*) PyUnicodeWriter_Finish(PyUnicodeWriter *writer);

PyAPI_FUNC(void) PyUnicodeWriter_SetOverallocate(
    PyUnicodeWriter *writer,
    int overallocate);

PyAPI_FUNC(int) PyUnicodeWriter_WriteChar(
    PyUnicodeWriter *writer,
    Py_UCS4 ch);
PyAPI_FUNC(int) PyUnicodeWriter_WriteUTF8(
    PyUnicodeWriter *writer,
    const char *str,  // decoded from UTF-8
    Py_ssize_t len);  // use strlen() if len < 0
PyAPI_FUNC(int) PyUnicodeWriter_Format(
    PyUnicodeWriter *writer,
    const char *format,
    ...);

// Write str(obj)
PyAPI_FUNC(int) PyUnicodeWriter_WriteStr(
    PyUnicodeWriter *writer,
    PyObject *obj);

// Write repr(obj)
PyAPI_FUNC(int) PyUnicodeWriter_WriteRepr(
    PyUnicodeWriter *writer,
    PyObject *obj);

// Write str[start:end]
PyAPI_FUNC(int) PyUnicodeWriter_WriteSubstring(
    PyUnicodeWriter *writer,
    PyObject *str,
    Py_ssize_t start,
    Py_ssize_t end);

The internal writer buffer is overallocated by default. PyUnicodeWriter_Finish() truncates the buffer to the exact size if the buffer was overallocated.

Overallocation reduces the cost of quadratic complexity when adding short strings in a loop. Use PyUnicodeWriter_SetOverallocate(writer, 0) to disable overallocation just before the last write.
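To make the overallocation idea concrete, here is a minimal sketch in plain C of a byte buffer that can grow with or without overallocation. ToyWriter and its functions are hypothetical names invented for this illustration, and the growth factor of 2 is an assumption; this is not CPython's implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy writer over bytes; not CPython's implementation. */
typedef struct {
    char *buf;
    size_t size;        /* bytes written so far */
    size_t allocated;   /* capacity of buf */
    int overallocate;   /* non-zero: grow by more than requested */
    unsigned reallocs;  /* number of realloc() calls, for illustration */
} ToyWriter;

void toy_init(ToyWriter *w, int overallocate)
{
    assert(w != NULL);
    memset(w, 0, sizeof(*w));
    w->overallocate = overallocate;
}

/* Append len bytes; returns 0 on success, -1 on memory error. */
int toy_write(ToyWriter *w, const char *s, size_t len)
{
    size_t needed = w->size + len;
    if (needed > w->allocated) {
        /* Growth factor of 2 is an assumption chosen for the sketch. */
        size_t newalloc = w->overallocate ? needed * 2 : needed;
        char *p = realloc(w->buf, newalloc);
        if (p == NULL) {
            return -1;
        }
        w->buf = p;
        w->allocated = newalloc;
        w->reallocs++;
    }
    memcpy(w->buf + w->size, s, len);
    w->size = needed;
    return 0;
}

/* Shrink the buffer to the exact size, as Finish() is described to do. */
int toy_finish(ToyWriter *w)
{
    if (w->size > 0 && w->allocated > w->size) {
        char *p = realloc(w->buf, w->size);
        if (p == NULL) {
            return -1;
        }
        w->buf = p;
        w->allocated = w->size;
    }
    return 0;
}

/* Write n one-byte strings and report how many reallocations happened. */
unsigned toy_count_reallocs(int overallocate, size_t n)
{
    ToyWriter w;
    toy_init(&w, overallocate);
    for (size_t i = 0; i < n; i++) {
        if (toy_write(&w, "x", 1) < 0) {
            break;
        }
    }
    unsigned count = w.reallocs;
    free(w.buf);
    return count;
}
```

In this toy model, writing 1000 single bytes without overallocation reallocates on every write, while the overallocating variant reallocates only a handful of times.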

The writer takes care of the internal buffer kind: Py_UCS1 (latin1), Py_UCS2 (BMP) or Py_UCS4 (full Unicode Character Set). It also implements an optimization if a single write is made using PyUnicodeWriter_WriteStr(): it returns the string unchanged without any copy.
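The buffer kind selection can be sketched in plain C as follows. The toy_kind names are hypothetical, and the "widen only, never narrow" rule is an assumption of this sketch rather than a statement about the actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes per character for each storage kind (mirrors the Py_UCS1/2/4 idea). */
enum toy_kind { TOY_KIND_1 = 1, TOY_KIND_2 = 2, TOY_KIND_4 = 4 };

/* Smallest kind able to store every code point up to maxchar. */
enum toy_kind toy_kind_for(uint32_t maxchar)
{
    assert(maxchar <= 0x10FFFF);  /* highest valid Unicode code point */
    if (maxchar <= 0xFF) {
        return TOY_KIND_1;        /* latin1 range */
    }
    if (maxchar <= 0xFFFF) {
        return TOY_KIND_2;        /* Basic Multilingual Plane */
    }
    return TOY_KIND_4;            /* full Unicode range */
}

/* Kind needed after appending n code points to a buffer of kind current.
   Assumption of this sketch: the buffer is widened when needed and
   never narrowed. */
enum toy_kind toy_kind_after_write(enum toy_kind current,
                                   const uint32_t *chars, size_t n)
{
    uint32_t maxchar = 0;
    for (size_t i = 0; i < n; i++) {
        if (chars[i] > maxchar) {
            maxchar = chars[i];
        }
    }
    enum toy_kind needed = toy_kind_for(maxchar);
    return needed > current ? needed : current;
}
```

A caller of the real API never does this by hand; the point of the writer is that this bookkeeping stays internal.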


Example of usage (simplified code from Python/unionobject.c):

static PyObject *
union_repr(PyObject *self)
{
    unionobject *alias = (unionobject *)self;
    Py_ssize_t len = PyTuple_GET_SIZE(alias->args);

    PyUnicodeWriter *writer = PyUnicodeWriter_Create();
    if (writer == NULL) {
        return NULL;
    }

    for (Py_ssize_t i = 0; i < len; i++) {
        if (i > 0 && PyUnicodeWriter_WriteUTF8(writer, " | ", 3) < 0) {
            goto error;
        }
        PyObject *p = PyTuple_GET_ITEM(alias->args, i);
        if (PyUnicodeWriter_WriteRepr(writer, p) < 0) {
            goto error;
        }
    }
    return PyUnicodeWriter_Finish(writer);

error:
    PyUnicodeWriter_Discard(writer);
    return NULL;
}

vstinner commented 5 months ago

Benchmark using:

bench_concat: Mean +- std dev: 2.07 us +- 0.03 us
bench_writer: Mean +- std dev: 894 ns +- 13 ns

PyUnicodeWriter is 2.3x faster than PyUnicode_Concat()+PyUnicode_Append().

The difference comes from overallocation: if I add PyUnicodeWriter_SetOverallocate(writer, 0); after PyUnicodeWriter_Create(), PyUnicodeWriter has the same performance as PyUnicode_Concat()+PyUnicode_Append(). Overallocation avoids the quadratic complexity of str += str (well, at least, it reduces the complexity).
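The quadratic versus linear behavior can be illustrated by counting copied characters in a simplified worst-case model. Both functions below are hypothetical and assume that every exact-size append copies the whole string, which a real allocator may avoid by resizing in place.

```c
#include <assert.h>
#include <stdint.h>

/* Characters copied when building an n-character string one character
   at a time, assuming the whole string is copied on each append. */
uint64_t copies_exact(uint64_t n)
{
    assert(n <= UINT32_MAX);  /* keeps the sum within 64 bits */
    uint64_t copied = 0;
    for (uint64_t size = 0; size < n; size++) {
        copied += size + 1;   /* old content plus the new character */
    }
    return copied;
}

/* Same workload, but the capacity doubles when the buffer is full,
   so the old content is only copied when the buffer grows. */
uint64_t copies_doubling(uint64_t n)
{
    assert(n <= UINT32_MAX);
    uint64_t copied = 0;
    uint64_t capacity = 0;
    for (uint64_t size = 0; size < n; size++) {
        if (size == capacity) {
            copied += size;   /* move the old content to the new buffer */
            capacity = capacity ? capacity * 2 : 1;
        }
        copied += 1;          /* the new character */
    }
    return copied;
}
```

In this model the exact-size strategy copies n(n+1)/2 characters, while the doubling strategy stays below 3n.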

The PyUnicodeWriter API makes overallocation easy to use.

cc @serhiy-storchaka

vstinner commented 5 months ago

By the way, PyPy provides __pypy__.builders.StringBuilder for "Fast String Concatenation": https://doc.pypy.org/en/latest/__pypy__-module.html#fast-string-concatenation to work around the str += str quadratic complexity.

$ pypy3.9 
>>>> import __pypy__
>>>> b=__pypy__.builders.StringBuilder()
>>>> b.append('x')
>>>> b.append('=')
>>>> b.append('value')
>>>> b.build()
'x=value'

vstinner commented 5 months ago

Article about this performance problem in Python: https://lwn.net/Articles/816415/

gvanrossum commented 5 months ago

Curious if this warrants a further API PyUnicodeWriter_WriteStr(writer, obj) which appends repr(obj) (just as WriteStr(writer, obj) can be seen to append str(obj)), and eventually the development of a new type slot that writes the repr or str of an object to a writer rather than returning a string object. (And maybe even an "WriteAscii" to write ascii(obj)) and WriteFormat to do something with formats. :-)

I know, I know, hyper-generalization, yet this is what the Union example is screaming for... I suppose we can add those later.

How long has the internal writer API existed?

Would these be in the Stable ABI / Limited API from the start? (API-wise these look stable.)

vstinner commented 5 months ago

Curious if this warrants a further API PyUnicodeWriter_WriteStr(writer, obj) which appends repr(obj)

I suppose that you mean PyUnicodeWriter_WriteRepr().

Curious if this warrants a further API PyUnicodeWriter_WriteStr(writer, obj) which appends repr(obj) (just as WriteStr(writer, obj) can be seen to append str(obj)), and eventually the development of a new type slot that writes the repr or str of an object to a writer rather than returning a string object. (And maybe even an "WriteAscii" to write ascii(obj)) and WriteFormat to do something with formats. :-)

There is already a collection of helper functions accepting a writer and I find this really cool. It's not "slot-based", since each function has many formatting options.

extern int _PyLong_FormatWriter(
    _PyUnicodeWriter *writer,
    PyObject *obj,
    int base,
    int alternate);

extern int _PyLong_FormatAdvancedWriter(
    _PyUnicodeWriter *writer,
    PyObject *obj,
    PyObject *format_spec,
    Py_ssize_t start,
    Py_ssize_t end);

extern int _PyFloat_FormatAdvancedWriter(
    _PyUnicodeWriter *writer,
    PyObject *obj,
    PyObject *format_spec,
    Py_ssize_t start,
    Py_ssize_t end);

extern int _PyComplex_FormatAdvancedWriter(
    _PyUnicodeWriter *writer,
    PyObject *obj,
    PyObject *format_spec,
    Py_ssize_t start,
    Py_ssize_t end);

extern int _PyUnicode_FormatAdvancedWriter(
    _PyUnicodeWriter *writer,
    PyObject *obj,
    PyObject *format_spec,
    Py_ssize_t start,
    Py_ssize_t end);

extern Py_ssize_t _PyUnicode_InsertThousandsGrouping(
    _PyUnicodeWriter *writer,
    Py_ssize_t n_buffer,
    PyObject *digits,
    Py_ssize_t d_pos,
    Py_ssize_t n_digits,
    Py_ssize_t min_width,
    const char *grouping,
    PyObject *thousands_sep,
    Py_UCS4 *maxchar);

These functions avoid memory copies. For example, _PyLong_FormatWriter() writes digits directly into the writer buffer, without needing a temporary buffer.
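The "no temporary buffer" technique can be sketched in plain C: count the digits first, then write them in place from the last position backwards. count_digits() and write_digits() are hypothetical helpers for illustration, not the code of _PyLong_FormatWriter().

```c
#include <assert.h>
#include <stddef.h>

/* Number of decimal digits in value (at least 1). */
size_t count_digits(unsigned long value)
{
    size_t n = 1;
    while (value >= 10) {
        value /= 10;
        n++;
    }
    return n;
}

/* Write the decimal digits of value directly at dest[pos...].
   The caller must have reserved count_digits(value) characters there.
   Returns the position just after the last digit written. */
size_t write_digits(char *dest, size_t pos, unsigned long value)
{
    assert(dest != NULL);
    size_t end = pos + count_digits(value);
    /* Fill from the least significant digit backwards, in place. */
    for (size_t i = end; i > pos; i--) {
        dest[i - 1] = (char)('0' + value % 10);
        value /= 10;
    }
    return end;
}
```

Because the length is known before writing, the destination can be reserved once and no intermediate string is created.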

How long has the internal writer API existed?

12 years: I added it in 2012.

commit 202fdca133ce8f5b0c37cca1353070e0721c688d
Author: Victor Stinner <victor.stinner@gmail.com>
Date:   Mon May 7 12:47:02 2012 +0200

    Close #14716: str.format() now uses the new "unicode writer" API instead of the
    PyAccu API. For example, it makes str.format() from 25% to 30% faster on Linux.

I wrote this API to fix the major performance regression after PEP 393 – Flexible String Representation was implemented. After my optimization work, many string operations on Unicode objects became faster than Python 2 operations on bytes! Especially when treating only ASCII characters, which is the most common case. I mostly optimized str.format() and str % args, which are powerful but complex.

In 2016, I wrote an article about the two "writer" APIs that I wrote to optimize: https://vstinner.github.io/pybyteswriter.html

Would these be in the Stable ABI / Limited API from the start? (API-wise these look stable.)

I would prefer to not add it to the limited C API directly, but wait one Python version to see how it goes.

gvanrossum commented 5 months ago

(Yes, I meant WriteRepr.) I like these other helpers -- can we just add them all to the public API? Or are there issues with any of them?

vstinner commented 5 months ago

(Yes, I meant WriteRepr.) I like these other helpers -- can we just add them all to the public API? Or are there issues with any of them?

I added the following function which should fit most of these use cases:

PyAPI_FUNC(int) PyUnicodeWriter_FromFormat(
    PyUnicodeWriter *writer,
    const char *format,
    ...);

Example to write repr(obj):

PyUnicodeWriter_FromFormat(writer, "%R", obj);

Example to write str(obj):

PyUnicodeWriter_FromFormat(writer, "%S", obj);

It's the same format as PyUnicode_FromFormat(). Example:

PyUnicodeWriter_FromFormat(writer, "Hello %s, %i.", "Python", 123);
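As a rough plain C analogue of appending formatted text to a growing buffer, the sketch below uses vsnprintf() in two passes: one to measure, one to write. FmtBuf and fmtbuf_format() are hypothetical names, and this handles only standard printf formats, not the %R or %S extensions of PyUnicode_FromFormat().

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical growable text buffer, kept NUL-terminated. */
typedef struct {
    char *buf;
    size_t size;  /* characters stored, excluding the trailing NUL */
} FmtBuf;

/* Append printf-style formatted text; returns 0 on success, -1 on error. */
int fmtbuf_format(FmtBuf *b, const char *format, ...)
{
    assert(b != NULL && format != NULL);
    va_list ap, ap2;
    va_start(ap, format);
    va_copy(ap2, ap);

    /* First pass: measure the formatted length without writing. */
    int n = vsnprintf(NULL, 0, format, ap);
    va_end(ap);
    if (n < 0) {
        va_end(ap2);
        return -1;
    }

    char *p = realloc(b->buf, b->size + (size_t)n + 1);
    if (p == NULL) {
        va_end(ap2);
        return -1;
    }
    b->buf = p;

    /* Second pass: format directly at the end of the buffer. */
    vsnprintf(b->buf + b->size, (size_t)n + 1, format, ap2);
    va_end(ap2);
    b->size += (size_t)n;
    return 0;
}
```

Starting from an empty FmtBuf, calling fmtbuf_format(&b, "Hello %s, %i.", "Python", 123) appends the formatted text in place; the caller frees b.buf when done.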

encukou commented 5 months ago

Thank you, this looks very useful!

I see that PyUnicodeWriter_Finish frees the writer. That's great; it allows optimizations we can also use in other writers/builders in the future. (Those should have a consistent API.) One thing to note is that PyUnicodeWriter_Finish should free the writer even when an error occurs. Maybe PyUnicodeWriter_Free should be named e.g. PyUnicodeWriter_Discard to emphasize that you should only call it if you didn't Finish.

The va_arg function is problematic for non-C languages, but it's possible to get the functionality with other functions – especially if we add a number-writing helper, so I'm OK with adding it.

The proposed API is nice and minimal. My bet about what users will ask for next goes to PyUnicodeWriter_WriteUTF8String (for IO) & PyUnicodeWriter_WriteUTF16String (for Windows or Java interop).

Name bikeshedding:


I see the PR hides underscored API that some existing projects use. I thought we weren't doing that any more.

vstinner commented 5 months ago

PyUnicodeWriter_WriteUCS4Char rather than PyUnicodeWriter_WriteChar -- character is an overloaded term, let's be specific.

"WriteChar" name comes from PyUnicode_ReadChar() and PyUnicode_WriteChar() names. I don't think that mentioning UCS4 is useful.

PyUnicodeWriter_WriteFormat (or WriteFromFormat?) rather than PyUnicodeWriter_FromFormat -- it's writing, not creating a writer.

I would prefer just "PyUnicodeWriter_Format()". I prefer to not support str.format() which is more a "Python API" than a C API. It's less convenient to use in C. If we don't support str.format(), "PyUnicodeWriter_Format()" is fine for the "PyUnicode_FromFormat()" variant.

encukou commented 5 months ago

Yeah, PyUnicodeWriter_Format sounds good. It avoids the PyX_FromY scheme we use for constructing new objects.

I think that using unqualified Char for a UCS4 codepoint was a mistake we shouldn't continue, but I'm happy to be outvoted on that.

vstinner commented 5 months ago

The proposed API is nice and minimal. My bet about what users will ask for next goes to PyUnicodeWriter_WriteUTF8String (for IO) & PyUnicodeWriter_WriteUTF16String (for Windows or Java interop).

I propose to add PyUnicodeWriter_WriteString() which decodes from UTF-8 (in strict mode).

PyUnicodeWriter_WriteASCIIString() has undefined behavior if the string contains non-ASCII characters. Maybe it should be removed in favor of PyUnicodeWriter_WriteString() which is safer (well-defined behavior for non-ASCII characters: decode them from UTF-8).

serhiy-storchaka commented 5 months ago

The main problem with the current private _PyUnicodeWriter C API is that it requires allocating the _PyUnicodeWriter value on the stack, but its layout is an implementation detail, and exposing such an API would prevent future changes. The proposed new C API allocates the data in dynamic memory, which makes it more portable and future-proof. But this can add additional overhead. Also, if we use dynamic memory, why not make PyUnicodeWriter a subclass of PyObject? Then Py_DECREF could be used to destroy it, we could store multiple writers in a collection, and we could even provide a Python interface for it.
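The trade-off between a stack-allocated structure and a heap-allocated opaque handle can be sketched in plain C. Builder and its functions are hypothetical names for illustration; the sketch also shows a finish function that consumes the handle, so discard is only needed on error paths.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* What a public header would expose: an incomplete type and functions.
   Callers cannot declare a Builder on the stack or touch its members. */
typedef struct Builder Builder;
Builder *builder_create(void);
int builder_write(Builder *b, const char *s, size_t len);
char *builder_finish(Builder *b);  /* consumes b, even on error */
void builder_discard(Builder *b);  /* only if finish was not called */

/* What stays private: the layout can change without breaking callers. */
struct Builder {
    char *buf;
    size_t size;
};

Builder *builder_create(void)
{
    return calloc(1, sizeof(Builder));
}

int builder_write(Builder *b, const char *s, size_t len)
{
    assert(b != NULL);
    char *p = realloc(b->buf, b->size + len + 1);
    if (p == NULL) {
        return -1;
    }
    memcpy(p + b->size, s, len);
    b->buf = p;
    b->size += len;
    b->buf[b->size] = '\0';
    return 0;
}

char *builder_finish(Builder *b)
{
    assert(b != NULL);
    char *result = b->buf;
    if (result == NULL) {
        result = calloc(1, 1);  /* empty string; NULL on memory error */
    }
    free(b);                    /* the handle is gone either way */
    return result;
}

void builder_discard(Builder *b)
{
    if (b != NULL) {
        free(b->buf);
        free(b);
    }
}
```

The cost is one extra allocation per builder; the benefit is that the structure layout never becomes part of the ABI.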

vstinner commented 5 months ago

The proposed new C API allocates the data in dynamic memory, which makes it more portable and future proof. But this can add additional overhead.

I ran benchmarks and using the proposed public API remains interesting in terms of performance: see benchmarks below.

Also, if we use dynamic memory, why not make PyUnicodeWriter a subclass of PyObject? Then Py_DECREF could be used to destroy it, we could store multiple writers in a collection, and we can even provide Python interface for it.

Adding a Python API is appealing, but I prefer to restrict this discussion to a C API and only discuss later the idea of exposing it at the Python level.

For the C API, I don't think that Py_DECREF() semantics and inheriting from PyObject are really worth it.

vstinner commented 5 months ago

I renamed functions:

vstinner commented 5 months ago

@encukou:

I see the PR hides underscored API that some existing projects use. I thought we weren't doing that any more.

Right, I would like to hide/remove the internal API from the public C API in Python 3.14 while adding the new public C API. The private _PyUnicodeWriter API exposes the _PyUnicodeWriter structure (members). Its API is more complicated and more error-prone.

I prepared a PR for pythoncapi-compat to check that it's possible to implement the new API on Python 3.6-3.13: https://github.com/python/pythoncapi-compat/pull/95

serhiy-storchaka commented 5 months ago

There is some confusion with names. The String suffix usually means the C string (const char *) argument. Str is only used in PyObject_Str() which is the C analogue of the str() function.

So, for consistency we should use PyUnicodeWriter_WriteString() for writing the C string. This leaves us with the question of what to do with Python strings. PyUnicodeWriter_WriteStr() implies that str() is called on the argument. Even if we add such an API, it is also worth having a more restricted function which fails if a non-string is passed by accident.

vstinner commented 5 months ago

This leaves us with the question of what to do with Python strings.

We can refer to them as "Unicode", such as: PyUnicodeWriter_WriteUnicode(). Even if the Python type is called "str", in C, it's the PyUnicodeObject: https://docs.python.org/dev/c-api/unicode.html

serhiy-storchaka commented 5 months ago

Unfortunately Unicode as a suffix was used in the legacy C API related to Py_UNICODE *. Currently it is only used in the C API to "unicode-escape" and "raw-unicode-escape", so we could restore it with a new meaning, but it will be the first case of using it in this role.

Perhaps we can just omit any suffix and use PyUnicodeWriter_Write()?

vstinner commented 5 months ago

About bikeshedding. PyPy provides __pypy__.builders.StringBuilder with append() and build() methods. Do you think that a "String Builder" API with append and build method names makes more sense than a "Unicode Writer" with write and finish methods?

encukou commented 5 months ago

I'd prefer a PyUnicodeWriter_WriteStr that calls PyObject_Str() (which should be cheap for actual PyUnicode objects). It would pair well with the proposed future additions, PyUnicodeWriter_WriteRepr & PyUnicodeWriter_WriteAscii :)

PyUnicodeWriter_WriteSubstring can still take PyUnicode only.

PyUnicodeWriter_WriteUTF8 is a good name. Do you want to support zero-terminated strings (e.g. by passing -1 as the length)?

vstinner commented 5 months ago

I'd prefer a PyUnicodeWriter_WriteStr that calls PyObject_Str() (which should be cheap for actual PyUnicode objects). It would pair well with the proposed future additions, PyUnicodeWriter_WriteRepr & PyUnicodeWriter_WriteAscii :)

As written previously, you can already use:

Currently, there is no optimization for these code paths. It's the same as creating a temporary string, writing the string, and deleting the string. It's just a convenient API for that. Later we can imagine further optimizations.

Proposed PyUnicodeWriter_WriteStr() / PyUnicodeWriter_WriteString() (not sure about the name) fails with TypeError if the argument is not a Python str object. I'm only looking for a good name for the function. I don't want to call PyObject_Str(). The API is really designed for performance. It should do as little work as possible and have a straightforward API.

If later we consider that a new function would be added, I would prefer PyUnicodeWriter_WriteObjectStr() name for str(obj).

PyUnicodeWriter_WriteUTF8 is a good name. Do you want to support zero-terminated strings (e.g. by passing -1 as the length)?

I didn't write the API documentation yet. It's already supported, passing -1 already calls strlen().
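The "negative length means NUL-terminated" convention can be sketched as a small plain C helper. resolve_length() is a hypothetical name for illustration and is not part of the proposed API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Resolve a length argument where a negative value means
   "the string is NUL-terminated, measure it with strlen()". */
size_t resolve_length(const char *str, ptrdiff_t len)
{
    assert(str != NULL);
    return len < 0 ? strlen(str) : (size_t)len;
}
```

An explicit length lets a caller write a slice of a larger buffer or data containing embedded NUL bytes, while -1 keeps the common string-literal case short.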

vstinner commented 5 months ago

PyUnicodeWriter_WriteSubstring can still take PyUnicode only.

Right, it raises TypeError if the argument is not a Python str object. Same as PyUnicodeWriter_WriteString().

encukou commented 5 months ago

IMO, the best name is PyUnicodeWriter_WriteStr, except it's a bit ambiguous -- people might expect it to call str(). We can solve the ambiguity by simply making it do that, as a convenience to the user. It won't affect performance in any meaningful way.

vstinner commented 5 months ago

IMO, the best name is PyUnicodeWriter_WriteStr, except it's a bit ambiguous -- people might expect it to call str().

I don't see why users would expect that. I don't know of any existing API with a similar name which calls str(); only PyObject_Str() calls it. If it's ambiguous, we can make it explicit in the documentation.

It won't affect performance in any meaningful way.

It's not about performance, but the API. I want a function to only write a string, and nothing else.

gvanrossum commented 5 months ago

It's not about performance, but the API. I want a function to only write a string, and nothing else.

But why? This API feels more like print() (which implicitly calls str() if needed) or f-string interpolation (which does something similar) and less like TextIO.write() (which insists on a str instance). I like this convenience.

vstinner commented 5 months ago

My issue is that my proposed API is based on an existing implementation which has been around for 12 years. It's hard for me to think "out of the box" to design a new, better API, but that's why I opened this discussion :-) To get other opinions to help me design a better, more usable API.

If the majority prefers calling str(), ok, let's switch to that for PyUnicodeWriter_WriteStr().

I checked the Python code base: there are a few places using repr() with a writer: dict, list, tuple, union, context, token, etc. So I propose to also add PyUnicodeWriter_WriteRepr().

vstinner commented 5 months ago

Update:

vstinner commented 5 months ago

I opened an issue for the C API Working Group: https://github.com/capi-workgroup/decisions/issues/27

vstinner commented 4 months ago

I opened an issue for the C API Working Group: https://github.com/capi-workgroup/decisions/issues/27

API approved. I merged my PR.

vstinner commented 4 months ago

I ran again https://github.com/python/cpython/issues/119182#issuecomment-2119488134 benchmark: PyUnicodeWriter is 1.8x faster than string concatenation.

$ env/bin/python bench.py 
.....................
bench_concat: Mean +- std dev: 1.13 us +- 0.02 us
.....................
bench_writer: Mean +- std dev: 632 ns +- 22 ns

vstinner commented 4 months ago

The API was added, it comes with its test suite (in Python!), and many string types are now supported (UTF-8, wide string, UCS4, etc.). I close the issue.

vstinner commented 3 months ago

See also https://github.com/python/cpython/issues/121710 : [C API] Add PyBytesWriter API.