[ctypes] [feature request] Create a Python string without buffer copies given a bytes pointer, size, kind

python / cpython

[ctypes] [feature request] Create a Python string without buffer copies given a bytes pointer, size, kind #104689

Closed vadimkantorov closed 1 month ago

vadimkantorov commented 1 year ago

In interop scenarios, it might be useful to be able to have a Python string referencing an existing buffer without copies (e.g. if the underlying char data is stored in NumPy/PyTorch tensors, accessing these char buffers with a standard Python interface is helpful for debugging and sometimes for perf).

I think it currently might be possible with ctypes, by making use of the existing PyUnicodeObject/PyASCIIObject layout and resetting the size/data fields to my own values.

I agree that the advantage over simply copying the byte buffer is not very prominent, but it still might be useful in some specific scenarios, e.g. mmap'ing a giant string from a disk file and being able to examine it in an easy way.

sunmy2019 commented 1 year ago

CPython strings do not need to be copied, since they are immutable.

> mmap'ing a giant string

Use mmap.mmap

> making use of existing PyUnicode/PyASCIIobject object

A lot of code needs to be changed. Cost > Benefits.
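
The `mmap.mmap` suggestion can be sketched roughly as follows. This is a minimal illustration, not a zero-copy `str`: the helper names `find_in_file` and `read_text_window` are hypothetical, searching happens directly on the mapped pages, and only the requested window is copied when it is decoded. It assumes a non-empty file, since mapping an empty file raises `ValueError`.

```python
import mmap


def find_in_file(path: str, needle: bytes) -> int:
    """Return the byte offset of `needle` in the file, or -1 if absent.

    The file is mapped read-only, so it is not read into a Python
    bytes object up front; pages are loaded by the OS on demand.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm.find(needle)


def read_text_window(path: str, start: int, length: int,
                     encoding: str = "utf-8") -> str:
    """Decode only bytes [start, start + length) of the mapped file.

    Slicing the mmap copies just this window into a bytes object,
    which is then decoded into a regular str.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[start:start + length].decode(encoding)
```

For example, `read_text_window(path, offset, 200)` lets you inspect 200 bytes of a very large file as text while copying only those 200 bytes.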

sunmy2019 commented 1 year ago

You can always use Python buffer protocol in your use cases.
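
As a rough sketch of the buffer-protocol route: a `memoryview` references an existing buffer without copying, and a copy is made only when a slice is decoded into a `str`. The `bytearray` below is a hypothetical stand-in for tensor storage; NumPy arrays also expose the buffer protocol.

```python
# Hypothetical stand-in for char data held in an external buffer.
storage = bytearray(b"alpha\x00beta\x00gamma")

view = memoryview(storage)   # references storage, no copy
field = view[6:10]           # slicing a memoryview also makes no copy

# Decoding is the only step that copies: 4 bytes become a new str.
text = str(field, "ascii")

# The view tracks the underlying buffer: an in-place write to storage
# is visible through `field`, while the already-decoded `text` is not affected.
storage[6] = ord("B")
```

After the write, `field.tobytes()` reflects the change, which shows that `field` aliases `storage` rather than owning its own data.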

vadimkantorov commented 1 year ago

I agree that the use case is narrow, but if the hack I'm thinking of:

1. construct an empty or 1-character-long string with the user-provided kind
2. replace the size and data fields with user-provided values

is possible, then this function can be relatively simple to implement in the ctypes module without any redesign of the string data structures, and useful in some debugging/zero-copy interop scenarios (e.g. if the underlying char data is stored in NumPy/PyTorch tensors).

encukou commented 1 month ago

This is impossible with current str (and bytes) memory layout, where the data directly follows the header. Adding pointer indirection (and ownership/lifetime tracking) would be a big change, and it would likely get general C API rather than only live in ctypes.
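
The inline layout can be observed indirectly from Python. The sketch below assumes CPython (3.3 or later, where the compact string representation exists); `bytes_per_char` is a hypothetical helper. It measures how much `sys.getsizeof` grows per added character, which is consistent with the character data being part of the string object's own allocation.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Growth in the str object's own size per added character.

    Compares an (n + 1)-character string with a 1-character string
    made of the same character, so header size and the trailing
    terminator cancel out. CPython-specific.
    """
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n
```

On CPython this is expected to give 1 for `"a"` (ASCII) and `"é"` (Latin-1), 2 for `"€"` (UCS-2) and 4 for `"😀"` (UCS-4), matching the `kind` values mentioned in the issue title.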

Please do use the buffer protocol (memoryview) for this; that's made for the use cases you mention.

If the buffer protocol is not enough, please discuss large ideas like this on Discourse. This issue tracker isn't a good place.

vadimkantorov commented 1 month ago

But is the char content currently accessed via some data pointer that is also stored in the string? If so, this data pointer field could be used for setting up this indirection. If the layout currently has no explicit pointer field, I agree this is impossible for now.

Noted about the Discourse, thanks!

encukou commented 1 month ago

See Include/cpython/unicodeobject.h in the source for the current layout(s). There is indeed no pointer; the data follows the header directly. Even if there was a pointer:

- Public API would also need to consider future changes we want to make, and possibly supporting other implementations.
- Who calls free() on the data? Or should it be munmap()?

vadimkantorov commented 1 week ago

Thanks for the reference to the relevant header.

/* Non-ASCII strings allocated through PyUnicode_New use the
   PyCompactUnicodeObject structure. state.compact is set, and the data
   immediately follow the structure. */
typedef struct {
    PyASCIIObject _base;
    Py_ssize_t utf8_length;     /* Number of bytes in utf8, excluding the
                                 * terminating \0. */
    char *utf8;                 /* UTF-8 representation (null-terminated) */
} PyCompactUnicodeObject;

/* Object format for Unicode subclasses. */
typedef struct {
    PyCompactUnicodeObject _base;
    union {
        void *any;
        Py_UCS1 *latin1;
        Py_UCS2 *ucs2;
        Py_UCS4 *ucs4;
    } data;                     /* Canonical, smallest-form Unicode buffer */
} PyUnicodeObject;

Are char *utf8; and the data union (void *any etc.) not these explicit pointer fields? (Yeah, I realize that in a typical Python str allocation the actual data buffer follows the header in memory, but these looked like pointers which could theoretically point somewhere else, and not just to the trailing data.)