Closed vadimkantorov closed 1 month ago
CPython strings do not need to be copied since they are immutable.
> mmap'ing a giant string

Use `mmap.mmap`.
> making use of existing PyUnicode/PyASCIIObject object

A lot of code would need to be changed. Cost > Benefits.
You can always use the Python buffer protocol for your use cases.
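For the "giant string on disk" case, a minimal sketch of the `mmap.mmap` suggestion might look like this. The helper name `find_and_decode` and its `context` parameter are made up for illustration; only the bytes around the match are copied and decoded, while the rest of the file stays mapped and untouched.

```python
import mmap
import os
import tempfile


def find_and_decode(path, needle, context=16, encoding="utf-8"):
    """Return the decoded text around the first occurrence of `needle` (bytes), or None.

    Hypothetical helper: the file is memory-mapped read-only, and only the
    small slice around the match is copied into a bytes object and decoded.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(needle)
            if pos < 0:
                return None
            start = max(0, pos - context)
            stop = min(len(mm), pos + len(needle) + context)
            # errors="replace" because the slice may cut a multi-byte character.
            return mm[start:stop].decode(encoding, errors="replace")


# Usage: a throwaway file standing in for the giant string on disk.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 1000 + b"needle" + b"y" * 1000)
    demo_path = tmp.name
try:
    snippet = find_and_decode(demo_path, b"needle", context=3)
finally:
    os.remove(demo_path)
print(snippet)
```

This does not give you a `str` backed by the mapping; it only keeps the copy limited to the part you actually look at.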
I agree that the use case is narrow, but if the hack I'm thinking of (resetting the `size` and `data` fields with user-provided values) is possible, then this function could be relatively simple to implement in the `ctypes` module without any redesign of string data structures, and useful in some debugging/zero-copy interop scenarios (e.g. if the underlying char data is stored in NumPy/PyTorch tensors).
This is impossible with the current `str` (and `bytes`) memory layout, where the data directly follows the header. Adding pointer indirection (and ownership/lifetime tracking) would be a big change, and it would likely get a general C API rather than only live in `ctypes`.
Please do use the buffer protocol (`memoryview`) for this; that's made for the use cases you mention.
If the buffer protocol is not enough, please discuss large ideas like this on Discourse. This issue tracker isn't a good place.
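As a rough illustration of the `memoryview` route: in the sketch below a `bytearray` stands in for tensor storage (an assumption, chosen to keep the example stdlib-only; any object exporting the buffer protocol should behave similarly). Slicing the view does not copy; a copy happens only when a `str` is actually built.

```python
# A bytearray stands in for externally owned storage (e.g. a tensor's bytes).
storage = bytearray(b"hello world")

view = memoryview(storage)   # no copy: exposes the same memory
word = view[0:5]             # slicing a memoryview is also zero-copy

before = str(word, "ascii")  # the only copy happens here, on decode

# Mutating the underlying storage is visible through the existing view,
# which shows that `word` references the buffer instead of owning a copy.
storage[0] = ord("j")
after = str(word, "ascii")

word_nbytes = word.nbytes
print(before, after, word_nbytes)

# Release the views explicitly so the exporter is free to resize again.
word.release()
view.release()
```

The resulting `str` objects are still copies, as they have to be with the current layout; the zero-copy part is everything up to the decode.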
But is the char content currently accessed via some `data` pointer that is also stored in the string? If so, this `data` pointer field could be used for setting up this indirection. If the current layout has no explicit pointer field, I agree this is impossible.
Noted about the Discourse, thanks!
See `Include/cpython/unicodeobject.h` in the source for the current layout(s). There is indeed no pointer, data follows the header directly.
Even if there was a pointer: should deallocating the string call `free()` on the data? Or should it be `munmap()`?
Thanks for the reference to the relevant header.
/* Non-ASCII strings allocated through PyUnicode_New use the
   PyCompactUnicodeObject structure. state.compact is set, and the data
   immediately follow the structure. */
typedef struct {
    PyASCIIObject _base;
    Py_ssize_t utf8_length;     /* Number of bytes in utf8, excluding the
                                 * terminating \0. */
    char *utf8;                 /* UTF-8 representation (null-terminated) */
} PyCompactUnicodeObject;

/* Object format for Unicode subclasses. */
typedef struct {
    PyCompactUnicodeObject _base;
    union {
        void *any;
        Py_UCS1 *latin1;
        Py_UCS2 *ucs2;
        Py_UCS4 *ucs4;
    } data;                     /* Canonical, smallest-form Unicode buffer */
} PyUnicodeObject;
Aren't `char *utf8;` and `void* data` exactly such explicit pointer fields? (Yeah, I realize that in a typical Python str allocation the actual data buffer follows in memory, but these look like pointers that could theoretically point somewhere else and not just to the trailing data.)
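The "data follows the header" point can be poked at from Python, read-only. The sketch below is CPython-specific and leans on implementation details rather than any supported API: it assumes `id()` returns the object's address and that, for a compact ASCII `str`, `sys.getsizeof()` reports header size plus length plus one terminating byte. The helper name `ascii_payload_offset` is invented for illustration.

```python
import ctypes
import sys


def ascii_payload_offset(s):
    """Offset from the object's address to its character data.

    Hypothetical helper. Assumes CPython and a compact ASCII str, where
    sys.getsizeof(s) == header size + len(s) + 1 (the trailing NUL).
    """
    if not s.isascii():
        raise ValueError("this sketch only handles ASCII strings")
    return sys.getsizeof(s) - len(s) - 1


s = "".join(["zero", "-", "copy"])
offset = ascii_payload_offset(s)

# Read the characters straight out of the object's own allocation.
# Reading only: writing here would break str immutability and hashing.
raw = ctypes.string_at(id(s) + offset, len(s))
print(offset, raw)

# The offset does not depend on the string's length: it is just the header.
same_header = ascii_payload_offset("ab") == ascii_payload_offset("abcdef")
```

If this holds on your build, the characters sit inside the same allocation as the header, so there is no pointer field to redirect for such strings.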
In interop scenarios, it might be useful to be able to have a Python string referencing an existing buffer without copies (e.g. if the underlying char data is stored in NumPy/PyTorch tensors, accessing these char buffers with a standard Python interface is helpful for debugging and sometimes for perf).
I think this might currently be possible with ctypes, by making use of the existing PyUnicode/PyASCIIObject object layout and resetting the size/data fields to my own values.
I agree that the advantage over copying the byte buffer is not very prominent, but it might still be useful in some specific scenarios, e.g. mmap'ing a giant string from a disk file and being able to examine it in an easy way.