Open tiran opened 8 years ago
Can we make PyUnicodeObject truly immutable? As I understand it, PyPy does just that.
No, it's not easily possible. AFAIK PyPy does not implement the trick where a single-reference PyUnicodeObject is mutable as long as it has not escaped into Python space. CPython's PyUnicodeObject is mutable in more ways. For instance, each PyUnicodeObject can hold multiple optional representations of its data, e.g. an additional UTF-8 representation. That case is explained in the paragraph "writeable data members". We can't get rid of the additional members without a major rewrite, API breakage, and a performance decrease.
Although Python's str type is immutable from Python space, it is a C-mutable object. PyUnicodeObject has several writeable members. In fact, PyUnicodeObject's payload itself is writeable from C code when the condition Py_REFCNT(o) == 1 holds. @larryhastings and I agree that a per-object lock for str is too costly. Instead we would like to go with an optimistic global unicode lock (GUL).
Disclaimer: I don't fully understand the details of the current implementation and PEP 393.
https://www.python.org/dev/peps/pep-0393/#specification
Py_hash_t hash
The hash member caches the hash value for hash('somestring'). It is only computed on demand. Since the hash doesn't involve any allocated storage, no locking is required. At worst, two threads compute the same hash value and overwrite each other's result.
writeable data members
wchar_t *wstr (PyASCIIObject)
char *utf8 (PyCompactUnicodeObject)
PyUnicodeObject.data
Write access to any C-mutable member that involves memory allocation must be synchronized by the GUL. Otherwise two threads may set the same pointer, which results in a memory leak of one of the allocated buffers. My gut feeling tells me that conflicts are scarce, so optimistic locking is going to perform better here.
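A minimal sketch of what the optimistic approach could look like for the utf8 member. This is not code from CPython or from the branch: mock_unicode, encode_utf8, mock_as_utf8 and the gul mutex are hypothetical stand-ins, and the use of C11 atomics for the unlocked fast-path read is an assumption made here to keep the sketch free of data races.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the relevant part of PyCompactUnicodeObject. */
typedef struct {
    const char *text;        /* stand-in for the canonical string data */
    _Atomic(char *) utf8;    /* cached UTF-8 buffer, NULL until first use */
} mock_unicode;

/* Hypothetical global unicode lock (GUL). */
static pthread_mutex_t gul = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the real encoder: it only copies the bytes. */
static char *encode_utf8(const mock_unicode *u)
{
    size_t n = strlen(u->text) + 1;
    char *buf = malloc(n);
    if (buf != NULL)
        memcpy(buf, u->text, n);
    return buf;
}

/* Return the cached UTF-8 buffer, creating it on first use. */
const char *mock_as_utf8(mock_unicode *u)
{
    /* Fast path: the member is already set, no lock needed. */
    char *cached = atomic_load_explicit(&u->utf8, memory_order_acquire);
    if (cached != NULL)
        return cached;

    /* Not set: compute the UTF-8 value without holding the GUL. */
    char *fresh = encode_utf8(u);
    if (fresh == NULL)
        return NULL;

    /* Take the GUL and check whether another thread set the member
       in the mean time. */
    pthread_mutex_lock(&gul);
    cached = atomic_load_explicit(&u->utf8, memory_order_relaxed);
    if (cached == NULL) {
        /* Still NULL: publish our buffer. */
        atomic_store_explicit(&u->utf8, fresh, memory_order_release);
        cached = fresh;
        fresh = NULL;
    }
    pthread_mutex_unlock(&gul);

    /* If another thread won the race, discard our copy.
       free(NULL) is a no-op when we published ours. */
    free(fresh);
    return cached;
}
```

The expensive part (encoding) happens outside the lock, so the GUL is only held for a pointer comparison and a store. The cost is an occasional wasted encode when two threads race on the same string.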
- Check whether the utf8 member is already set.
- If the utf8 member is not set, compute the UTF-8 value.
- Check whether another thread has set the utf8 member in the mean time.
- If the utf8 member is still NULL, set the member.
- If the utf8 member has been set by another thread, discard and free the UTF-8 value.
special casing of Py_REFCNT() == 1
Python's str uses a special case to optimize string concatenation and in _PyUnicodeWriter. As far as I am able to figure out _PyUnicodeWriter, it requires the special case to work. I'm not yet sure how to handle this special case. I have been considering a new flag, constructable, which can be set if-and-only-if a PyUnicodeObject is in C API calls in a single thread. struct state has 24 unused bits left.
WIP branch
I have started a branch but gave up after a couple of hours, https://github.com/tiran/gilectomy/tree/gul