larryhastings / gilectomy

Gilectomy branch of CPython. Use "gilectomy" branch in git. Read the important, short README below!

GUL -- Global Unicode Lock #13

Open tiran opened 8 years ago

tiran commented 8 years ago

Although Python's str type is immutable from Python space, it is a C-mutable object. PyUnicodeObject has several writeable members. In fact, the payload of a PyUnicodeObject is itself writeable from C code when Py_REFCNT(o) == 1. @larryhastings and I agree that a per-object lock for str is too costly. Instead we would like to go with an optimistic global unicode lock.

Disclaimer: I don't fully understand the details of the current implementation and PEP 393.

https://www.python.org/dev/peps/pep-0393/#specification

Py_hash_t hash

The hash member caches the hash value for hash('somestring'). It is only computed on demand. Since caching the hash doesn't involve any memory allocation, no locking is required. At worst, two threads compute the same hash value and overwrite each other's result.
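To make the argument concrete, here is a minimal sketch of a lock-free hash cache. It is not CPython's code: toy_str, toy_compute_hash and toy_str_hash are hypothetical names, FNV-1a stands in for the real string hash, and the C11 relaxed atomics are an assumption about how the field could be accessed without a data race.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant part of PyUnicodeObject. */
typedef struct {
    const char *data;
    size_t length;
    _Atomic int64_t hash;   /* -1 means "not computed yet" */
} toy_str;

/* FNV-1a, used only as a deterministic placeholder hash function. */
static int64_t toy_compute_hash(const char *data, size_t length)
{
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < length; i++) {
        h ^= (unsigned char)data[i];
        h *= 1099511628211ULL;
    }
    /* Drop one bit so the result is non-negative and can never be -1. */
    return (int64_t)(h >> 1);
}

/* Return the cached hash, computing it on first use. No lock is taken:
 * if two threads race, both compute the same value and both store it. */
static int64_t toy_str_hash(toy_str *s)
{
    assert(s != NULL);
    int64_t cached = atomic_load_explicit(&s->hash, memory_order_relaxed);
    if (cached != -1)
        return cached;
    int64_t computed = toy_compute_hash(s->data, s->length);
    atomic_store_explicit(&s->hash, computed, memory_order_relaxed);
    return computed;
}
```

The race is harmless only because the stored value is a pure function of the string's contents and no buffer is allocated for it.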

writeable data members

Write access to any C-mutable member that involves memory allocation must be synchronized by the GUL. Otherwise two threads may set the same pointer, which results in a memory leak of one of the allocated buffers. My gut feeling is that conflicts are rare, so optimistic locking is likely to perform better here.
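One possible reading of "optimistic" here is sketched below: read the member without the lock, build the buffer outside the lock, and take the global lock only to install it. This is an illustration under assumptions, not the design from the WIP branch. toy_gul, toy_cached_str and toy_str_utf8 are hypothetical names, a pthread mutex stands in for the GUL, and a plain byte copy stands in for a real UTF-8 encoder.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical global unicode lock shared by all string objects. */
static pthread_mutex_t toy_gul = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical stand-in for a str with a lazily allocated UTF-8 cache. */
typedef struct {
    const char *data;
    size_t length;
    _Atomic(char *) utf8;   /* NULL until the cache is filled */
} toy_cached_str;

/* Return the cached buffer, creating it on first use.
 * Returns NULL only if memory allocation fails. */
static const char *toy_str_utf8(toy_cached_str *s)
{
    assert(s != NULL);

    /* Fast path: the cache is already filled, no lock needed. */
    char *cached = atomic_load_explicit(&s->utf8, memory_order_acquire);
    if (cached != NULL)
        return cached;

    /* Optimistic part: build the buffer without holding the lock. */
    char *candidate = malloc(s->length + 1);
    if (candidate == NULL)
        return NULL;
    memcpy(candidate, s->data, s->length);
    candidate[s->length] = '\0';

    /* Install under the lock; re-check in case another thread won. */
    pthread_mutex_lock(&toy_gul);
    cached = atomic_load_explicit(&s->utf8, memory_order_relaxed);
    if (cached == NULL) {
        atomic_store_explicit(&s->utf8, candidate, memory_order_release);
        cached = candidate;
        candidate = NULL;
    }
    pthread_mutex_unlock(&toy_gul);

    /* The losing thread frees its own buffer, so nothing leaks. */
    free(candidate);
    return cached;
}
```

The lock is held only for a pointer comparison and a store, so contention stays low as long as conflicts are as rare as assumed. The cost is an occasional wasted allocation when two threads race.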

Python's str uses a special case to optimize string concatenation, and the same special case is used in _PyUnicodeWriter. As far as I can tell, _PyUnicodeWriter requires the special case to work. I'm not yet sure how to handle it. I have been considering a new flag, constructable, which can be set if and only if a PyUnicodeObject is being used by C API calls in a single thread. The state struct has 24 unused bits left.
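For readers unfamiliar with the special case, the following single-threaded sketch shows the general idea: when the caller holds the only reference, the buffer is grown in place, otherwise a new object is built. It is a toy model, not the CPython implementation. toy_obj, toy_new and toy_append are hypothetical names, and the real check involves more conditions than the reference count.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, heavily simplified string object. */
typedef struct {
    long refcnt;
    size_t length;
    char *data;     /* NUL-terminated payload */
} toy_obj;

/* Create a new object holding a copy of text, with one reference. */
static toy_obj *toy_new(const char *text)
{
    toy_obj *o = malloc(sizeof *o);
    if (o == NULL)
        return NULL;
    o->length = strlen(text);
    o->data = malloc(o->length + 1);
    if (o->data == NULL) {
        free(o);
        return NULL;
    }
    memcpy(o->data, text, o->length + 1);
    o->refcnt = 1;
    return o;
}

/* Append suffix to *left. Returns 0 on success, -1 if allocation fails. */
static int toy_append(toy_obj **left, const char *suffix)
{
    assert(left != NULL && *left != NULL);
    toy_obj *o = *left;
    size_t extra = strlen(suffix);

    if (o->refcnt == 1) {
        /* Sole owner: nobody else can observe the change, grow in place. */
        char *grown = realloc(o->data, o->length + extra + 1);
        if (grown == NULL)
            return -1;
        memcpy(grown + o->length, suffix, extra + 1);
        o->data = grown;
        o->length += extra;
        return 0;
    }

    /* Shared: leave the old object untouched and build a new one. */
    toy_obj *fresh = malloc(sizeof *fresh);
    if (fresh == NULL)
        return -1;
    fresh->data = malloc(o->length + extra + 1);
    if (fresh->data == NULL) {
        free(fresh);
        return -1;
    }
    memcpy(fresh->data, o->data, o->length);
    memcpy(fresh->data + o->length, suffix, extra + 1);
    fresh->length = o->length + extra;
    fresh->refcnt = 1;
    o->refcnt--;    /* give up our reference to the shared object */
    *left = fresh;
    return 0;
}
```

The sketch deliberately leaves open how the refcnt check would be made safe without the GIL, which is the open question in the paragraph above.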

WIP branch

I started a branch but gave up after a couple of hours: https://github.com/tiran/gilectomy/tree/gul

DemiMarie commented 7 years ago

Can we make PyUnicodeObject truly immutable? As I understand it PyPy does just that.

tiran commented 7 years ago

No, that is not easily possible. AFAIK PyPy does not implement the trick where a single-reference PyUnicodeObject is mutable as long as it has not escaped into Python space. CPython's PyUnicodeObject is mutable in more ways. For instance, each PyUnicodeObject can hold multiple optional representations of its data, e.g. an additional UTF-8 representation. That case is explained in the paragraph "writeable data members". We can't get rid of the additional members without a major rewrite, API breakage and a performance decrease.