larryhastings / gilectomy

Gilectomy branch of CPython. Use "gilectomy" branch in git. Read the important, short README below!

[idea] Thread-local incr/decr #28

Open pepijndevos opened 8 years ago

pepijndevos commented 8 years ago

I just watched your PyCon talk about this project, and I had an idea to reduce cache misses due to atomic incr/decr.

What if you introduced a thread-local refcount and a thread-global thread-refcount?

The idea would be that the number of threads that access an object rarely changes. So if a thread wants to change the refcount it can do so locally and quickly, and only when its refcount drops to zero does it need to do an atomic decr on the thread-refcount and free if it's 0. If it's non-zero it's the job of the remaining threads to clean up once their local refcount drops to 0.
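A minimal C sketch of the two-level count described above, not code from the gilectomy branch. The names `obj_t`, `local_ref_t`, `local_incref` and `local_decref` are hypothetical, and the sketch assumes each thread already has its own slot for the object (how a thread finds that slot cheaply is left open). The `freed` flag stands in for running the destructor.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical object header: the only shared counter is the number of
 * threads that currently hold at least one reference. */
typedef struct {
    atomic_int thread_refcount;
    bool freed;                 /* stands in for calling the destructor */
} obj_t;

/* Hypothetical per-thread, per-object slot. Only the owning thread
 * touches local_refcount, so it is a plain int. */
typedef struct {
    obj_t *obj;
    int local_refcount;
} local_ref_t;

static void local_incref(local_ref_t *r)
{
    /* Only the 0 -> 1 transition touches the shared counter. */
    if (r->local_refcount++ == 0)
        atomic_fetch_add(&r->obj->thread_refcount, 1);
}

static void local_decref(local_ref_t *r)
{
    /* Only the 1 -> 0 transition touches the shared counter. */
    if (--r->local_refcount == 0) {
        /* atomic_fetch_sub returns the previous value, so 1 means
         * this was the last thread holding the object. */
        if (atomic_fetch_sub(&r->obj->thread_refcount, 1) == 1)
            r->obj->freed = true;
    }
}
```

With two slots `a` and `b` on the same object, any number of `local_incref(&a)` calls costs one atomic operation in total; the object is released only after both slots have dropped back to zero.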

As you mentioned, most objects are only ever used in one thread. So in addition to the above concept, which would still require 2 atomic operations at creation and destruction, the thread id of the object creator could be stored so that touching the thread-global counter can be deferred until another thread incr's the object, in which case it'd need to be set to 2.
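The creator-thread shortcut could look roughly like the C sketch below. Everything here is hypothetical (`owned_obj_t`, `thread_acquire`, `thread_release`, small integer thread ids), and it assumes each function is called only when a thread's local count crosses zero and that the owner acquires exactly once, at creation. It deliberately ignores the race where the owner drops its last reference while another thread is taking its first one; the proposal would need something extra to close that window.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical header: thread_refcount stays 0 for as long as only the
 * creating thread has ever held the object. */
typedef struct {
    int owner_id;               /* id of the creating thread */
    atomic_int thread_refcount; /* 0 = never shared */
} owned_obj_t;

/* Called when thread `tid` takes its first reference to `o`. */
static void thread_acquire(owned_obj_t *o, int tid)
{
    if (tid == o->owner_id)
        return;                 /* owner: nothing shared to update */

    int expected = 0;
    /* First foreign thread: publish 2 (owner + this thread). */
    if (!atomic_compare_exchange_strong(&o->thread_refcount, &expected, 2))
        atomic_fetch_add(&o->thread_refcount, 1);  /* already shared */
}

/* Called when thread `tid` drops its last reference to `o`.
 * Returns true if the caller should free the object. */
static bool thread_release(owned_obj_t *o, int tid)
{
    /* Never shared: the owner frees after a plain load, with no
     * atomic read-modify-write. */
    if (tid == o->owner_id && atomic_load(&o->thread_refcount) == 0)
        return true;

    /* Shared: the thread that takes the count from 1 to 0 frees. */
    return atomic_fetch_sub(&o->thread_refcount, 1) == 1;
}
```

For an object that never leaves its creating thread, neither function performs an atomic read-modify-write, which is the case the comment above is trying to make cheap.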

[edit] Actually local storage in C works nothing like threading.local which is implemented using a dict. That makes it a lot slower I guess. There is probably a lot more I overlooked.

JustAMan commented 8 years ago

I think someone at the sprints was already working on this, and @larryhastings was working on buffered refcounts (again to get rid of atomics, but in another way).

Just to raise the priority of this issue: I did some profiling (running the x.py benchmark), and it looks like roughly 66% of the time spent in PyEval_EvalFrameEx itself (excluding its callees) goes to atomic increfs or decrefs! If that could be brought down to zero, it would speed things up greatly.

The same profile shows that atomic refcounting takes at least as large a share of the time in call_function, fast_function, PyFrame_New and frame_dealloc (in some of those, atomic operations take more than 90% of the time!).

JustAMan commented 8 years ago

@larryhastings - what's the status of the refcount rework? This seems to be the biggest slowdown so far...

mysteryjeans commented 8 years ago

I came here to post this same idea, and it seems valid as a concept.

Just to be clear: when incrementing the object's refcount on the local thread, first check whether it is zero; if so, first do an atomic increment of the thread-refcount on the object. Similarly, after decrementing the object's refcount on the local thread, check whether it has become zero. If so, do an atomic decrement of the thread-refcount, then check whether the thread-refcount has become zero and, if it has, call the destructor in that same thread.