larryhastings / gilectomy

Gilectomy branch of CPython. Use "gilectomy" branch in git. Read the important, short README below!

Rw locks in dicts #33

Open JustAMan opened 8 years ago

JustAMan commented 8 years ago

This introduces RW locks for unicode-keys-only dictionaries, where they should be safe. These changes bring about a 20% decrease in the CPU time consumed by x.py.

This PR is related to #30

larryhastings commented 8 years ago

Thanks for doing this experiment!

I'm not convinced that this will make any difference in real-world Python programs. As we all know, x.py is a bad benchmark; all it does is

Are you willing to maintain this as a separate PR for a while, until the gilectomy branch can run a wider range of programs, and we can get a sense for how important this is in the general case?

larryhastings commented 8 years ago

Also, what will happen in your read/write lock if you read-lock the dictionary first then attempt a write-lock?

JustAMan commented 8 years ago

I'm not convinced that this will make any difference in real-world Python programs.

I've implemented quite a basic locking here, but I believe it should make a difference for any program that accesses essentially any unicode-keyed dictionaries - those include builtins, globals, **kw-style function arguments and even object attributes!

Also, what will happen in your read/write lock if you read-lock the dictionary first then attempt a write-lock?

The implementation idea itself is simple: I've added a "reader count" variable protected by a separate lock. When the reader count goes from 0 to 1, the write lock is grabbed; when it drops back to 0, the write lock is released. So while some thread is executing a read-only operation the dictionary is still write-locked, and any other thread that wants to change the dictionary has to wait on that write lock. The effect I was aiming for is removing contention on the write lock for threads doing read-only work.

That's also the reason I did not change the "locking protocol" (the tp_lock/tp_unlock attributes) - they are left as general, write-level locks.

Are you willing to maintain this as a separate PR for a while, until the gilectomy branch can run a wider range of programs, and we can get a sense for how important this is in the general case?

I think I will be able to keep this up to date with other development - I don't think dictobject will change a lot, will it? :)

larryhastings commented 8 years ago

I've implemented quite a basic locking here, but I believe it should make a difference for any program that accesses essentially any unicode-keyed dictionaries - those include builtins, globals, **kw-style function arguments and even object attributes!

Of course it does! But is that difference positive or negative? ;-)

In your version, dicts are now 8 bytes bigger (reader lock and reader count), and you've added some extra code including if statements. This isn't much overhead, but per-dict and per-dict-operation it adds up in a Python interpreter.

My gut instinct is that contention is low for read operations on nearly all dicts in the general case. So my guess is this isn't worth the tradeoff. But that's why I want to see this tested against more than x.py.

(True story: the original lock in dictobject.c was a read-write lock. But I changed it to a simpler lock, in part because of this gut instinct above.)

As far as my question goes, I meant: what if thread A grabs a read lock on dict D, then while still holding that read lock, attempts to grab a write lock on dict D? Because I bet I could make that happen from user code. And remember my mantra: "any time there is behavior in Python, somebody depends on it."

JustAMan commented 8 years ago

As far as my question goes, I meant: what if thread A grabs a read lock on dict D, then while still holding that read lock, attempts to grab a write lock on dict D?

That should not be possible, from what I got reading the source... It's the exact reason I restricted the RW lock to unicode-keyed dicts: so I can be sure that a lookup in such a dictionary does not execute some random bytecode, only the unicode comparison function, which essentially compares memory byte by byte. If you can make the situation above happen then this might hang :) But I don't see how you can make it happen...

My gut instinct is that contention is low for read operations on nearly all dicts in the general case. So my guess is this isn't worth the tradeoff. But that's why I want to see this tested against more than x.py.

I fully agree that we need more benchmarks, as x.py is not very representative... so I'll try to keep up with the gilectomy branch up to the point where we can run something wider, like the Grand Unified benchmark or something.

larryhastings commented 8 years ago

I've been contemplating this patch for a while now. Today I did another thorough review. I think the patch isn't safe--and I'm starting to worry that the current dict implementation isn't safe, either. The problem occurs when a dict transitions from one "kind" to another--that is, from one lookup function to another.

Let's say we have a dict with only Unicode strings in it. For the sake of argument we'll declare that the current function is lookdict_unicode. At the moment there are no threads interacting with this dict.

  1. Thread 1 decides to do a SetItem on the dict, setting a non-Unicode key. It locks the dict (ma_lock).
  2. While the dict is still locked, Thread 2 decides to do a GetItem on the dict. It calls into the current lookup function, lookdict_unicode. This calls dict_lock_readonly. In your current implementation, this will lock ma_readerlock, then wait on ma_lock.
  3. Thread 1 then finishes its insert and releases ma_lock.
  4. Thread 2 now acquires ma_lock. dict_lock_readonly finishes, and lookdict_unicode proceeds to examine the dict, even though the dict is no longer a Unicode-only dict. If it encounters the non-Unicode key as it probes the dict, it will very likely crash.

This also made me realize that ma_keys->dk_lookup is mutable state, and might need some synchronization around it.

JustAMan commented 8 years ago

Indeed... Though this particular unsafety is easy to fix, I think - we just need to check again for a lookup-function change after we have taken ma_lock. I guess we'd only need to check it in the Unicode version of the lookup, as I think a dict typically doesn't transition from non-string to string lookup except on copying...

joshring commented 7 years ago

This seems like a good idea in principle, have you had any luck making it threadsafe?

joshring commented 7 years ago

How about a more radical approach, where the object is not locked at all for read only access.

Objects are only locked if a thread attempts to write to the object. This would speed things up far more than the reported 20% and ensure correct behaviour.

This would give two states:

  1. Read-only, non-locking, as no writes were attempted (default, very fast)
  2. Write/Read on an object (slower due to locking)

This means that lookup can be made very much faster and if writes can be buffered we can get very efficient run-time performance.

I was wondering if the locking could be done on groups of elements to improve efficiency? Probably the only way to do that would be to divide a list into smaller lists, with the associated storage overhead that incurs. I am assuming, however, that with a long enough list this could scale much better than globally locking the whole thing.

It also means that if a smaller section of the list is changed, not every part would need to be reallocated. So that would greatly improve list write speed, lookup speed, and locking performance with more cores and longer lists, at the cost of memory overhead. Assuming the number of blocks equals the number of CPUs, this overhead could be fairly minimal and still give a significant boost to performance.

JustAMan commented 7 years ago

You cannot omit locking completely; you still need some synchronization mechanism. To give you a hint, ask yourself a question: how would your code behave if there are two threads - one reading and one writing - and the reading thread started its read first, and then the writing thread came and changed something? Without any synchronization your reading thread might crash, read garbage, or work - depending on how lucky you are. Not a good symptom for any program, but especially for core features of a runtime, eh?..

joshring commented 7 years ago

First situation: two concurrent threads access an element almost simultaneously; one writes to an atomic bool to lock the data, then proceeds to write to the data. The reading thread does an atomic load of the bool to check whether it is locked, and waits until the data is unlocked to read.

Second situation: two concurrent threads access an element almost simultaneously; one gets halfway through reading when the other thread takes the lock (a torn read). The lock state is checked again by the reading thread at the end of the read (cheap) and is found to have changed. The read value is discarded and the reader waits on the write lock before attempting to read again. This repeats until a valid read occurs (defined as no change in the lock being detected).

We can check for no change of state by incrementing an integer when locking and unlocking (odd numbers are unlocked, even numbers are locked), so a thread checks that the number is unchanged before accepting a read value.

Third situation: two concurrent threads access an element and do not change its state; no locks are used.


joshring commented 7 years ago

A very interesting article about how to implement a faster solution to the RW-lock problem we're having. https://locklessinc.com/articles/locks/

Note especially the last solution, at the bottom of the page.

Public-domain code (as in, no explicit licence), linked from the comments on the article above: https://github.com/wiredtiger/wiredtiger/blob/develop/src/support/mtx_rw.c

joshring commented 7 years ago

(benchmark image: spinlock_with_backoff_incremented75cycles)

Crude prototype in C implementing a "dumb spinlock" with a "back-off" for threads querying the lock: the wait starts at 1 CPU cycle and, while the data stays locked, grows in 75-cycle increments. Repeated lock queries eat memory bandwidth, which strangles multi-core chips, so backing off reduces that traffic. 75 cycles seems reasonable given the cost of these queries and the latencies involved in updating the atomic variables which gate access to the data.

An order of magnitude justification for ~75 cycles was found here: http://stackoverflow.com/questions/4087280/approximate-cost-to-access-various-caches-and-main-memory

This was done on a 4-core i5 4790k, a chip without HT, using pthreads for threading. Surprisingly, the single-threaded case with the lock outperformed the case without it.

If you want to try it, here's the prototype: https://gist.github.com/joshring/b321fa162d19f8006c0c5d5395965f0b

joshring commented 7 years ago

I have created a prototype gist with "lock-free" reading, which checks that the state did not change during the read, and re-reads otherwise. Writes are locking.

Read performance goes up the more cores you throw at it; writing has the opposite trend.

https://gist.github.com/joshring/5c9ebe0654085b68dcc1fd03bc1c1c2e

(benchmark chart: writing_and_reading)

JustAMan commented 7 years ago

To me it seems that what you're describing is a simple versioning scheme. Keeping a forever-atomically-increasing version field seems better to me (no proof, just gut feeling).

Feel free to hack on my approach - I think what this PR is missing right now is guarding against the lookup-field change noted by @larryhastings (see above); otherwise it should be sane enough.

joshring commented 7 years ago

Atomically increasing is fine, but the performance of atomic operations is really horrible, so if we can avoid them in a "versioning scheme" it drastically improves performance. Even in C, factors of >50x are typical for reading; I have seen as much as 140x.

joshring commented 7 years ago

To me it seems that what you're describing is a simple versioning scheme. Keeping a forever-atomically-increasing version field seems better to me (no proof, just gut feeling).

Challenge accepted: it seems I was using a Seqlock-like approach, https://en.wikipedia.org/wiki/Seqlock

This is a worst-case benchmark of a highly contended lock, from which we wish to read and write with multiple threads. The Seqlock has very cheap reads, so it scales very well with threads even in the presence of writes.

The standard OpenMP lock was used in two configurations: with the read inside the writing lock, and with the read in a separate lock. In both cases the seqlock, with the read outside of any lock, outperformed the OpenMP lock.

(benchmark chart: writing_and_reading_11_3_17)

The code has now been significantly cleaned up and improved: Code: https://gist.github.com/joshring/b1b42841f14abe76999f448348256638

larryhastings commented 7 years ago

From the Wikipedia page on seqlocks:

"One subtle issue of using seqlocks for a time counter is that it is impossible to step through it with a debugger. The retry logic will trigger all the time because the debugger is slow enough to make the read race occur always."

Uh, I don't want to make it impossible to debug race conditions in Python.

"It should also be noted that the technique will not work for data that contains pointers, [...]"

Dict objects contain pointers.

joshring commented 7 years ago

Debugging appears to work just fine in this implementation.

EDIT: I only meant it was similar to the Seqlock, but it appears different internally.

Assuming the pointers are updated atomically, it should be OK to do this with any lock we choose.

On Sat, Mar 11, 2017 at 4:33 PM, Josh Ring joshringuk@gmail.com wrote:

In the general locking case this could still be a problem? Examine the following:

  1. Lock set; Thread1 reads pointerA; lock unset.
  2. Lock set; Thread2 writes pointerA (Thread1's copy is atomically updated); lock unset.

The pointer Thread1 has read is invalid while the update is occurring, so it is at least temporarily invalidated. I think the problem of pointers is a very general one for multi-threaded locking code. What would you suggest to avoid this issue with pointers in a multi-threaded environment? You could serialise access to dicts, but that would be significantly slower than single-core execution.

I am not sure about the debugging comment, since it appears to be specific to the Linux kernel's "time counter" implementation; I will do some experimentation. Race conditions are known to be difficult to detect generally, regardless of locking technique, so there is no "silver bullet" here.
