The common race condition is fixed by recognizing that Client.respHandler can be removed entirely: all respHandler operations (get, put, and invocation) can be moved into the Client.processLoop goroutine, so no locking is required at all. Previously, this race condition resulted in a deadlock that was only masked by the response timeout at the end of client.Do, which returned ErrLostConn.
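To illustrate the lock-free shape this enables, here is a minimal, self-contained sketch, assuming a much-simplified client; processLoop, handlerReg, and the channel names here are illustrative only, not the library's actual API. Registration, lookup, and invocation all happen inside the one goroutine that owns the map:

```go
package main

import "fmt"

type response struct {
	key  string
	data string
}

type handlerReg struct {
	key string
	fn  func(response)
}

// processLoop owns the handler map exclusively; because put, get, and
// invocation all run on this goroutine, no mutex is needed.
func processLoop(regs <-chan handlerReg, resps <-chan response, done <-chan struct{}) {
	handlers := make(map[string]func(response))
	for {
		select {
		case reg := <-regs:
			handlers[reg.key] = reg.fn // "put"
		case resp := <-resps:
			if h, ok := handlers[resp.key]; ok { // "get" and invocation
				h(resp)
			}
		case <-done:
			return
		}
	}
}

func main() {
	regs := make(chan handlerReg)
	resps := make(chan response)
	done := make(chan struct{})
	result := make(chan string)

	go processLoop(regs, resps, done)

	// A caller (think client.Do) registers its handler by message,
	// not by locking a shared map.
	regs <- handlerReg{key: "job-1", fn: func(r response) { result <- r.data }}
	resps <- response{key: "job-1", data: "dtJobCreated"}

	fmt.Println(<-result) // prints "dtJobCreated"
	close(done)
}
```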
The rarer race condition is fixed by changing responseHandlerMap.get to .getAndRemove. This race resulted in the innerHandler for a new dtJobCreated, assigned in client.Do, overwriting a stale handler from an older dtJobCreated request, after which the newer innerHandler was removed by the older dtJobCreated in client.processLoop > client.handleInner. By the time the newer dtJobCreated response arrived, its handler had already been deleted. This too was only masked by the response timeout at the end of client.Do, which returned ErrLostConn.
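For reference, a hedged sketch of the getAndRemove idea, assuming a simplified responseHandlerMap (the real type in the library differs): the fetch and the delete happen under a single lock acquisition, so a handler can be claimed exactly once.

```go
package main

import (
	"fmt"
	"sync"
)

type responseHandler func(data string)

type responseHandlerMap struct {
	mu       sync.Mutex
	handlers map[string]responseHandler
}

func newResponseHandlerMap() *responseHandlerMap {
	return &responseHandlerMap{handlers: make(map[string]responseHandler)}
}

func (m *responseHandlerMap) put(key string, h responseHandler) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.handlers[key] = h
}

// getAndRemove atomically claims the handler: fetch and delete happen
// under one lock acquisition, so a stale reader can never delete a
// newer handler registered under the same key.
func (m *responseHandlerMap) getAndRemove(key string) (responseHandler, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	h, ok := m.handlers[key]
	if ok {
		delete(m.handlers, key)
	}
	return h, ok
}

func main() {
	m := newResponseHandlerMap()
	m.put("job-1", func(data string) { fmt.Println("handled:", data) })

	if h, ok := m.getAndRemove("job-1"); ok {
		h("created")
	}
	if _, ok := m.getAndRemove("job-1"); !ok {
		fmt.Println("handler already claimed, as expected")
	}
}
```

Because the claim is atomic, there is no window between get and remove in which another request can register a handler that is then wrongly deleted.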
I have added examples/pl/worker_multi.pl, which spins up a pool of workers for a new integration test that ferrets out these race conditions. In more depth:
The first race condition occurs in the following scenario, sketched in code after the list:

1. Req A: client.Do is called; execution pauses at the end of client.Do, waiting for the dtJobCreated result.
2. Req A: the server sends dtJobCreated; the handle is created.
3. Req A: the client.Do call completes. dtWorkComplete for Req A has not yet been sent.
4. Req B: client.Do is called and locks client.respHandler.
5. Req A: the server sends dtWorkComplete for Req A. client.processLoop cannot invoke the callback because respHandler is locked by the goroutine in step 4. Deadlock.
6. Req B: client.Do invokes client.do; a new request is written to the server, but its innerHandler callback will never be called because client.processLoop is parked attempting to lock respHandler.
7. Req B: client.do() times out after client.ResponseTimeout seconds, releasing the respHandler lock and returning ErrLostConn. This works around the deadlock rather than fixing it.
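Here is a minimal, runnable sketch of that deadlock pattern, with illustrative names standing in for client.Do, client.processLoop, and client.ResponseTimeout; this is not the library's code, just the hazard reduced to its essentials:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var errLostConn = errors.New("lost connection: timeout worked around the deadlock")

func main() {
	var respHandler sync.Mutex
	created := make(chan struct{})

	// Stand-in for client.Do (Req B): it takes the lock, then waits for a
	// result that only the processing loop can deliver.
	respHandler.Lock()

	// Stand-in for client.processLoop: it must acquire respHandler before
	// it can invoke the callback that signals `created` (Req A's
	// dtWorkComplete), so it parks here.
	go func() {
		respHandler.Lock()
		defer respHandler.Unlock()
		close(created)
	}()

	select {
	case <-created:
		fmt.Println("callback invoked") // unreachable while Do holds the lock
	case <-time.After(500 * time.Millisecond): // client.ResponseTimeout stand-in
		fmt.Println(errLostConn)
	}
	respHandler.Unlock() // only now can the processing loop make progress
}
```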
The second race condition (much rarer) occurs in the following scenario, again sketched in code after the list:

1. Req A: client.Do is called and sends the request.
2. Req A: the server sends dtJobCreated; client.processLoop calls client.handleInner, which gets the `"c"` callback and invokes it via `h(resp)`; goroutine execution is suspended before the callback is removed.
3. Req A: the client.Do call completes. All locks are relinquished.
4. Req B: client.Do is called and assigns a new innerHandler callback `"c"`, which overwrites Req A's `"c"` callback (not yet removed in step 2).
5. Req A: step 2 execution resumes, deleting Req B's `"c"` callback on the assumption that it is Req A's.
6. Req B: the server sends dtJobCreated; in `h, ok := client.innerHandler.get(key)`, `ok` evaluates to false because Req B's callback was wrongly deleted in step 5.
7. Req B: client.do() times out after client.ResponseTimeout seconds, returning ErrLostConn. This works around the race condition.
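The interleaving can be replayed sequentially. This illustrative-only sketch walks the steps above against a plain map (the real innerHandler differs), showing why a separate get and remove loses Req B's handler, and why the atomic getAndRemove sketched earlier cannot:

```go
package main

import "fmt"

type handlers map[string]func(string)

func main() {
	inner := handlers{}

	// Req A registers its dtJobCreated callback "c".
	inner["created"] = func(resp string) { fmt.Println("A:", resp) }

	// Buggy interleaving, written sequentially for clarity:
	h := inner["created"] // step 2: handleInner gets A's callback...
	h("job created")      // ...invokes it, then is suspended before removal

	// step 4: Req B overwrites the not-yet-removed key with its own callback
	inner["created"] = func(resp string) { fmt.Println("B:", resp) }

	// step 5: step 2 resumes and deletes the key, destroying B's callback
	delete(inner, "created")

	// step 6: B's dtJobCreated arrives, but the handler is gone
	if _, ok := inner["created"]; !ok {
		fmt.Println("B's handler lost; client.do would time out with ErrLostConn")
	}
}
```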