Fix a deadlock issue using search asynchronously

go-ldap / ldap

Basic LDAP v3 functionality for the GO programming language.

Other

2.25k stars 355 forks source link

Fix a deadlock issue using search asynchronously #446

Closed t2y closed 1 year ago

t2y commented 1 year ago

The previous change is merged from #440.

I'm implementing #422 using the search asynchronous feature. I found a deadlock issue with the low-sized buffered channel in my local environment. When the buffered channel is full, a goroutine waits to send to the channel. But sometimes, searchResponse.Next() occurs deadlock since the messageMutex cannot get a lock in Conn.GetLastError().

While working at #440, I found Conn.err is a race condition. So I added messageMutex.Lock() code, which causes a deadlock issue with certain conditions (e.g., a low-sized buffered channel). That makes a more complicated problem.

~This code is young, and we can revert quickly.~

cpuschma commented 1 year ago

I can't follow your thoughts. Removing the mutex check from GetLastError is not ideal in my opinion. Can you explain/do you know what caused the messageMutex to hang? Because a slow reader and therefore a blocked buffered channel is something we cannot account for.

t2y commented 1 year ago

@cpuschma I changed GetLastError() method to lock for searchResponse.Next() in #440. See below.

https://github.com/go-ldap/ldap/pull/440/commits/9de41ced83456b5370af3765bf416db47052328f

So I just revert my previous changes since searchResponse.Next() doesn't handle it.

t2y commented 1 year ago

Another problem. This failure is rarely a race condition, even if the messageMutex locks. I guess it doesn't reproduce after running again.

https://github.com/go-ldap/ldap/actions/runs/5599882841/jobs/10241454178?pr=447

cpuschma commented 1 year ago

We cannot guarantee that the the Connection.err is guarded atomically without a mutex. Again, that's not an option.

I still don't see what the exact problem is, maybe because of the language barrier. Lets break things down:

You establish a connection to the directory server
You submit a search request with a buffered channel of N. In the internals, the library locks the mutex, encodes and sends the message and unlocks the mutex
The incoming response are read in a seperate goroutine, which also shortly utilizes the mutex
The entries are send to the channel one by one for a consumer to process it. No locking of the messageMutex

What am I missing here? :sweat:

t2y commented 1 year ago

Connection.err is guarded atomically without a mutex. Again, that's not an option.

I see. I agree with using mutex to get Connection.err. But the previous code didn't have mutex before I worked #440. Do you think it still needs mutex?

After #440, two goroutine access to Connection.err, and rarely it occurs a data race condition (see above CI failure), and sometimes it occurs a deadlock issue. Because searchResponse.Next() access to Connection.err. As the same as you, I thought no problem when I was working on #440, but it's difficult to imagine since it's a timing issue.

You can probably reproduce easily. This is how to reproduce in my environment.

set 1 to a buffered channel
call SearchAsync() to get about a thousand entries
run several times

Additionally, searchResponse.Next() doesn't need to access Connection.err. At that time, I just thought it was a shortcut to get an error. If this code does not exist, the error occurs in the search process next time.

cpuschma commented 1 year ago

Agreed @t2y :+1:

Admittedly, the missing mutex for GetLastError was an oversight of me. At the time it wasn't designed for parallelism like SearchAsync in mind. Can you adjust your branch to keep the mutex but remove the GetLastError calls from the Next method?

t2y commented 1 year ago

@cpuschma Thanks for your review. I follow you.