llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
31.62k stars 13.05k forks source link

[lldb-dap] Race condition in the event handler thread #131242

Open JDevlieghere opened 1 week ago

JDevlieghere commented 1 week ago

I discovered that TestDAP_breakpointEvents.py (macOS only) was failing nondeterministically. I tracked the problem down to the event handler thread (EventThreadFunction) where the check that a breakpoint has the DAP label would sometimes return false, even though the breakpoint in question was set through DAP.

I confirmed it's a race condition with the following assert:

bool is_dap = bp.MatchesName(BreakpointBase::GetBreakpointLabel());
assert(is_dap == bp.MatchesName(BreakpointBase::GetBreakpointLabel()));

The problem is that the operation of creating a breakpoint and adding the label are not atomic. Both SB API operations are protected by the target API lock. Here's what's going on:

  1. The main thread receives a request to create a breakpoint with setBreakpoints. This calls SBTarget::BreakpointCreateByLocation which acquires the target API lock, creates the breakpoint and releases it.
  2. The event thread receives an eBreakpointEventTypeLocationsResolved event and prepares to send a breakpoint event, which involves checking if the breakpoint has the DAP label. This calls SBBreakpoint::MatchesName which acquires the target API lock, doesn't find the label, and releases it.
  3. Meanwhile on the main thread, we're trying to add the label in dap::Breakpoint::SetBreakpoint. This calls SBBreakpoint::AddName which waits on the API lock that's currently held by the event thread checking if the label is set.

In other words, creating the breakpoint and adding the label are not atomic operations. Individually they're protected by the target API mutex, but that doesn't help us since they're separate calls.

I see two ways to solve this problem:

  1. Add a mutex to the DAP instance to synchronize between the main thread and the events thread.
  2. Expose the target API mutex as part of the SB API. It's a recursive mutex so there's no risk to locking it before calling SB APIs that are protected by it. I would want this to be an RAII object so I would not expose it through SWIG.

I'm leaning towards the second options because I think the use care here matches the idea behind the API lock.

llvmbot commented 1 week ago

@llvm/issue-subscribers-lldb-dap

Author: Jonas Devlieghere (JDevlieghere)

I discovered that `TestDAP_breakpointEvents.py` (macOS only) was failing nondeterministically. I tracked the problem down to the event handler thread (`EventThreadFunction`) where the check that a breakpoint has the DAP label would sometimes return false, even though the breakpoint in question was set through DAP. I confirmed it's a race condition with the following assert: ``` bool is_dap = bp.MatchesName(BreakpointBase::GetBreakpointLabel()); assert(is_dap == bp.MatchesName(BreakpointBase::GetBreakpointLabel())); ``` The problem is that the operation of creating a breakpoint and adding the label are not atomic. Both SB API operations are protected by the target API lock. Here's what's going on: 1. The main thread receives a request to create a breakpoint with `setBreakpoints`. This calls `SBTarget::BreakpointCreateByLocation` which acquires the target API lock, creates the breakpoint and releases it. 2. The event thread receives an `eBreakpointEventTypeLocationsResolved` event and prepares to send a `breakpoint` event, which involves checking if the breakpoint has the DAP label. This calls `SBBreakpoint::MatchesName` which acquires the target API lock, doesn't find the label, and releases it. 3. Meanwhile on the main thread, we're trying to add the label in `dap::Breakpoint::SetBreakpoint`. This calls `SBBreakpoint::AddName` which waits on the API lock that's currently held by the event thread checking if the label is set. In other words, creating the breakpoint and adding the label are not atomic operations. Individually they're protected by the target API mutex, but that doesn't help us since they're separate calls. I see two ways to solve this problem: 1. Add a mutex to the `DAP` instance to synchronize between the main thread and the events thread. 2. Expose the _target API mutex_ as part of the SB API. It's a recursive mutex so there's no risk to locking it before calling SB APIs that are protected by it. I would want this to be an RAII object so I would not expose it through SWIG. I'm leaning towards the second options because I think the use care here matches the idea behind the API lock.
llvmbot commented 1 week ago

@llvm/issue-subscribers-lldb

Author: Jonas Devlieghere (JDevlieghere)

I discovered that `TestDAP_breakpointEvents.py` (macOS only) was failing nondeterministically. I tracked the problem down to the event handler thread (`EventThreadFunction`) where the check that a breakpoint has the DAP label would sometimes return false, even though the breakpoint in question was set through DAP. I confirmed it's a race condition with the following assert: ``` bool is_dap = bp.MatchesName(BreakpointBase::GetBreakpointLabel()); assert(is_dap == bp.MatchesName(BreakpointBase::GetBreakpointLabel())); ``` The problem is that the operation of creating a breakpoint and adding the label are not atomic. Both SB API operations are protected by the target API lock. Here's what's going on: 1. The main thread receives a request to create a breakpoint with `setBreakpoints`. This calls `SBTarget::BreakpointCreateByLocation` which acquires the target API lock, creates the breakpoint and releases it. 2. The event thread receives an `eBreakpointEventTypeLocationsResolved` event and prepares to send a `breakpoint` event, which involves checking if the breakpoint has the DAP label. This calls `SBBreakpoint::MatchesName` which acquires the target API lock, doesn't find the label, and releases it. 3. Meanwhile on the main thread, we're trying to add the label in `dap::Breakpoint::SetBreakpoint`. This calls `SBBreakpoint::AddName` which waits on the API lock that's currently held by the event thread checking if the label is set. In other words, creating the breakpoint and adding the label are not atomic operations. Individually they're protected by the target API mutex, but that doesn't help us since they're separate calls. I see two ways to solve this problem: 1. Add a mutex to the `DAP` instance to synchronize between the main thread and the events thread. 2. Expose the _target API mutex_ as part of the SB API. It's a recursive mutex so there's no risk to locking it before calling SB APIs that are protected by it. I would want this to be an RAII object so I would not expose it through SWIG. I'm leaning towards the second options because I think the use care here matches the idea behind the API lock.
JDevlieghere commented 1 week ago

CC @ashgti and @vogelsgesang

jimingham commented 1 week ago

I like the second option as well.

The DAP is creating SB API like primitives composed of a series of SB API calls for some operations, and then it might want to use SB API's as the primitives directly for other operations. The only way for that to work coherently is for the DAP to be able to protect the composite operations using the same locking strategy the SB API calls use.

ashgti commented 1 week ago

This sounds like the same issue in #130863

I wonder if we actually need the label or if we could send other breakpoint events.

An adapter can send a breakpoint event with "reason": "new" or event "reason": "removed" to have the adapter create or remove breakpoints. The one caveat with the 'new' event is that shouldn't send it until there is a 'source' associated with the breakpoint (see https://github.com/microsoft/vscode/blob/d09fd2b5d78e10927937d2b24ee588a6c27b34eb/src/vs/workbench/contrib/debug/browser/debugSession.ts#L1240 for where VSCode creates the breakpoint if it needs to).

I started messing with that locally and have a partial update that removes the need for labels.

jimingham commented 1 week ago

That's a local solution, but I think it's just postponing the inevitable. Any time that a DAP primitive makes two SB API calls in one primitive, and you are allowing other threads to make SB API calls, you have no guarantee that the thing you were trying to do in the primitive was done before some other operation came along and changed state out from under you.

To make this actually stable, you either need to make sure that all the SB API work goes on on a single thread, or you have to be able to ensure that these composite operations get to finish before another thread can change state out from under them.

jimingham commented 1 week ago

It would be nice to represent breakpoints that the user set on the command-line or in their lldbinit files, etc, in the UI. But that's hard in general since you can make fairly odd breakpoint kinds (e.g. backed by Python resolvers with specific key/value arguments) so you have to be able to represent that somehow in the UI. All the breakpoint kinds are serializable, so persistence could be managed even for breakpoint kinds the UI doesn't understand. But how to present them is tricky.

I had some notions in the past of writing a DTD alongside the serialization form (which is JSON) that describes the various breakpoint kinds which dynamic UI's could use to build UI for breakpoint setting, but I haven't gotten around to that yet.

JDevlieghere commented 1 week ago

This sounds like the same issue in #130863

Ha, I really regret not seeing that issue earlier. You even tagged me. I need to check why this didn't show up in my inbox.

I wonder if we actually need the label or if we could send other breakpoint events.

I agree that'd be nice , but that's working around the issue. It's just a matter of time before we have another call in the event or progress thread that will run into this and as illustrated by both of our issues, it's not trivial to debug. We should figure out the synchronization regardless of whether we decide to keep using the label.