rprichard commented 3 years ago


Bugzilla Link	50935
Version	12.0
OS	Linux
CC	@JDevlieghere,@jimingham

Extended Description

With a dynamically-linked musl executable (on x86_64 gLinux aka Debian), LLDB is unable to step over a "printf" call. The problem happens with either PIE or non-PIE executables, but not with statically-linked musl executables. LLDB is able to step over this call if I use glibc instead (with either the Debian gcc or clang compiler).

I tested with both LLDB 12.0.0 and with a current LLDB build from origin/main.

$ cat >test.c <<EOF

include

int main(void) { printf("hello!\n"); return 0; } EOF

$ musl-gcc test.c -g $ file a.out a.out: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib/ld-musl-x86_64.so.1, with debug_info, not stripped

$ /x/clang12/bin/lldb a.out (lldb) target create "a.out" Current executable set to '/x/mess/a.out' (x86_64). (lldb) b main Breakpoint 1: where = a.out`main + 4 at test.c:3:3, address = 0x0000000000001159 (lldb) run Process 235162 launched: '/x/mess/a.out' (x86_64) Process 235162 stopped

thread #1, name = 'a.out', stop reason = breakpoint 1.1 frame #0: 0x0000555555555159 a.out`main at test.c:3:3 1 #include 2 int main(void) { -> 3 printf("hello!\n"); 4 return 0; 5 } (lldb) n Process 235162 stopped
thread #1, name = 'a.out', stop reason = step over failed (Could not create return address breakpoint.) frame #0: 0x0000555555555020 a.outputs a.outputs: -> 0x555555555020 <+0>: jmpq *0x2ff2(%rip) ; _GLOBAL_OFFSETTABLE + 24 0x555555555026 <+6>: pushq $0x0 0x55555555502b <+11>: jmp 0x555555555010

a.out`__libc_start_main: 0x555555555030 <+0>: jmpq *0x2fea(%rip) ; _GLOBAL_OFFSETTABLE + 32

$ /x/clang12/bin/lldb --version lldb version 12.0.0 (https://github.com/llvm/llvm-project/ revision b978a93635b584db380274d7c8963c73989944a1) clang revision b978a93635b584db380274d7c8963c73989944a1 llvm revision b978a93635b584db380274d7c8963c73989944a1

$ musl-gcc --version cc (Debian 10.2.1-6+build2) 10.2.1 20210110 Copyright (C) 2020 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ dpkg -l musl ... ii musl:amd64 1.2.2-1 amd64 standard C library

rprichard commented 3 years ago

Yeah, I may have some time to dig into this more.

I also noticed that SectionLoadList::ResolveLoadAddress is unable to lookup an address in musl's libc.so [...]

Filed as llvm/llvm-bugzilla-archive#51466 .

jimingham commented 3 years ago

Yeah, I can upload my patch that allows for running to a breakpoint in more situations.

In the "step-off-breakpoint" error, stepping here is not going to go well if we can't unwind from the stub.

In this case, LLDB sets a breakpoint after the call instruction, then steps into the called function. It seems(?) like it could simply leave the breakpoint in place and resume, rather than clear the post-call breakpoint and try to unwind (which if successful would simply put a breakpoint at the same post-call PC).

That's quite likely. The optimization to treat calls specially is relatively new. For the longest time, we always ran to the next "branch" instruction no matter the kind, and then did the "step in/step out" trick. We made no assumptions about the character of the branch we were at. The special handling of "call" instructions was recent. BTW, by "call" instruction, we just mean an instruction that is guaranteed (except for exceptions) to return to the next instruction following it.

So it's not surprising that this opportunity to simplify the algorithm hasn't been taken yet. Again, if you feel like digging in, feel encouraged to do so, and I'll happily review any patches.

My plate is pretty full, so left to me, this may take a while.

I'm still unfamiliar with the "plan" stuff -- I'll probably study it a bit more.

I debugged unwinding a bit. LLDB is able to unwind from the executable .plt's back to main, and then from main into libc_start_main_stage2. However, I think it doesn't find unwind info for libc_start_main_stage2, so it can't establish a certain CFA value, then it discards both the frame for main and libc_start_main_stage2. I need to study it more closely, but I think LLDB could do better. Maybe musl needs an "end of the stack" annotation somehow?

Maybe a separate bug about this might be a good way to go. I'm not an expert on the unwinder, and this bug is starting to become a portmanteau bug...

I also noticed that SectionLoadList::ResolveLoadAddress is unable to lookup an address in musl's libc.so (e.g. "disas -n printf" works but "disas -a
" doesn't). I don't think it's quite related, so I'll file a separate bug for it. I suspect the "TODO: remove this once we either fix library matching or avoid" in DynamicLoaderPOSIXDYLD.cpp special case is broken.

Thanks!

rprichard commented 3 years ago

Yeah, I can upload my patch that allows for running to a breakpoint in more situations.

In the "step-off-breakpoint" error, stepping here is not going to go well if we can't unwind from the stub.

In this case, LLDB sets a breakpoint after the call instruction, then steps into the called function. It seems(?) like it could simply leave the breakpoint in place and resume, rather than clear the post-call breakpoint and try to unwind (which if successful would simply put a breakpoint at the same post-call PC).

I'm still unfamiliar with the "plan" stuff -- I'll probably study it a bit more.

I debugged unwinding a bit. LLDB is able to unwind from the executable .plt's back to main, and then from main into libc_start_main_stage2. However, I think it doesn't find unwind info for libc_start_main_stage2, so it can't establish a certain CFA value, then it discards both the frame for main and libc_start_main_stage2. I need to study it more closely, but I think LLDB could do better. Maybe musl needs an "end of the stack" annotation somehow?

I also noticed that SectionLoadList::ResolveLoadAddress is unable to lookup an address in musl's libc.so (e.g. "disas -n printf" works but "disas -a " doesn't). I don't think it's quite related, so I'll file a separate bug for it. I suspect the "TODO: remove this once we either fix library matching or avoid" in DynamicLoaderPOSIXDYLD.cpp special case is broken.

jimingham commented 3 years ago

We've talked at times about having a different "I'm lost in a new stack frame" response than to just stop. We could, for instance, just keep single stepping and hope we get somewhere better. Maybe single step till the function above us shows up again in the backtrace, and then try to set our return breakpoint.

But for right now the simplest thing is "if confused, stop!"

jimingham commented 3 years ago

If you have a fix for the logic to make us step past the call instruction rather than stepping in, please put that up for review. That seems useful, though TTTT the optimization of "not doing step-in & step-out" is a best effort thing. We still expect that as a last resort step-in and step-back-out should work.

In the "step-off-breakpoint" error, stepping here is not going to go well if we can't unwind from the stub. We can't really do anything sensible, so we should just stop. The bug here is that, after the step over line plan was popped off the stack, we end up saying:

intern-state ThreadList::ShouldStop overall should_stop = 0

and then continuing.

Since the ThreadPlanStepRange plan was done & popped off the stack, we would next ask the base thread plan whether to stop in: ThreadPlanBase::ShouldStop. That's a moderately complex function and there's not enough info in your log to guess why it is deciding to resume. Somebody will need to debug that live.

rprichard commented 3 years ago

lldb-step-off-breakpoint.txt

rprichard commented 3 years ago

apple-step-into-example.txt

rprichard commented 3 years ago

When I use musl-gcc, LLDB isn't able to unwind either from the executable's .plt or from musl's libc.so, whereas that does work if I use the gcc driver (for using glibc). At least for the executable's .plt, LLDB identifies two ways to unwind, "assembly insn profiling" and "eh_frame CFI", but of course it doesn't trust eh_frame to be valid at every PC, so maybe that's affecting something. I need to debug it some more.

Even on macOS, though, I see the same behavior where LLDB steps over a call by stepping into it and back out. I'll attach an example.

I'm suspicious of a couple of comparisons in LLDB's ThreadPlanStepRange::SetNextBranchBreakpoint that seem off-by-two and off-by-one:

https://github.com/llvm/llvm-project/blob/llvmorg-13.0.0-rc1/lldb/source/Target/ThreadPlanStepRange.cpp#L334-L347

// If we didn't find a branch, run to the end of the range. if (branch_index == UINT32_MAX) { uint32_t last_index = instructions->GetSize() - 1; if (last_index - pc_index > 1) { ... } } else if (branch_index - pc_index > 1) {

The (last_index - pc_index > 1) comparison requires that the list of instructions have at least 3 remaining instructions to execute, whereas intuitively I'd expect 1 instruction to be enough? i.e. The if statement can be removed.

The (branch_index - pc_index > 1) comparison requires that there be at least 2 instructions between the PC and the branch instruction, but intuitively 1 ought to be enough. The comparison could become (branch_index > pc_index).

e.g. From my attached apple-step-into-example.txt, the assembly is:

(lldb) disas a.out`main: 0x100003f7a <+0>: pushq %rbp 0x100003f7b <+1>: movq %rsp, %rbp -> 0x100003f7e <+4>: leaq 0x29(%rip), %rdi ; "hello!" 0x100003f85 <+11>: callq 0x100003f8e ; symbol stub for: puts 0x100003f8a <+16>: xorl %eax, %eax 0x100003f8c <+18>: popq %rbp 0x100003f8d <+19>: retq

I think the list of instructions would have the leaq and callq instructions, pc_index would be 0, and there is no branch at the end. last_index is 1 (referring to the callq instruction). We want to set a breakpoint on the xorl instruction and run to it, but ThreadPlanStepRange::SetNextBranchBreakpoint doesn't set a breakpoint, and LLDB seems to single-step one instruction at-a-time. (I'm not sure how that works yet...)

$ cat apple-step-into-example.txt | grep reached ThreadPlanStepOverRange reached 0x0000000100003f85. ThreadPlanStepOverRange reached 0x0000000100003f8e. ThreadPlanStepOverRange reached 0x0000000100003f8a.

This commit is relevant:

https://github.com/llvm/llvm-project/commit/a3f466b9e785ca8f6712904e408bda31c79ca1b0

"Fix commit 252963 to work around a bug on some platforms where they don't correctly handle stepping over one breakpoint directly onto another breakpoint.
This isn't fixing that bug, but rather just changing 252963 to not use breakpoints if it is only stepping one instruction."

Stepping one instruction isn't generally sufficient to reach the branch (or the end of instruction list), though:

If branch_index is UINT32_MAX: pc_index could point to the last instruction, which is a call that we want to step over. In that case, the breakpoint would be one instruction further (but multiple instructions run).
If branch_index is a real branch: pc_index could point to a call instruction just before the branch.

I experimented with the above changes to ThreadPlanStepRange::SetNextBranchBreakpoint, and it fixes this bug, but reveals another with musl-gcc:

Set a breakpoint on a call instruction to printf (or puts).
Run to the breakpoint.
Try to step over (n) the call instruction.
The program doesn't stop until it exits.

I think(?) LLDB is required to single-step off the breakpoint, which puts the PC in the printf linker stub. From there, I guess LLDB can't unwind so it doesn't know what to do? But in principle, it could still work, because it could just run to the address past the call. That doesn't happen, though. I'll attach it in case it's interesting, lldb-step-off-breakpoint.txt. That file has a few interesting lines:

(lldb) disas a.out`main: 0x555555555155 <+0>: pushq %rbp 0x555555555156 <+1>: movq %rsp, %rbp 0x555555555159 <+4>: leaq 0xea0(%rip), %rdi -> 0x555555555160 <+11>: callq 0x555555555020 ; symbol stub for: ___lldb_unnamed_symbol61 0x555555555165 <+16>: movl $0x0, %eax 0x55555555516a <+21>: popq %rbp 0x55555555516b <+22>: retq
(lldb) log enable lldb step (lldb) n lldb Thread::PushPlan(0x0x1103bc0): "Stepping over line hello.c:4:3.", tid = 0x187bf. lldb ThreadPlanStepRange::SetNextBranchBreakpoint - Setting breakpoint -2 (site 4) to run to address 0x555555555165 lldb Process::PrivateResume() m_stop_id = 5, public state: stopped private state: stopped lldb Thread::PushPlan(0x0x1103bc0): "Single stepping past breakpoint site 3 at 0x555555555160", tid = 0x187bf. ... intern-state ThreadPlanStepOverRange reached 0x0000555555555020. intern-state Removing next branch breakpoint: -2. intern-state Stepping out of frame with no debug info ...

jimingham commented 3 years ago

I see three stages of the process that seem not to be working on your system.

1) When lldb is "nexting over a function" and sees that it is at a "call" instruction (i.e. one that is guaranteed to return to the next instruction), it won't try to step in and then step back out of the function, it will instead put a breakpoint on the instruction after the call and run to there.

We use the return from InstructionLLVMC::IsCall for the stub dispatch instruction to decide whether to take this shortcut or not. If that were returning true, we wouldn't have stepped into printf and had to step back out again. Of course if you are using a more general instruction to dispatch to the cross-library stub that isn't a call then this is not relevant. We will have to step in and then back out to follow the control flow.

2) When we do a step-in and land in shared library trampoline code, we usually step "through" the trampoline to the target function automatically. That doesn't seem to be working, since we stop in the stub in your a.out, not in the target of the stub. Stepping through cross-library stubs is handled by the GetStepThroughTrampolinePlan call in the DynamicLoader plugin for your system.

It looks like that isn't working in your case. Note, if your dynamic loader behavior is different from the standard Linux loader behavior, you will have to write a DynamicLoader plugin that describes how it works..

3) Finally, when we try to step out of a frame, we ask the unwinder for the return pc from the frame above us. It looks like either we don't know how to unwind from the cross-library stub, and so we can't find the previous frame or we can't find the return PC in that frame. There's something about being stopped at a cross-library stub that's fooling the unwinder.

These are all failures specific to support for your system, it's the plugin-points that are failing so far as I can see, not the basic algorithms. Somebody who has access to this system and can debug these problems will need to have a look.

llvm / llvm-project

lldb step over failed in musl dyn exe (Could not create return address breakpoint) #50279

Extended Description

include