rr-debugger / rr

Record and Replay Framework
http://rr-project.org/

Support ARM #1373

Closed khuey closed 3 years ago

khuey commented 9 years ago

Should file this since I'm working on it.

https://github.com/khuey/rr/compare/arm

Requires a couple kernel patches at the moment too.

ignoramous commented 9 years ago

Would this mean that rr could be used to debug android applications?

rocallahan commented 9 years ago

Eventually yes, but that's a lot of extra work beyond the ARM support Kyle is working on.

MagaTailor commented 8 years ago

Any news on ARM support?

rocallahan commented 8 years ago

ARM support is not happening in the foreseeable future. We discovered a critical technical issue: ARM processors implement atomic operations using a load-linked/store-conditional pair of instructions, and those operations can fail nondeterministically (from rr's point of view; failures depend on cache state and whether a hardware interrupt occurs between the instructions). So we don't have a performance counter that is deterministic enough for rr to use under those conditions.

To fix this, we'd have to modify rr's design philosophy and instrument all ARM code, perhaps using DynamoRio or something like that. That's a lot of work and the cost/benefit for Mozilla doesn't seem to be there right now.

If you really need rr-like functionality for ARM and Android, I recommend buying UndoDB from Undo Software.

matt2909 commented 8 years ago

That is a strange comment. Can you explain what is lacking in the approach the Linux kernel takes for atomic operations?

http://lxr.free-electrons.com/source/arch/arm/include/asm/atomic.h#L41

rocallahan commented 8 years ago

There is no difficulty implementing atomic operations. The problem is that they can disturb the performance counters.

For example, suppose we're using the number of retired instructions, measured via HW performance counters, as our progress counter. Suppose we record a simple program that just does an atomic increment using the code sequence you referenced. Suppose that the LL/SC pair succeeds the first time and we record N instructions executed. Now suppose we replay the execution but this time, a hardware interrupt occurs between the ldrex and the strex instructions, forcing the strex to fail and the code to execute another iteration of the loop. The program completes but performance counters report that we have executed N+4 instructions.

This effect means that performance counters are not 100% reliable for our purposes, which makes rr's zero-instrumentation approach infeasible.
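The counting problem described above can be sketched with a toy model. This is only an illustration, not rr code: `LOOP_LEN`, `retired_instructions`, and the value of `N` are hypothetical, and the loop length of 4 is an assumption chosen to match the "N+4" figure in the comment.

```python
# Hypothetical model: each trip through the LL/SC retry loop retires
# LOOP_LEN instructions (assumed: load-exclusive, add, store-exclusive,
# conditional branch back to the top).
LOOP_LEN = 4


def retired_instructions(base, strex_failures):
    """Instructions retired by a program whose only nondeterminism is LL/SC retries.

    base           -- count when every store-exclusive succeeds first time (N)
    strex_failures -- number of failed store-exclusives (each forces one retry)
    """
    return base + LOOP_LEN * strex_failures


N = 1000                               # arbitrary illustrative value
recorded = retired_instructions(N, 0)  # recording: no interrupt inside the pair
replayed = retired_instructions(N, 1)  # replay: one interrupt between ldrex and strex

print(recorded, replayed, replayed - recorded)
```

The program's observable behavior is identical in both runs, yet the counter differs, so a replay that waits for "N instructions retired" would stop at the wrong point.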

Keno commented 7 years ago

I've been reading more about ARM performance counters. It seems that at least the newer ARM chips can count failed strex instructions. I wonder whether setting ticks to branches taken - failed strex would be consistent enough for our purposes (It's of course possible for there to be branches in the ll/sc pair, but I don't know how common that is in the real world).
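A minimal sketch of why "branches taken minus failed strex" might be stable, under the assumption stated in the comment that the only branch inside an LL/SC pair is the retry back-edge. The functions `counters` and `ticks` and all the numbers are hypothetical and only model the arithmetic; they do not read any real performance counter.

```python
def counters(other_taken_branches, failures_per_op):
    """Return (taken_branches, failed_strex) for one simulated run.

    other_taken_branches -- deterministic taken branches outside LL/SC loops
    failures_per_op      -- per atomic operation, how many strex attempts failed
    """
    failed_strex = sum(failures_per_op)
    # Assumption: every failed strex causes exactly one taken back-edge branch.
    taken_branches = other_taken_branches + failed_strex
    return taken_branches, failed_strex


def ticks(taken_branches, failed_strex):
    """Proposed progress measure: taken branches corrected for LL/SC retries."""
    return taken_branches - failed_strex


record = counters(500, [0, 0, 0])  # no strex failures while recording
replay = counters(500, [1, 0, 2])  # three failures while replaying

print(record, replay)
print(ticks(*record), ticks(*replay))
```

The raw branch counts differ between the two runs, while the corrected value is the same. The correction breaks down if other branches sit between the load-exclusive and the store-exclusive, which is the caveat the comment raises.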

rocallahan commented 7 years ago

Mmm. Reference?

khuey commented 7 years ago

From my notes (and you should double check this, because they're from over a year ago), there were two issues.

  1. Cortex A17 counter value 0x63 claims to count "Exclusive instruction speculatively executed - STREX fail." That speculative part is a killer.
  2. The Cortex A17 removed the architecturally executed branch counter. The only counter of architectural executions remaining is the instructions retired counter.

Keno commented 7 years ago

I was looking at the Cortex-A9 docs, http://infocenter.arm.com/help/topic/com.arm.doc.ddi0388g/BEHDIGBF.html, which has

STREX failed
Counts the number of STREX instructions architecturally executed and failed.

but I do see (as @khuey points out) that some other chips have a similar event with "speculatively executed", and I'm not sure exactly what that means.

khuey commented 7 years ago

Ah, that's interesting. Unfortunately the A9 is so old that it doesn't support separating user space counts from kernel space counts, so any hopes of using its performance counters for rr died a long time ago.

Speculatively executed means the processor may have tried to perform some work (based off a branch prediction or whatever) that it ends up throwing away later. Having instructions that didn't architecturally execute show up in the counts makes them unsuitable, unfortunately.

Keno commented 7 years ago

How fast do these interrupts fire? Could we set an interrupt on STREX failing, even speculatively, then single step past it?

matt2909 commented 7 years ago

| How fast do these interrupts fire?

That entirely depends on the micro-architecture, but most aggressive implementations will not offer guarantees about the delay from event firing to the interrupt being taken. This "skew" can be many tens of cycles in the extreme case.


Keno commented 7 years ago

Do you happen to know of any other way to make strex instructions trap?

vielmetti commented 7 years ago

Reopening this issue; any active work on ARM going on now?

Keno commented 7 years ago

any active work on ARM going on now?

Unfortunately, no. I think the current consensus is that we'd have to get changes into the silicon (some mechanism to get an interrupt on strex failing) in order to make rr feasible on ARM.

andersjel commented 7 years ago

How about running rr in an ARM guest in QEMU? The code for failing strex instructions is generated here.

rocallahan commented 7 years ago

I guess we could hack extra features into QEMU but that would not provide the "low overhead" or convenience that we're looking for with rr.

rpw commented 7 years ago

CoreSight tracing might be an alternative to using performance counters on ARM. In Linux kernels >= 4.9 and suitable hardware exposing the ETM macrocell in the device tree it is available through the perf interface:

http://events.linuxfoundation.org/sites/events/files/slides/ELC-E16.pdf

khuey commented 7 years ago

When I last looked into CoreSight (which was 2015) the biggest stumbling block was that OEMs often didn't provide the necessary data to enable it in the device tree in their data sheets. Whether that was because they didn't care about CoreSight or that their chips didn't contain that functionality I don't know.

rocallahan commented 7 years ago

How would Coresight help with https://github.com/mozilla/rr/issues/1373#issuecomment-151642482 ?

vielmetti commented 6 years ago

@khuey If you need access to OEM hardware information I should have a pretty good way to get that on Arm server-class equipment from Cavium, Huawei, Hisilicon through my work at @packethost .

rocallahan commented 5 years ago

What I would like to have from ARM is the ability to trigger a synchronous trap on a failed strex instruction. Then we could:

rocallahan commented 5 years ago

I don't think this is a crazy feature to ask for, since strex failure is presumably infrequent and is not the "fast path".

bill-myers commented 5 years ago

How about patching all ldrex/strex instruction pairs in the program to do the load/store unconditionally and not doing task switching inside ldrex/strex pairs, thus guaranteeing deterministic behavior?

rocallahan commented 5 years ago

For complexity and performance reasons rr has so far avoided scanning the entire incoming instruction stream. That is something I'd very much like to preserve.

We can't just scan binaries offline either, since for example browser JITs generate atomic operations now.

xiangzhai commented 5 years ago

@khuey cool! I am learning your patch to support other architectures :)

khuey commented 5 years ago

@xiangzhai ok, feel free to email me if you have any questions

khuey commented 5 years ago

By the way, if you have not already done so, I would strongly suggest reading our technical report[0] before attempting to port rr to other hardware architectures. Sections 2.4, 2.6, and 5 are likely particularly relevant.

[0] https://arxiv.org/pdf/1705.05937.pdf

Keno commented 4 years ago

I have some news here:

  1. Modern ARM has non-linked atomics, so the ll/sc issue goes away if you make sure to build your applications using the new instruction set. That limits use cases a little bit, but is worth it for those who control how the binaries are built.
  2. I think it is possible to handle ll/sc by combining the speculative strex fail counter with ETM. The way to do it is to have ETM record the P0 trace into a circular buffer. Then when you take the strex trap, you can use ETM to see whether the speculation succeeded or failed and record it appropriately.
  3. I looked into some modern AArch64 microarchitectures and the architectural counters appear to be sufficiently reliable for our use cases.
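As a side note on the first point, one way to see whether a machine offers the non-linked (LSE) atomics at all is to look for the `atomics` flag that arm64 Linux lists on the `Features` line of `/proc/cpuinfo`. The sketch below is an assumption-laden illustration, not part of rr: `has_lse_atomics` is a hypothetical helper and the sample text is made up. It says nothing about whether a given binary actually avoids ll/sc.

```python
def has_lse_atomics(cpuinfo_text):
    """Return True if any 'Features' line lists the 'atomics' flag."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Features" and "atomics" in value.split():
            return True
    return False


# Made-up sample in the shape of an arm64 /proc/cpuinfo entry.
sample = (
    "processor\t: 0\n"
    "Features\t: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp\n"
)
print(has_lse_atomics(sample))
```

On a real system the text would come from something like `open("/proc/cpuinfo").read()`.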

I hacked together a prototype, though I didn't validate the ETM part of this, since the kernel doesn't see it on the machine that I tried (even though it should be available). I'll try to do that, but in the meantime:

ubuntu@ip-172-31-86-244:~/rr-build$ ./bin/rr record uname -m
rr: Saving execution to trace directory `/home/ubuntu/.local/share/rr/uname-8'.
aarch64
ubuntu@ip-172-31-86-244:~/rr-build$ ./bin/rr replay -a
aarch64

I'll start putting up the patches.

khuey commented 4 years ago

I'd be more impressed if I saw an example that used something like pthreads :) We had single-threaded AArch32 programs without any use of atomics working back in early 2015.

The new instructions you're talking about are the ones from C3.2.13 of the ARM Architecture Reference Manual (I'm looking at the F.b revision)? While that's exciting, there are some fundamental deployability challenges to a solution that requires them. In addition to requiring all the user's software to be built (including system and third party libraries and any JITs) without the old instructions so that this can work, we'd also need to somehow identify when this requirement is not met so that we can give users useful explanations of how to fix their systems. That's non-trivial.

5 years ago I looked at some of the CoreSight stuff as a way to circumvent our problems with the LLSC pattern and remember that a) support for it in the kernel was very poor and b) very few devices had support for it (I believe ARM's fancy Juno development device was the only one that advertised it in the kernel) either because it's not included on the silicon or because manufacturers did not document the memory addresses of the components on their datasheets. My understanding is that Linaro has largely addressed (a) in the intervening time but (b) may still be an issue. There's also the obvious question of whether an ETM based solution actually works :)

Now, cynicism aside, I'm excited to see what you have :)

Keno commented 4 years ago

In addition to requiring all the user's software to be built (including system and third party libraries and any JITs)

Yes, this is fine for our use case :)

we'd also need to somehow identify when this requirement is not met so that we can give users useful explanations of how to fix their systems

Since we can count speculatively executed strex instructions, I assume we'd just count those and give a warning if the number is not zero. Now there may be corner cases where such an instruction is behind an architecture check that gets speculated through, but that may be acceptable.
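The "count and warn" idea could look roughly like the sketch below. It is hypothetical: `llsc_warning` is not an rr function, and reading the actual performance counter is out of scope here, so the count is simply passed in.

```python
def llsc_warning(strex_spec_count):
    """Return a warning string if any store-exclusive was counted, else None.

    strex_spec_count -- value of a (speculative) store-exclusive event counter;
                        assumed to be read elsewhere.
    """
    if strex_spec_count == 0:
        return None
    return (
        f"{strex_spec_count} store-exclusive instruction(s) counted; "
        "replay may diverge because LL/SC atomics appear to be in use"
    )


print(llsc_warning(0))
print(llsc_warning(2))
```

Because the counter is speculative, a nonzero value could be a false positive in the corner case mentioned above, which is why this sketch only warns.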

(b) may still be an issue

Yes, it's an issue. I'm gonna go see if I can get my hands on a system that actually has it. That said, I think the no-ll/sc use case is still interesting even if ETM is not available.

I'll go build an LSE-atomics pthread, so you can be sufficiently impressed ;) (hopefully, if something else breaks unexpectedly, I'll be very sad).

Keno commented 4 years ago

I'd be more impressed if I saw an example that used something like pthreads :)

ubuntu@ip-172-31-86-244:~/rr/src/test$ ~/rr-build/bin/rr record ./thread_stress
rr: Saving execution to trace directory `/home/ubuntu/.local/share/rr/thread_stress-0'.
EXIT-SUCCESS
ubuntu@ip-172-31-86-244:~/rr/src/test$ ~/rr-build/bin/rr replay -a
EXIT-SUCCESS
ubuntu@ip-172-31-86-244:~/rr/src/test$ uname -m
aarch64

Happy ;)? Unfortunately, that seems to work with ll/sc also, so I probably need to find a better test case.

Keno commented 4 years ago

Alright, found a way to induce some spurious wake ups:

ubuntu@ip-172-31-86-244:~/rr/src/test$ ~/linux/tools/perf/perf record ~/rr-build/bin/rr replay -a ~/.local/share/rr/thread_stress_new-0
EXIT-SUCCESS
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 0.519 MB perf.data (9798 samples) ]
ubuntu@ip-172-31-86-244:~/rr/src/test$ ~/linux/tools/perf/perf record ~/rr-build/bin/rr replay -a ~/.local/share/rr/thread_stress_old-0/
[FATAL /home/ubuntu/rr/src/ReplaySession.cc:1045:check_ticks_consistency()]
 (task 77956 (rec:51581) at time 5853)
 -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: clone'; expected 184834, got 184836

Keno commented 4 years ago

(new is a libc with lse atomics exclusively, old is ll/sc atomics)
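For readers unfamiliar with the failure above, the check amounts to comparing the tick count stored in the trace with the count observed during replay at the same event. The sketch below only mirrors the message format from the log; it is not rr's implementation (rr is written in C++), and `TicksMismatch` is a hypothetical name.

```python
class TicksMismatch(Exception):
    """Raised when replay reaches an event at a different tick count than recorded."""


def check_ticks_consistency(event, trace_ticks, ticks_now):
    """Fail if the replayed tick count differs from the recorded one."""
    if ticks_now != trace_ticks:
        raise TicksMismatch(
            f"ticks mismatch for '{event}'; expected {trace_ticks}, got {ticks_now}"
        )


try:
    # Values taken from the log above: two extra ticks during replay.
    check_ticks_consistency("SYSCALL: clone", 184834, 184836)
except TicksMismatch as err:
    print(err)
```

A difference of even two ticks is fatal, since the replay can no longer tell where in the instruction stream the recorded event happened.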

khuey commented 4 years ago

Sweet.

khuey commented 4 years ago

Provided we can reliably detect the LL/SC instructions via the performance counters and error out I think we'd merge code that only supported the new style atomics for now (though obviously ETM would still be awesome).

Keno commented 4 years ago

Provided we can reliably detect the LL/SC instructions via the performance counters

Yes, we can, I've tested this.

Keno commented 4 years ago

Alright, this is getting to a point where I think others can try it if they want to (disclaimer, it probably won't work). Here's a branch that's my latest PR, plus three hacky commits to comment out things that don't work yet: https://github.com/Keno/rr/tree/kf/aarch64hacks

Build instructions:

# Build libc with lse
mkdir glibc_prefix
mkdir glibc_build
git clone https://github.com/bminor/glibc
cd glibc_build
../glibc/configure --prefix=$PWD/../glibc_prefix CFLAGS="-march=armv8.3-a -O3 -g3"
make
make install
cd ..

# Build rr against the custom libc
mkdir rr-build
git clone https://github.com/Keno/rr rr-keno-hacks
cd rr-keno-hacks
git checkout kf/aarch64hacks
cd ../rr-build
cmake -DCMAKE_C_FLAGS="-I$PWD/../glibc_prefix/include" -DCMAKE_EXE_LINKER_FLAGS="-L$HOME/usr/lib -Wl,--rpath=$PWD/../glibc_prefix/lib -Wl,--dynamic-linker=$PWD/../glibc_prefix/lib/ld-linux-aarch64.so.1" -G Ninja ../rr-keno-hacks
ninja

# Try it
cp /usr/lib/aarch64-linux-gnu/libkj-0.7.0.so .
cp /usr/lib/aarch64-linux-gnu/libcapnp-0.7.0.so .
cp /lib/aarch64-linux-gnu/libstdc++.so.6 .
LD_LIBRARY_PATH=$PWD ./bin/rr record --unmap-vdso -n ./bin/simple

Keno commented 4 years ago

I would like to validate this on another modern ARM chip to make sure I didn't accidentally make any AWS specific assumptions. @vielmetti rumor is Packet is getting Ampere Altra boxes. Any chance of hooking me up with access to see if this works? There's also still the question of Coresight. The AWS boxes have those disabled and they won't tell me whether that's just a firmware configuration or they didn't include the IP.

vielmetti commented 4 years ago

@Keno

Altra is on the horizon pending availability. We do have eMag now which you can spin up as c2.large.arm ($1/hr, less in the spot market). https://www.packet.com/cloud/servers/c2-large-arm/ Email me for a promo code, should be plenty to test on.

Coresight has been disabled on every server-class chip I have ever seen because of enormous security impacts.

Keno commented 4 years ago

Thanks for the offer. I think we'll have to wait for the Altra though, since the eMag does not have the requisite new instructions. Too bad about coresight.

khuey commented 4 years ago

Any updates here @Keno? Not entirely sure what if anything we're waiting on.

Keno commented 4 years ago

It's mostly complete. The big missing thing is syscallbuf support. Things just got a bit busy at the day job, so I haven't gotten around to it ;).

peterwaller-arm commented 4 years ago

Hi @Keno! Fantastic work, I'm excited to use it. I've been trying things out according to your instructions but hit failure cases. I'm trying to run an empty main(), compiled with gcc 7, with the custom libc and -march=armv8.3-a, on an N1 chip. I'm on Ubuntu 18.04.

I'm using aarch64hacks2 from your fork, which is at keno/rr@beb5093d4e9f7917a3deb6fcf35990c3d55375ba.

The error I get, during replay, is:

[FATAL ../src/Task.cc:2714:ptrace_if_alive() errno: EIO] 
 (task 1536 (rec:1533) at time 1)
 -> Assertion `!errno' failed to hold. ptrace(PTRACE_SYSEMU, 1536, addr=0, data=0) failed with errno 5
Tail of trace dump:
=== Start rr backtrace:
bin/rr(_ZN2rr13dump_rr_stackEv+0x48)[0xaaaaaeded35c]
bin/rr(_ZN2rr9GdbServer15emergency_debugEPNS_4TaskE+0x154)[0xaaaaaec45ff8]
bin/rr(+0x360aa4)[0xaaaaaec72aa4]
bin/rr(_ZN2rr21EmergencyDebugOstreamD1Ev+0x70)[0xaaaaaec72d48]
bin/rr(_ZN2rr4Task15ptrace_if_aliveEiNS_10remote_ptrIvEEPv+0x218)[0xaaaaaedb7140]
bin/rr(_ZN2rr4Task16resume_executionENS_13ResumeRequestENS_11WaitRequestENS_12TicksRequestEi+0x6c8)[0xaaaaaedb12a4]
bin/rr(_ZN2rr13ReplaySession21cont_syscall_boundaryEPNS_10ReplayTaskERKNS0_15StepConstraintsE+0x114)[0xaaaaaed4a008]
bin/rr(_ZN2rr13ReplaySession13enter_syscallEPNS_10ReplayTaskERKNS0_15StepConstraintsE+0x334)[0xaaaaaed4a674]
bin/rr(_ZN2rr13ReplaySession18try_one_trace_stepEPNS_10ReplayTaskERKNS0_15StepConstraintsE+0x144)[0xaaaaaed4e1bc]
bin/rr(_ZN2rr13ReplaySession11replay_stepERKNS0_15StepConstraintsE+0x10c)[0xaaaaaed4f314]
bin/rr(_ZN2rr14ReplayTimeline19replay_step_forwardENS_10RunCommandEl+0xd0)[0xaaaaaed672f0]
bin/rr(_ZN2rr9GdbServer12serve_replayERKNS0_15ConnectionFlagsE+0x58)[0xaaaaaec4507c]
bin/rr(+0x434170)[0xaaaaaed46170]
bin/rr(_ZN2rr13ReplayCommand3runERSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EE+0x40c)[0xaaaaaed46ae8]
bin/rr(main+0x21c)[0xaaaaaee0e370]

Other things I've noticed:

My other general question is: what would I expect to work at this point? Might I be able to use rr on smallish programs now, given the lack of syscallbuf support (which I understand is mainly a performance issue)? How about larger programs?

Thanks again!

khuey commented 3 years ago

I'm going to go ahead and say that this issue has reached the end of its useful life.

Here's the state of the world:

AArch32 support is WONTFIX, as far as we are concerned. Even if implementing rr on 32 bit ARM were possible I don't think anybody cares about it at this point. AArch64 support is present, modulo the following:

GitMensch commented 2 years ago

I'd like to adjust the README for the current support state as it still says

rr currently requires either:

Is it correct to say "certain AArch64 processors (see https://github.com/rr-debugger/rr/wiki/ARM)" with a new wiki page that has the "state of the world" from above (especially the syscallbuf with the reference to #2745)?

rocallahan commented 2 years ago

ARM support is still experimental so I'd leave the wiki as-is for now.

GitMensch commented 2 years ago

OK, as you like: no entry in the README and no page in the wiki about experimental support.

Would it be useful for the project to test loongarch64 (I don't need rr on that machine but have access to it for some days)?

rocallahan commented 2 years ago

Loongarch64 won't work, rr would have to be ported. I have no idea if such a port would be possible (e.g. whether Loongarch64 has suitable performance counters).