Open seelabs opened 3 years ago
Is the instruction at 0xb8b83e2 an RDRAND by chance?
@khuey It appears to be a shift left from code in OpenSSL, although the instruction before it is rdtsc, which is "read timestamp counter". Maybe that's the issue. Here's the code around that instruction if it helps:
    0xb8b83c0 <OPENSSL_atomic_add>      mov    (%rdi),%eax
    0xb8b83c2 <OPENSSL_atomic_add+2>    lea    (%rsi,%rax,1),%r8
    0xb8b83c6 <OPENSSL_atomic_add+6>    lock cmpxchg %r8d,(%rdi)
    0xb8b83cb <OPENSSL_atomic_add+11>   jne    0xb8b83c2 <OPENSSL_atomic_add+2>
    0xb8b83cd <OPENSSL_atomic_add+13>   mov    %r8d,%eax
    0xb8b83d0 <OPENSSL_atomic_add+16>   cltq
    0xb8b83d2 <OPENSSL_atomic_add+18>   repz ret
    0xb8b83d4                           data16 nopw %cs:0x0(%rax,%rax,1)
    0xb8b83df                           nop
    0xb8b83e0 <OPENSSL_rdtsc>           rdtsc
B+> 0xb8b83e2 <OPENSSL_rdtsc+2>         shl    $0x20,%rdx
    0xb8b83e6 <OPENSSL_rdtsc+6>         or     %rdx,%rax
    0xb8b83e9 <OPENSSL_rdtsc+9>         repz ret
    0xb8b83eb                           nopl   0x0(%rax,%rax,1)
    0xb8b83f0 <OPENSSL_ia32_cpuid>      mov    %rbx,%r8
    0xb8b83f3 <OPENSSL_ia32_cpuid+3>    xor    %eax,%eax
    0xb8b83f5 <OPENSSL_ia32_cpuid+5>    mov    %rax,0x8(%rdi)
    0xb8b83f9 <OPENSSL_ia32_cpuid+9>    cpuid
    0xb8b83fb <OPENSSL_ia32_cpuid+11>   mov    %eax,%r11d
    0xb8b83fe <OPENSSL_ia32_cpuid+14>   xor    %eax,%eax
    0xb8b8400 <OPENSSL_ia32_cpuid+16>   cmp    $0x756e6547,%ebx
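For readers less used to reading disassembly: the three instructions in OPENSSL_rdtsc read the timestamp counter, which arrives as two 32-bit halves (high half in EDX, low half in EAX), and then merge them into one 64-bit value with the shift and the or. A minimal C++ sketch of that merge step follows; the name combine_tsc is hypothetical and is not part of OpenSSL.

```cpp
#include <cstdint>

// Hypothetical helper (not an OpenSSL function): mirrors
//   shl $0x20,%rdx ; or %rdx,%rax
// by placing the EDX half in the upper 32 bits and the EAX half in the lower.
constexpr std::uint64_t combine_tsc(std::uint32_t edx, std::uint32_t eax) {
    return (static_cast<std::uint64_t>(edx) << 32) | eax;
}
```

The breakpoint address 0xb8b83e2 is the shift, that is, the first instruction after rdtsc itself.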
Ah, ok, those instruction trap events are from the rdtsc then. That should be fine.
Unfortunately debugging these kinds of issues is pretty tough. If you move a recording made on the Intel machine to the AMD machine does it replay? How about the reverse?
@khuey No, I am not able to replay a recording made on the Intel machine and moved to the AMD machine. I get the error:
[FATAL ../../src/ReplaySession.cc:187:ReplaySession()] Trace was recorded with CPUID faulting enabled, but this system does not support CPUID faulting.
I am also unable to replay a recording made on the AMD machine on the Intel machine, but this is due to different software being installed. I get the error:
[FATAL ../src/TraceStream.cc:1030:read_mapped_region() errno: ENOENT] Failed to stat /usr/lib/x86_64-linux-gnu/ld-2.31.so: replay is impossible
Is the software you're recording here something you can share a trace of? Or is it proprietary/private/whatever?
@khuey I'm working on the XRP Ledger (https://github.com/ripple/rippled). I'm happy to share traces. What would you like? Just the data and events files, or the whole directory with those files? Unfortunately, that directory is almost a gig, but I'm happy to upload it if it'll help.
I don't expect you to build it yourself, but if you'd like to reproduce it yourself: I was on the develop branch. The build instructions are here: https://xrpl.org/build-run-rippled-ubuntu.html. I'm more than happy to help you build it if you run into issues. Recording and replaying the PayStrand unit test should reproduce the issue (rippled --unittest=PayStrand).
Given that the problem is specific to a CPU that I don't have, I'm not going to be able to reproduce it myself.
If you could produce a failing trace with rr record, then rr pack it, and upload it somewhere or email it to me, that would be helpful.
@khuey I ran rr pack, tar'd up the directory, and uploaded it to this repo: https://github.com/seelabs/rr_traces. I had to split the files to get around GitHub's file size limit, so you need to recombine them with cat (see the README file).
Thanks. Unfortunately I can't run that trace locally because it uses the new fancy SHA256 instructions and my CPU is too old for that (CPUs with this capability also seem to be new enough that they haven't made it into the inventory of cloud vendors like AWS).
I was, however, able to examine the binaries. Since this is AMD-specific and your program does cryptography, I strongly suspect the problem here is that an rdrand instruction is being executed. On Intel CPUs we can intercept the CPUID instruction and mask off the bits for rdrand. There's no equivalent capability on AMD hardware, and we've run into this problem before (e.g. #2766). I see two functions in the rippled binary that use rdrand. One is OPENSSL_ia32_rdrand_bytes, the other is std::(anonymous namespace)::__x86_rdrand(void*). We've had a hack to disable OpenSSL's rdrand use for a long time (https://github.com/rr-debugger/rr/blob/d3b38fc768a45fc119f635b099b322960d5c449a/src/RecordSession.cc#L2214), so I suspect the standard library is the culprit here. If you can replay this trace with a breakpoint on __x86_rdrand and hit that breakpoint, then that is the problem.
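To make "mask off the bits for rdrand" concrete: CPUID leaf 1 advertises RDRAND in bit 30 of ECX, and well-behaved code checks that bit before executing the instruction. The sketch below only illustrates the bit manipulation an interceptor could apply to the value a program sees; it is not rr's code, and the names kRdrandBit, advertises_rdrand, mask_rdrand and read_leaf1_ecx are hypothetical. It assumes GCC or Clang, whose <cpuid.h> provides __get_cpuid.

```cpp
#include <cstdint>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>  // GCC/Clang wrapper around the CPUID instruction
#endif

// CPUID leaf 1 reports RDRAND support in ECX bit 30.
constexpr std::uint32_t kRdrandBit = 1u << 30;

// True if the given leaf-1 ECX value advertises RDRAND.
constexpr bool advertises_rdrand(std::uint32_t leaf1_ecx) {
    return (leaf1_ecx & kRdrandBit) != 0;
}

// Hypothetical illustration of masking: return ECX with the RDRAND bit
// cleared, leaving every other feature bit untouched.
constexpr std::uint32_t mask_rdrand(std::uint32_t leaf1_ecx) {
    return leaf1_ecx & ~kRdrandBit;
}

// Read the real leaf-1 ECX value. Returns 0 when leaf 1 is unavailable
// or when compiled for a non-x86 target.
inline std::uint32_t read_leaf1_ecx() {
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return ecx;
    }
#endif
    return 0;
}
```

A program that calls advertises_rdrand(mask_rdrand(read_leaf1_ecx())) always gets false, which is the effect CPUID interception has on Intel. Without that interception the program sees the unmasked value and may go on to execute rdrand.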
rippled does make use of std::random_device. You might try changing the constructors in rippled to pass "/dev/urandom" as an argument (instead of relying on the default constructor), which will make libstdc++ use /dev/urandom instead of rdrand, which will be fine for rr.
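A minimal sketch of that suggestion, assuming libstdc++ on Linux, where the constructor's token argument can name a device file. The helper draw_seed is hypothetical and is not taken from rippled; the only change that matters is passing the token instead of default-constructing.

```cpp
#include <random>
#include <string>

// Hypothetical helper (not rippled code): draw one seed value from an
// explicitly named entropy source. With libstdc++ the "/dev/urandom" token
// makes std::random_device read from that device file rather than picking
// its default source, which may be the rdrand instruction.
inline std::random_device::result_type
draw_seed(const std::string& token = "/dev/urandom") {
    std::random_device rd(token);  // instead of: std::random_device rd;
    return rd();
}
```

A call such as std::mt19937 engine(draw_seed()); then seeds an engine without touching rdrand. The token string is implementation-defined, so other standard libraries may interpret it differently.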
Unfortunately there's not a ton we can do here from the rr side. Even if we got a patch into libstdc++ to allow disabling the use of rdrand via an environment variable or something it would take a long time for it to reach users.
I set a breakpoint on __x86_rdrand and reran. It hits the assertion before it hits the breakpoint. Still, your diagnosis of rdrand being the culprit is a good theory. With that hint I'll spend some time on my side seeing if I can resolve this. Thanks for all the effort you put into this. Much appreciated!
Hrm, ok. If you could generate a trace without the SHA256 instructions somehow I can take another look.
@khuey Yeah, I should be able to do that.
Hmm, I'm not building with any -march flags. I wonder if a library (maybe OpenSSL) is dynamically querying for the CPU type and using SHA256 from that. Let me see what I can find about that.
Edit: or more likely the OpenSSL library was compiled with -march. I'll see about re-compiling that library.
So far I'm not sure how to get it to compile without the SHA256 asm instructions. I tried adding a global -DOPENSSL_NO_ASM, but when I look at the disassembled executable, the instructions are still there. I'll try again later.
One interesting data point: when I re-compiled with clang, rr ran with no problems. When I compiled with gcc again, I hit the assertion.
I've just hit it with rr 5.6.0-2 on Debian, also running on an AMD CPU.
Is recompiling from sources w/ clang still the only fix or workaround?
You hit it on which CPU?
AMD Threadripper Pro 3995WX.
Recompiling rr from sources w/ clang appears to work.
@seelabs Am I right that your workaround was to recompile OpenSSL with clang, which presumably did not use the SHA256 instructions [and may do now with an updated version]?
[...] I am not able to replay a recording made on the Intel machine and moved to the AMD machine. I get the error:
[FATAL ../../src/ReplaySession.cc:187:ReplaySession()] Trace was recorded with CPUID faulting enabled, but this system does not support CPUID faulting.
I am also unable to replay a recording made on the AMD machine on the Intel machine, but this is due to different software being installed. I get the error:
[FATAL ../src/TraceStream.cc:1030:read_mapped_region() errno: ENOENT] Failed to stat /usr/lib/x86_64-linux-gnu/ld-2.31.so: replay is impossible
Both directions may work if you do the following with a current rr:
# on your recording machine
$> rr record --disable-cpuid-faulting yourprog
$> rr pack
# then transfer the recording and on the target machine do
$> rr replay --disable-cpuid-faulting yourtracedir
In any case: Is the original issue solved "by just updating rr" or is this still happening (when using system-default OpenSSL)?
I'm running rr built from this commit: e9ec388d95ae05bc2eceb0361aea39510f08e17e on an "AMD Ryzen Threadripper 3970X 32-Core Processor". I run a script that does the following:
When I play back a recording, rr complains with the following message (I get a similar error leaving off perf_cpu_time_max_percent in the script, but thought I'd mention I tried that):
Please let me know if there's anything I can do on my end to help resolve this. I'll note that I am able to record and replay the same program on my Intel-based system.