Open shravanrn opened 3 years ago
I think that's EBADF
which is a bit strange.
You'd have to dig into rr to figure out which fd we tried to map in the tracee and whether that's valid in the tracee when we try to get the tracee to mmap it.
Are you using the rr from the distro? You should test with a build from the latest source if you are.
@khuey I'm using the rr 5.4 release from the releases page here. I will try building from the source to see if I can repro on the latest also
You'd have to dig into rr to figure out which fd we tried to map in the tracee and whether that's valid in the tracee when we try to get the tracee to mmap it.
@rocallahan hmm... Happy to give it a shot, although I may need a little more info on the steps i should take to do this. Alternately, if I attach an rr core dump or something similar, would this help identify the root cause?
You'll want to build from source first as Kyle suggested. After that, reproduce the bug, and when rr pauses as before, attach gdb to it and get a stack trace. In AutoRemoteSyscalls::finish_direct_mmap we'll want the value of fd
. And that that point, grab the tid of the tracee (t->tid) and then lsof -p <tid>
.
@rocallahan @khuey I see the same error on the rr built from the latest source as well.
I attached gdb to rr
AutoRemoteSyscalls::finish_direct_mmap we'll want the value of fd
This says -75. This definitely looks weird, as the value is negative...
grab the tid of the tracee (t->tid) and then lsof -p
.
shr@shr-Desktop:~$ lsof -p 50471
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/121/gvfs
Output information may be incomplete.
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
rr:xpcshe 50471 shr cwd DIR 8,1 4096 65408719 /mnt/sata/Code/mozilla-central
rr:xpcshe 50471 shr rtd DIR 259,1 4096 2 /
rr:xpcshe 50471 shr txt REG 8,1 13960 65735321 /mnt/sata/Code/rr-build/bin/rr_exec_stub_32
rr:xpcshe 50471 shr mem REG 8,1 7265112 169413630 /mnt/sata/ffbuilds/firefox_latest_debug32/dist/bin/xpcshell
rr:xpcshe 50471 shr mem REG 8,1 18544 68879637 /mnt/sata/Code/rr-build/lib/rr/librrpage_32.so
rr:xpcshe 50471 shr DEL REG 259,1 134451 /tmp/rr-shared-preload_thread_locals-50471-0
rr:xpcshe 50471 shr mem REG 8,1 27436 146542252 /mnt/sata/ffbuilds/firefox_latest_debug32/config/external/nspr/ds/libplds4.so
rr:xpcshe 50471 shr mem REG 8,1 49776 146542253 /mnt/sata/ffbuilds/firefox_latest_debug32/config/external/nspr/libc/libplc4.so
rr:xpcshe 50471 shr mem REG 8,1 638936 124324203 /mnt/sata/ffbuilds/firefox_latest_debug32/config/external/nspr/pr/libnspr4.so
rr:xpcshe 50471 shr mem REG 259,1 159712 13238949 /lib/i386-linux-gnu/libpthread-2.31.so
rr:xpcshe 50471 shr mem REG 8,1 323464 68879641 /mnt/sata/Code/rr-build/lib/rr/librrpreload_32.so
rr:xpcshe 50471 shr mem REG 259,1 180628 13238872 /lib/i386-linux-gnu/ld-2.31.so
rr:xpcshe 50471 shr 0u CHR 136,10 0t0 13 /dev/pts/10
rr:xpcshe 50471 shr 1u CHR 136,10 0t0 13 /dev/pts/10
rr:xpcshe 50471 shr 2u CHR 136,10 0t0 13 /dev/pts/10
rr:xpcshe 50471 shr 999w CHR 1,3 0t0 6 /dev/null
rr:xpcshe 50471 shr 1001u unix 0x0000000000000000 0t0 203202 type=STREAM
I also attached gdb per the command given in the last line (i guess this attaches to the tracee?) and unfortunately didn't see anything useful
> gdb '-l' '10000' '-ex' 'set sysroot /' '-ex' 'target extended-remote 127.0.0.1:49310' /mnt/sata/ffbuilds/firefox_latest_debug32/dist/bin/xpcshell
0x70000011 in syscall_priv_untraced ()
(gdb) bt
#0 0x70000011 in syscall_priv_untraced ()
#1 0x00000000 in ?? ()
(gdb) info threads
Id Target Id Frame
* 1 Thread 49223.49223 (xpcshell) 0x70000011 in syscall_priv_untraced ()
I don't see how it could be -75. AutoRemoteSyscalls::check_syscall_result
should be called by infallible_syscall(syscall_number_for_openat(arch()))
, and check_syscall_result
should trigger an assertion failure for ret == -75. Can you figure out why that didn't happen?
@rocallahan It seems this check is bypassed by an integer truncation issue. infallible_syscall gets a long ret
value of 4294967221
which is more than 2^32. This is then checked as a long to print an error if the value is (-4096 < ret && ret < 0)
. However, finish_direct_mmap
uses the result and converts it implicitly into an int
which treats this value as -75
, which subsequently triggers an error
I'm not sure why the call to infallible_syscall
returns this value in the first place or whether this is a valid/invalid return for this syscall. Assuming it is useful, let me know if there is a way I can provide any info on the return from infallible_syscall
a long ret value of 4294967221 which is more than 2^32
It's actually less but I see your point. I guess on a 32-bit architecture we should be treating ret
as 32-bit for that error check.
Looks like mmap is returning
EOVERFLOW
On 32-bit architecture together with the large file extension (i.e., using 64-bit off_t): the number of pages used for length plus number of
pages used for offset would overflow unsigned long (32 bits).
So I think we need to know the parameters to AutoRemoteSyscalls::finish_direct_mmap
, in particular backing_offset_pages
and length
in AutoRemoteSyscalls::finish_direct_mmap
. Sounds like that's overflowing and I'm not sure why... backing_file_name
and the actual length of that file in the trace (plus whatever else you can tell us about that file) would be useful too.
It's actually less but I see your point.
Ah right! :)
So I think we need to know the parameters to AutoRemoteSyscalls::finish_direct_mmap,
@rocallahan Pasting the parameters to finish_direct_mmap below.
rec_addr = {ptr = 3703197696}
length = 455916345
prot = 1
flags = 2050
backing_file_name = "/mnt/sata/ffbuilds/firefox_latest_debug32/toolkit/library/build/libxul.so"
backing_file_open_flags = 0
backing_offset_pages = 0
real_file = (stat &) @0x7ffc011e8980: {st_dev = 140720327265536, st_ino = 94373161760440, st_nlink = 140720327264688, st_mode = 4128824699, st_uid = 21972, st_gid = 0, __pad0 = 0, st_rdev = 94373161760440, st_size = 140720327264720, st_blksize = 94373161760440, st_blocks = 140720327265128, st_atim = {tv_sec = 94373161743456, tv_nsec = 140720327264784}, st_mtim = {tv_sec = 94373150256432, tv_nsec = 0}, st_ctim = {tv_sec = 94373161760432, tv_nsec = 94373152610728}, __glibc_reserved = {140720327265128, 94373161743456, 94373161743456}}
real_file_name = ""
the actual length of that file in the trace (plus whatever else you can tell us about that file)
The file is a large shared library Size: 2212751820
i.e. 2.2 GB. Wondering if this is in some way a case of rr replay hitting a OOM?
I'm guessing you may know more about libxul.so
than me given that rr was developed/tested with firefox, but essentially most of firefox code lives in this shared library. This shared library is loaded by relatively thin main binary (I think firefox is designed this way so that the multi-process architecture leverages OS support for shared object code page sharing). This library is from a local Firefox build that is compiled with -O0 -g
which probably explains the size. Is there any additional info that you would need on this file?
I cannot explain why we'd get EOVERFLOW with those parameters. "the number of pages used for length plus number of pages used for offset" would only be 455916345/4096 which is certainly less than 2^32.
If you want to dig into this further, try writing a C program that executes mmap with those parameters (opening that specific file and then mapping it), build it as 32 bits, and see if it gets EOVERFLOW too.
I'm using rr to record a 32-bit build of Firefox, on an Ubuntu 20.04.3 LTS system, however I hit an assert about mmap (log below).
Info
Recorded code: custom build of Firefox (from the main mozilla-central repo with commits from around Aug 1st) OS: Ubuntu 20.04.4 LTS Processor: Intel Skylake 6700K
Any help would be much appreciated! Please let me know if you need any more information.
Error log