rr-debugger / rr

Record and Replay Framework
http://rr-project.org/
Other
9.1k stars 581 forks source link

Emulated mlock + MADV_DONTNEED diverges #3667

Closed napaalm closed 8 months ago

napaalm commented 9 months ago

I have a trace of Mixxx 2.3.6 run on an Arch Linux host which causes the above error when replayed.

Here's the complete log of rr replay -a: replay.log.

I can provide the trace files, but I think it's very easy to reproduce:

khuey commented 9 months ago

Does replaying a trace recorded with rr record -n work?

napaalm commented 9 months ago

@khuey it doesn't work. I get the following error: [FATAL src/replay_syscall.cc:1242:rep_process_syscall_arch()]

Here's the new replay.log

khuey commented 9 months ago

Can you run rr dump -p -m -b on that trace and attach that output?

napaalm commented 9 months ago

Can you run rr dump -p -m -b on that trace and attach that output?

@khuey here's the output using the last trace (with -n): dump.txt.gz

khuey commented 9 months ago

Ok, we're diverging at madvise(<map>, <size>, MADV_DONTNEED). <size> covers most but not all of the mapping. The mapping was PROT_NONE but also MAP_PRIVATE. During recording the madvise returned -EINVAL.

Earlier in the trace there was an mlockall that used MCL_FUTURE. mlockall is emulated during replay, so the lock state varies between recording and replay, and that causes the divergence at this madvise.

khuey commented 9 months ago

Our mlock support goes way back to 9716b66798227f9f93cc8c4a85b256728a05b479 in 2015. @rocallahan, seems like this has always been broken w.r.t MADV_DONTNEED, no?

napaalm commented 9 months ago

@khuey @rocallahan is there any simple workaround that I can use for the time being?

rocallahan commented 9 months ago

I don't think there's a simple workaround, except maybe "make the application not call MADV_DONTNEED on locked memory, since it doesn't work and breaks rr".

This is not trivial to fix. Here's what happens (in my testcase, which I assume is representative of the application):

rocallahan commented 9 months ago

Options for a fix:

Neither of those options are appealing. I'll try to think of another one.

rocallahan commented 9 months ago

Actually I think the only case where madvise() does work and then returns an error is when it fails with ENOMEM. In the case where care about with locked memory, it returns EINVAL. So I think we just need to skip replaying if it returned anything but ENOMEM.

However, there's another complication, which is that madvise() gets syscall buffering, so we also need to do something there...

rocallahan commented 9 months ago

The syscall buffering situation is painful.

One possible option: before we execute a buffered madvise(MADV_DONTNEED), try a madvise(MADV_COLD) of the same range. If any pages are locked etc that will return an error, in which case we can go to a traced madvise(MADV_DONTNEED) which lets rr do the right thing. If there's no error, we can proceed to a buffered madvise(MADV_DONTNEED). Two issues with this option:

khuey commented 9 months ago

Actually I think the only case where madvise() does work and then returns an error is when it fails with ENOMEM

madvise can definitely partially work

#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

int main(void) {
  int ret;
  void* p = mmap(NULL, 4096*3, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  int* pp = p;
  int* ppp = p + 4096*2;
  assert(p);
  *pp = 42;
  *ppp = 42;
  ret = mlock(p + 4096, 4096);
  assert(ret == 0);
  ret = madvise(p, 4096*3, MADV_DONTNEED);
  assert(ret == -1 && errno == EINVAL);
  assert(*ppp == 42);
  assert(*pp == 42);
  return 0;
}
rocallahan commented 9 months ago

ugh, that sucks.

rocallahan commented 9 months ago

The MADV_COLD trick works to get this issue booted out of syscallbuf. For the unbuffered syscalls we'll need one of my original two options. I lean towards reading the PFNs.

khuey commented 9 months ago

Personally I would also vote for reading kernel state over maintaining our own shadow state.

rocallahan commented 9 months ago

While I was working on this I realized that we have a big problem if we want to handle madvise() of MAP_PRIVATE file mappings correctly in syscall-buffered code. It's OK if the file is actually a mapped file with the right contents during replay, but if during replay it's actually an anonymous mapping that we've populated with trace data, our buffered madvise() will zero it. The MADV_COLD trick won't help.

rocallahan commented 9 months ago

So I think the fix for that should be a separate change to make sure that private mappings of files during recording are private mappings of files during replay.

rocallahan commented 9 months ago

This does, however, raise the question of what we should store in the trace for non-buffered madvise(). I was planning to record actual values --- ranges of zeroes, for anonymous VMAs. For private file-backed mappings, we could store the actual restored data, but that would be inconsistent with syscall buffering where we won't store anything.

Maybe it's cleaner to add something to event.syscall.extra to record the range(s) affected by the madvise instead of using MemWrite records.

rocallahan commented 8 months ago

5b02260ab8b524e6fd936ae6b36ded46a9a0f451