Emulated mlock + MADV_DONTNEED diverges

napaalm commented 9 months ago

I have a trace of Mixxx 2.3.6 run on an Arch Linux host which causes the above error when replayed.

Here's the complete log of rr replay -a: replay.log.

I can provide the trace files, but I think it's very easy to reproduce:

Install mixxx with your package manager
Run rr record mixxx -platform xcb and stop the execution when the file manager window appears

khuey commented 9 months ago

Does replaying a trace recorded with rr record -n work?

napaalm commented 9 months ago

@khuey it doesn't work. I get the following error: [FATAL src/replay_syscall.cc:1242:rep_process_syscall_arch()]

Here's the new replay.log

khuey commented 9 months ago

Can you run rr dump -p -m -b on that trace and attach that output?

napaalm commented 9 months ago

Can you run rr dump -p -m -b on that trace and attach that output?

@khuey here's the output using the last trace (with -n): dump.txt.gz

khuey commented 9 months ago

Ok, we're diverging at madvise(<map>, <size>, MADV_DONTNEED). <size> covers most but not all of the mapping. The mapping was PROT_NONE but also MAP_PRIVATE. During recording the madvise returned -EINVAL.

Earlier in the trace there was an mlockall that used MCL_FUTURE. mlockall is emulated during replay, so the lock state varies between recording and replay, and that causes the divergence at this madvise.

khuey commented 9 months ago

Our mlock support goes way back to 9716b66798227f9f93cc8c4a85b256728a05b479 in 2015. @rocallahan, seems like this has always been broken w.r.t MADV_DONTNEED, no?

napaalm commented 9 months ago

@khuey @rocallahan is there any simple workaround that I can use for the time being?

rocallahan commented 9 months ago

I don't think there's a simple workaround, except maybe "make the application not call MADV_DONTNEED on locked memory, since it doesn't work and breaks rr".

This is not trivial to fix. Here's what happens (in my testcase, which I assume is representative of the application):

Application does mlock()
Application does madvise(MADV_DONTNEED). This completely fails since the region is locked, no memory values change.
During replay, rr ignores mlock()
During replay, rr passes through the madvise(MADV_DONTNEED). Note that normally rr replay skips failed syscalls, but we have an exception for madvise() because there are (or were) cases where madvise() could return errors but still make some changes to memory.
Because the memory is not locked in replay, that madvise() clears the memory. It was not cleared during recording, so we've diverged.

rocallahan commented 9 months ago

Options for a fix:

During replay, track which memory is locked and emulate the error behavior. Not good for several reasons. There is optional "lock on fault" behavior which we can't really track. We'd have to test madvise() carefully to figure out whether it can do work before returning the error. And this tracking and emulation is quite a lot of work.
During recording, somehow figure out which pages were actually cleared by madvise() and arrange for exactly those pages to be cleared during replay. Annoying, since we'd have to put that information in the trace somehow and handle trace compatibility issues. Figuring out which pages were cleared might be slow, unreliable or impossible. (We can approximate it using PFNs maybe...)

Neither of those options are appealing. I'll try to think of another one.

rocallahan commented 9 months ago

Actually I think the only case where madvise() does work and then returns an error is when it fails with ENOMEM. In the case where care about with locked memory, it returns EINVAL. So I think we just need to skip replaying if it returned anything but ENOMEM.

However, there's another complication, which is that madvise() gets syscall buffering, so we also need to do something there...

rocallahan commented 9 months ago

The syscall buffering situation is painful.

One possible option: before we execute a buffered madvise(MADV_DONTNEED), try a madvise(MADV_COLD) of the same range. If any pages are locked etc that will return an error, in which case we can go to a traced madvise(MADV_DONTNEED) which lets rr do the right thing. If there's no error, we can proceed to a buffered madvise(MADV_DONTNEED). Two issues with this option:

MADV_COLD is 5.4 only. We'd have to disable buffered MADV_DONTNEED on pre-5.4 kernels.
I wonder if MADV_COLD would hurt performance. I think probably not since MADV_DONTNEED drops the memory backing for those pages anyway.

khuey commented 9 months ago

Actually I think the only case where madvise() does work and then returns an error is when it fails with ENOMEM

madvise can definitely partially work

#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

int main(void) {
  int ret;
  void* p = mmap(NULL, 4096*3, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  int* pp = p;
  int* ppp = p + 4096*2;
  assert(p);
  *pp = 42;
  *ppp = 42;
  ret = mlock(p + 4096, 4096);
  assert(ret == 0);
  ret = madvise(p, 4096*3, MADV_DONTNEED);
  assert(ret == -1 && errno == EINVAL);
  assert(*ppp == 42);
  assert(*pp == 42);
  return 0;
}

rocallahan commented 9 months ago

ugh, that sucks.

rocallahan commented 9 months ago

The MADV_COLD trick works to get this issue booted out of syscallbuf. For the unbuffered syscalls we'll need one of my original two options. I lean towards reading the PFNs.

khuey commented 9 months ago

Personally I would also vote for reading kernel state over maintaining our own shadow state.

rocallahan commented 9 months ago

While I was working on this I realized that we have a big problem if we want to handle madvise() of MAP_PRIVATE file mappings correctly in syscall-buffered code. It's OK if the file is actually a mapped file with the right contents during replay, but if during replay it's actually an anonymous mapping that we've populated with trace data, our buffered madvise() will zero it. The MADV_COLD trick won't help.

rocallahan commented 9 months ago

So I think the fix for that should be a separate change to make sure that private mappings of files during recording are private mappings of files during replay.

rocallahan commented 9 months ago

This does, however, raise the question of what we should store in the trace for non-buffered madvise(). I was planning to record actual values --- ranges of zeroes, for anonymous VMAs. For private file-backed mappings, we could store the actual restored data, but that would be inconsistent with syscall buffering where we won't store anything.

Maybe it's cleaner to add something to event.syscall.extra to record the range(s) affected by the madvise instead of using MemWrite records.

rocallahan commented 8 months ago

5b02260ab8b524e6fd936ae6b36ded46a9a0f451

rr-debugger / rr

Emulated mlock + MADV_DONTNEED diverges #3667