Keno opened 7 years ago
lfs 2.7.1.11, in case it matters.
I'm looking into the flock test failure on lustrefs. My best guess is that Lustre doesn't like us interrupting it with the desched signal. How do we usually handle that? Do we explicitly restart the system call somewhere?
We don't handle anything like that currently. The closest I've seen was when trying to add syscall buffering for `epoll_wait`, which I think hit the kernel bug referenced here: https://www.varnish-cache.org/lists/pipermail/varnish-commit/2016-January/014928.html. I ended up just abandoning that.
The immediate problem that makes this hard to fix is knowing when the `flock` call would really complete if it exits with EINTR whenever we get descheduled. You could modify the syscallbuf to automatically retry with a traced `flock` after EINTR, but of course that might be incorrect if the EINTR occurred legitimately. A hack that would work OK is to disable `flock` syscall buffering if lustrefs is mounted.
Probably should report a lustrefs bug in any case.
Does it make any difference if you change the flock mount option to localflock?
Unfortunately, I cannot change any mount options on this system, so it's not easy for me to try that.
I'm gonna try the KVM setup guide at http://wiki.lustre.org/KVM_Quick_Start_Guide, and see if I can get a dev setup running on one of my machines.
What's the Lustre version in use? Is there a way to get the test binary without having to pull in a lot of dependencies to build it myself?
As far as I know the version mentioned above is the relevant version here (if not, please let me know what command to run). rr has no dependencies other than cmake, so just cloning this repository and doing a standard cmake build should be sufficient. I'd also be happy to do a build and zip up the result, but building from source should be very straightforward. The relevant command to try is in the original post (adjust the build location accordingly, of course). In my tests it deterministically failed when run in a Lustre directory, but passed on all other file systems I've tried (ext4, btrfs, gpfs, tmpfs). Also, I was unable to get a Lustre dev setup running, so I did not try the suggestion above.
Ah, 2.7.1.11, hm, that's a strange version number; I guess that just means 2.7.1 or something close to it. (`cat /proc/fs/lustre/version` should be good enough.)
The filesystems you tried are all local. Did you try any other network filesystems, like, say, nfs4 (should be the easiest to set up)?
I am going to try this on my local lustre setup and see what happens. What's the distro (I only care due to the kernel), rhel6? rhel7? ubuntu of some sort?
As far as I'm aware gpfs is distributed. The distribution is Cray Linux, with the kernel based on 3.12 as far as I can tell:
```
kfischer@cori11:~> cat /proc/fs/lustre/version
lustre: 2.7.1.11
kernel: patchless_client
build:  2.7.1.11-trunk-1.0600.f2563c6.3.2-abuild-lustre-filesystem.git@f2563c6-2016-10-26-20:44
kfischer@cori11:~> uname -r
3.12.60-52.57.1.11767.0.PTF.996988-default
```
Ah, sles12 for Cray, I think. Yes, gpfs is distributed, I just missed it in the list. Thanks. I'll try it shortly to see what's going on here.
Also, Cray explains why the version is so strange. They roll in a bunch of their own patches, and their tree sometimes has significant departures from mainline.
Hm, it does seem to be pulling in a bunch of stuff that I don't have on my test nodes. Can I have just the 64-bit binary that would run on rhel7, please?
Just sticking my oar in so I can get the binary as well... I'm on the Cray Lustre side. (Green is right about the version info ;) )
Also, if you wanted to report this to the Cray site staff so they can open a bug, that would be helpful too... (Even if Green or I figure out the problem and create a patch without you going through Cray, you'll need that bug so we can get the fix installed on Cori.)
Ok, here's the build directory (built on Cori, so it's the same binary I've been using for tests): http://anubis.juliacomputing.io:8844/rr-build.tar.gz. Please let me know if it doesn't work (e.g. due to libc version, etc), in which case I'll spin up a CentOS machine and build one there.
Also, I may have been wrong about gpfs. It appears the compute nodes on Cori use Cray DataWarp for the home directory, so while the backing file system is gpfs, there's some Cray magic in there as well. I also tried running the test on the login node's home directory, which does appear to be pure gpfs (which is why I thought it would be so on the compute nodes as well), and I was able to reproduce the same behavior I saw with Lustre, so this may be a more general problem. My apologies for the incorrect information.
Just out of curiosity, did the problem happen with GPFS-via-DataWarp?
Trying to execute it:

```
[root@cent7c01 bin]# ./rr record flock
rr: Saving execution to trace directory `/root/.local/share/rr/flock-2'.
[FATAL /global/homes/k/kfischer/rr/src/PerfCounters.cc:261:start_counter() errno: ENOENT] Unable to open performance counter with 'perf_event_open'; are perf events enabled? Try 'perf record'.
```

`perf record` works on this system. I don't have the referenced source, so I can't easily dig further.
> Just out of curiosity, did the problem happen with GPFS-via-DataWarp?
No, that was fine; plain gpfs did appear to have the problem, however.
> Perf record works on this system. Don't have referenced source, so can't easily dig further.
This was built from unmodified master, so the sources are the same as those in this repository. What's in `/proc/cpuinfo`? Any virtualization techniques in use (some don't preserve the performance counters we need)? Also, can you check which counter failed by attaching gdb and getting me a backtrace?
CentOS 7, VMWare ESXi.
I'll get a backtrace momentarily.
Hmm, I thought VMWare was usually fine. Try `perf list | grep "Hardware event"`?
```
[root@cent7c01 ~]# perf list | grep "Hardware event"
  ref-cycles                                        [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend     [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend   [Hardware event]
```
GDB:
```
Starting program: /shared/paf/rr-build/bin/./rr record ./flock
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff66b9700 (LWP 2591)]
[New Thread 0x7ffff5bb7700 (LWP 2592)]
[New Thread 0x7ffff4db5700 (LWP 2593)]
[New Thread 0x7fffeffff700 (LWP 2594)]
[New Thread 0x7fffef7fe700 (LWP 2595)]
[New Thread 0x7fffeeffd700 (LWP 2596)]
[New Thread 0x7fffee7fc700 (LWP 2597)]
[New Thread 0x7fffedffb700 (LWP 2598)]
[New Thread 0x7fffed7fa700 (LWP 2599)]
rr: Saving execution to trace directory `/root/.local/share/rr/flock-8'.
Detaching after fork from child process 2600.
[FATAL /global/homes/k/kfischer/rr/src/PerfCounters.cc:261:start_counter() errno: ENOENT] Unable to open performance counter with 'perf_event_open'; are perf events enabled? Try 'perf record'.

Program received signal SIGABRT, Aborted.
0x00007ffff69f05f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56        return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
    __in_chrg=<optimized out>) at /global/homes/k/kfischer/rr/src/log.cc:264
    attr=0xa4d460 <rr::cycles_attr>)
    at /global/homes/k/kfischer/rr/src/PerfCounters.cc:262
    at /global/homes/k/kfischer/rr/src/PerfCounters.cc:346
    how=rr::RESUME_SYSCALL, wait_how=rr::RESUME_NONBLOCKING, tick_period=60253, sig=0)
    at /global/homes/k/kfischer/rr/src/Task.cc:938
    step_state=...) at /global/homes/k/kfischer/rr/src/RecordSession.cc:581
    at /global/homes/k/kfischer/rr/src/RecordSession.cc:1951
    args=std::vector of length 1, capacity 2 = {...}, flags=...)
    at /global/homes/k/kfischer/rr/src/RecordCommand.cc:314
    this=0xa4d610 <rr::RecordCommand::singleton>,
    args=std::vector of length 1, capacity 2 = {...})
    at /global/homes/k/kfischer/rr/src/RecordCommand.cc:381
    at /global/homes/k/kfischer/rr/src/main.cc:270
```
Thanks. The `perf list` output explains the problem. We need the retired branch counter, which does not seem to be available. Odd; I thought perf counters were usually pretty much an all-or-nothing deal when it comes to virtualization. Perhaps an old VMWare version? Any chance you could try on bare metal?
*grumble, grumble* Yes, it's a version issue. I'll spare you the boring details (I just spent a few minutes trying to enable the feature).
I can try it on real hardware, but might have glibc issues again...
On hardware (Cray SLES 12, probably not too different from Cori):
```
./rr record ./flock
rr: Saving execution to trace directory `/root/.local/share/rr/flock-0'.
[FATAL /global/homes/k/kfischer/rr/src/AddressSpace.cc:286:map_rr_page() errno: SUCCESS] (task 23650 (rec:23650) at time 14)
 -> Assertion `child_fd == -EACCES' failed to hold. Unexpected error mapping rr_page
Launch gdb with
  gdb '-l' '10000' '-ex' 'target extended-remote :23650' /cray/css/u18/paf/shared/rr/flock
```

It just sits there...
Tried gdb as suggested:

```
Reading symbols from /cray/css/u18/paf/shared/rr/flock...done.
Remote debugging using :32227
warning: limiting remote suggested packet size (17073526 bytes) to 16384
Remote connection closed
(gdb) quit
```

Output of rr:

```
rr: /global/homes/k/kfischer/rr/src/GdbConnection.cc:540: std::string rr::read_target_desc(const char*): Assertion `f' failed.
Aborted
```
Note that I am able to execute the 'flock' binary correctly by itself.
@Keno might it be easier for you to create a standalone testcase? Shouldn't be that hard, probably doesn't even need ptrace, just have one process call flock while another process sends it an ignored signal?
I am getting:
```
rr: Saving execution to trace directory `/root/.local/share/rr/flock-0'.
[FATAL /global/homes/k/kfischer/rr/src/PerfCounters.cc:156:get_cpu_microarch() errno: ENOTTY] CPU 0x620 unknown.
Aborted
```
@rocallahan Yes, I had started on that, but I didn't quite manage to reproduce it. I was hoping at this point we'd be robust enough to be able to have people run it ;) - wishful thinking I guess. I'll try again to do a standalone test case.
When you say "ignored signal", do you mean "blocked in the signal mask", or ...?
Just a signal with no handler and SIG_IGN disposition.
But apparently the problem is more complicated than that. Maybe you need ptrace as well.
Huh. I wasn't familiar with SIG_IGN... Lustre (and perhaps GPFS too) does interruptible waits in some places, and (usually) returns -ERESTARTSYS, making the interrupted call restartable. Do you have SA_RESTART set? If not, then you'll get -EINTR on interrupts. (Can you even set that when you're not writing your own handler? I would think so)
To clarify: When SA_RESTART is set, returning -ERESTARTSYS from Lustre causes the kernel to restart the syscall in question. If SA_RESTART is not set, then the kernel translates that to -EINTR and returns it back to userspace. (In certain cases Lustre returns -EINTR directly, causing SA_RESTART to be ignored)
Yes, I would think that if you get any sort of signal during a blocking system call, you might get EINTR and need to handle it one way or another. Though flock is kind of a bad case, since it restarts the whole thing, and with RPCs potentially taking a long time it might never complete, as a result spinning in the retry loop forever (this shouldn't even be Lustre-specific, as long as the wait is interruptible).
SA_RESTART only applies when you set a handler. Ignored signals should not cause a syscall to return EINTR.
Yes, I guess ignored signals really shouldn't, unless they can't be ignored or something. I wonder what signal it is that's actually getting through.
Hm. @rocallahan I'm digging through kernel source to look at the restart handling wrt SIG_IGN. I mean, if you interrupt a syscall, which needs to be possible, then it has to be restarted. We can't just wait uninterruptibly for network things which may not complete. (That's probably a difference between distributed file systems and local ones. I'm betting most local ones don't wait interruptibly.)
So, ignored is not the same thing as blocked. (Why not just block...?)
rr arranges for an ignored SIGSTKFLT to be delivered pretty much every time a syscall blocks in the kernel. This does not cause any other syscalls to return EINTR (if it did, we'd be hitting this here bug all the time) and it doesn't cause EINTR for flock in common filesystems like ext4 or btrfs (or tests would be failing on those filesystems, which are tested often). That's why we suspect filesystem-specific kernel issues here.
We can't block the signal because rr needs to get a blocking notification (via ptrace) that the signal is being delivered. rr responds to that notification by resuming execution of the tracee, letting it complete the syscall (and meanwhile scheduling another tracee).
> This does not cause any other syscalls to return EINTR (if it did, we'd be hitting this here bug all the time)
Correction: it doesn't cause EINTR in the set of syscalls that we have fast-paths for. And we do know that `epoll_wait` causes unwanted EINTRs.
I think the "filesystem-specific" issue here is that we're interruptible, so we're getting interrupted. The syscall is getting interrupted. I suspect that's not true of btrfs or ext4, since they wait uninterruptibly, basically working on the assumption that nothing they're waiting for will fail to return in a reasonable time frame.
Being network file systems, Lustre and GPFS can't do this realistically. (There's a bit more to it, but that's the basic thing.)
Still poking around in the kernel to try to understand stuff here better...
> I suspect that's not true of btrfs or ext4, since they wait uninterruptibly, basically working on the assumption that nothing they're waiting for will fail to return in a reasonable time frame.
That may be true, but other interruptible syscalls (e.g. `read` on a pipe or socket) wait in such a way that they don't return EINTR for ignored signals.
Hm, all right.
My digging around strongly suggests that in the case of SIG_IGN, the signal shouldn't truly be delivered at all. ptrace_signal is called to let it look at the attempted delivery, but after that, we don't call the actual handling code, which is where ERESTARTSYS is handled. (I think... It gets very hairy in here.)
(`do_signal` → `get_signal` → `get_signal_to_deliver`, and then `handle_signal`)
Since we can't reproduce the problem yet, would @Keno be able to try (just for our information) setting sa_flags to SA_RESTART? (Since there is a struct sigaction when you're setting SIG_IGN.)
OK, try the following maybe: https://gist.github.com/Keno/f257142b20d212b94182058ecee363af

Results for me are as follows:

- tmpfs/ext4/btrfs: ok
- GPFS-via-DataWarp: ok
- GPFS: reproduces the failure reported here
- Lustre: hangs
So not quite the same behavior as in the original test case on lustre, but close enough to reproduce the problem?
Well, on Lustre, I'd call that hang "expected". When waiting, Lustre sometimes blocks everything and otherwise it blocks all signals it considers non-fatal. (The hang is the parent signalling the child (while the child is waiting for an fcntl call that will not complete, because the parent holds a conflicting lock) and then waiting for the child to change state. This state change doesn't happen because the signal is blocked.)
If you switch to a signal that Lustre considers fatal and isn't blocking, the child will wake up (and die, since there's no handling set up).
So, a different bit of weirdness in signals and Lustre. Not the same one we see with rr.
Ah, after parsing through the various asserts, etc, it looks like sending SIGINT to the child will get the -1 and errno=EINTR behavior you mentioned.
Setting up a sigaction with SIG_IGN doesn't appear to do anything. So I think I've reproduced the problem(?)...
OK, yes. When we're not ptraced, the child process exits when sent SIGINT, as I'd expect, and hangs when sent '30', also more or less as expected (since it's blocked).
When I set SIG_IGN as the action for SIGINT (on the child), it hangs. No -1 and EINTR returned, syscall doesn't exit. This is when not ptraced.
When ptraced, I get -1 and EINTR (on SIGINT), whether or not I've set up the sigaction. I.e., it seems like the sigaction is getting ignored when we're ptraced.
So the question then, I suppose, is what's different about EXT4 and friends? I suspect it's something in the handling of waiting in TASK_UNINTERRUPTIBLE...
Because they must trigger the state change the parent is waiting for after sending the signal, without actually exiting the syscall. I'm guessing it's "successful delivery" when a signal is given to a process waiting in TASK_UNINTERRUPTIBLE (which doesn't care that you signalled it).
So... Lustre is interruptible. That's by design. It's getting interrupted...
And when ptraced, the sigaction seems to be ignored. That doesn't feel like a Lustre bug.
Thoughts?
Getting EINTR in this testcase is a bug, because the ptracer never delivers the signal to the tracee. I don't know which part of the kernel to blame.
> hangs when sent '30', also more or less as expected (since it's blocked).
Blocked by what? User-space hasn't blocked it.
I don't think we should invest too much in trying to determine whether this is a kernel bug or not. We generally work around kernel bugs anyway, and I think it would be effective to simply detect if certain filesystems are mounted and disable buffering of flocks in those cases, which should have no real downsides given those syscalls are already expensive on these filesystems.
> Blocked by what? User-space hasn't blocked it.
Blocked by Lustre.
And I'm certainly happy to let it go with a workaround, if that works for rr. (My area is Lustre and the kernel, so I'm inclined to dig deep.) I may keep investigating in the interests of fixing the ptrace/Lustre interaction, but I certainly don't have to do it here.
If you're willing to keep looking into the lustre/kernel side of things, I'd certainly be happy to keep helping out any way I can. We should of course still investigate the possibility of a workaround, to fix this for the immediate future (and even if we find a kernel fix to fix it for older versions).
Works fine without rr, as well as with `-n`. Mount options are `rw,flock,lazystatfs` in case it makes a difference.