Keno opened 7 years ago
lfs 2.7.1.11, in case it matters.
I'm looking into the flock test failure on lustrefs. My best guess is that Lustre doesn't like us interrupting it with the desched signal. How do we usually handle that? Do we explicitly restart the system call somewhere?
We don't handle anything like that currently. The closest I've seen was when trying to add syscall buffering for `epoll_wait`, which I think hit the kernel bug referenced here: https://www.varnish-cache.org/lists/pipermail/varnish-commit/2016-January/014928.html. I ended up just abandoning that.
The immediate problem that makes this hard to fix is knowing when the `flock` call would really complete if it exits with EINTR whenever we get descheduled. You could modify the syscallbuf to automatically retry with a traced `flock` after EINTR, but of course that might be incorrect if the EINTR occurred legitimately. A hack that would work OK is to disable `flock` syscall buffering if lustrefs is mounted.
Probably should report a lustrefs bug in any case.
Does it make any difference if you change the flock mount option to localflock?
Unfortunately, I cannot change any mount options on this system, so it's not easy for me to try that.
I'm gonna try the KVM setup guide at http://wiki.lustre.org/KVM_Quick_Start_Guide, and see if I can get a dev setup running on one of my machines.
What's the Lustre version in use? Is there a way to get the test binary without having to pull in a lot of dependencies to build it myself?
As far as I know the version mentioned above is the relevant version here (if not, please let me know what command to run). rr has no dependencies other than cmake, so just cloning this repository and doing a standard cmake build should be sufficient. I'd also be happy to do a build and zip up the result, but building from source should be very straightforward. The relevant command to try is in the original post (adjust the build location accordingly, of course). In my tests it deterministically failed when run in a Lustre directory, but passed on all other file systems I've tried (ext4, btrfs, gpfs, tmpfs). Also, I was unable to get a Lustre dev setup running, so I did not try the suggestion above.
Ah, 2.7.1.11, hm, that's a strange version number; I guess that just means 2.7.1 or something close to it. (`cat /proc/fs/lustre/version` should be good enough.)
The filesystems you tried are all local. Did you try any other network filesystems, like, say, nfs4 (should be the easiest to set up)?
I am going to try this on my local lustre setup and see what happens. What's the distro (I only care due to the kernel), rhel6? rhel7? ubuntu of some sort?
As far as I'm aware gpfs is distributed. The distribution is Cray Linux, with the kernel based on 3.12 as far as I can tell:
```
kfischer@cori11:~> cat /proc/fs/lustre/version
lustre: 2.7.1.11
kernel: patchless_client
build:  2.7.1.11-trunk-1.0600.f2563c6.3.2-abuild-lustre-filesystem.git@f2563c6-2016-10-26-20:44
kfischer@cori11:~> uname -r
3.12.60-52.57.1.11767.0.PTF.996988-default
```
Ah, sles12 for Cray, I think. Yes, gpfs is distributed, I just missed it in the list. Thanks. I'll try it shortly to see what's going on here.
Also, Cray explains why the version is so strange. They roll in a bunch of their own patches, and their tree sometimes has significant departures from mainline.
Hm, it does seem to be pulling in a bunch of stuff that I don't have on my test nodes. Can I have just the 64-bit binary that would run on rhel7, please?
Just sticking my oar in so I can get the binary as well... I'm on the Cray Lustre side. (Green is right about the version info ;) )
Also, if you wanted to report this to the Cray site staff so they can open a bug, that would be helpful too... (Even if Green or I figure out the problem and create a patch without you going through Cray, you'll need that bug so we can get the fix installed on Cori.)
Ok, here's the build directory (built on Cori, so it's the same binary I've been using for tests): http://anubis.juliacomputing.io:8844/rr-build.tar.gz. Please let me know if it doesn't work (e.g. due to libc version, etc), in which case I'll spin up a CentOS machine and build one there.
Also, I may have been wrong about gpfs. It appears the compute nodes on Cori use Cray DataWarp for the home directory, so while the backing file system is gpfs, there's some Cray magic in there as well. I also tried running the test on the login node's home directory, which does appear to be pure gpfs (which is why I thought it would be so on the compute nodes as well), and I was able to reproduce the same behavior I saw with Lustre, so this may be a more general problem. My apologies for the incorrect information.
Just out of curiosity, did the problem happen with GPFS-via-DataWarp?
Trying to execute it:

```
[root@cent7c01 bin]# ./rr record flock
rr: Saving execution to trace directory `/root/.local/share/rr/flock-2'.
[FATAL /global/homes/k/kfischer/rr/src/PerfCounters.cc:261:start_counter() errno: ENOENT] Unable to open performance counter with 'perf_event_open'; are perf events enabled? Try 'perf record'.
```

`perf record` works on this system. I don't have the referenced source, so I can't easily dig further.
> Just out of curiosity, did the problem happen with GPFS-via-DataWarp?
No, that was fine; plain gpfs did appear to have the problem, however.
> Perf record works on this system. Don't have referenced source, so can't easily dig further.
This was built from unmodified master, so the sources are the same as those in this repository. What's in `/proc/cpuinfo`? Any virtualization techniques in use (some don't preserve the performance counters we need)? Also, can you check which counter failed by attaching gdb and getting me a backtrace?
CentOS 7, VMWare ESXi.
I'll get a backtrace momentarily.
Hmm, I thought VMWare was usually fine. Try `perf list | grep "Hardware event"`?
```
[root@cent7c01 ~]# perf list | grep "Hardware event"
  ref-cycles                                        [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend     [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend   [Hardware event]
```
GDB:
```
Starting program: /shared/paf/rr-build/bin/./rr record ./flock
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff66b9700 (LWP 2591)]
[New Thread 0x7ffff5bb7700 (LWP 2592)]
[New Thread 0x7ffff4db5700 (LWP 2593)]
[New Thread 0x7fffeffff700 (LWP 2594)]
[New Thread 0x7fffef7fe700 (LWP 2595)]
[New Thread 0x7fffeeffd700 (LWP 2596)]
[New Thread 0x7fffee7fc700 (LWP 2597)]
[New Thread 0x7fffedffb700 (LWP 2598)]
[New Thread 0x7fffed7fa700 (LWP 2599)]
rr: Saving execution to trace directory `/root/.local/share/rr/flock-8'.
Detaching after fork from child process 2600.
[FATAL /global/homes/k/kfischer/rr/src/PerfCounters.cc:261:start_counter() errno: ENOENT] Unable to open performance counter with 'perf_event_open'; are perf events enabled? Try 'perf record'.

Program received signal SIGABRT, Aborted.
0x00007ffff69f05f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56        return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
    __in_chrg=<optimized out>) at /global/homes/k/kfischer/rr/src/log.cc:264
    attr=0xa4d460 <rr::cycles_attr>)
    at /global/homes/k/kfischer/rr/src/PerfCounters.cc:262
    at /global/homes/k/kfischer/rr/src/PerfCounters.cc:346
    how=rr::RESUME_SYSCALL, wait_how=rr::RESUME_NONBLOCKING, tick_period=60253, sig=0)
    at /global/homes/k/kfischer/rr/src/Task.cc:938
    step_state=...) at /global/homes/k/kfischer/rr/src/RecordSession.cc:581
    at /global/homes/k/kfischer/rr/src/RecordSession.cc:1951
    args=std::vector of length 1, capacity 2 = {...}, flags=...)
    at /global/homes/k/kfischer/rr/src/RecordCommand.cc:314
    this=0xa4d610 <rr::RecordCommand::singleton>,
    args=std::vector of length 1, capacity 2 = {...})
    at /global/homes/k/kfischer/rr/src/RecordCommand.cc:381
    at /global/homes/k/kfischer/rr/src/main.cc:270
```
Thanks. The `perf list` output explains the problem. We need the retired branch counter, which does not seem to be available. Odd; I thought perf counters were usually pretty much an all-or-nothing deal when it comes to virtualization. Perhaps an old VMWare version? Any chance you could try on bare metal?
*grumble, grumble* Yes, it's a version issue. I'll spare you the boring details (I just spent a few minutes trying to enable the feature).
I can try it on real hardware, but might have glibc issues again...
On hardware (Cray SLES 12, probably not too different from Cori):
```
./rr record ./flock
rr: Saving execution to trace directory `/root/.local/share/rr/flock-0'.
[FATAL /global/homes/k/kfischer/rr/src/AddressSpace.cc:286:map_rr_page() errno: SUCCESS] (task 23650 (rec:23650) at time 14)
 -> Assertion `child_fd == -EACCES' failed to hold. Unexpected error mapping rr_page
Launch gdb with
  gdb '-l' '10000' '-ex' 'target extended-remote :23650' /cray/css/u18/paf/shared/rr/flock
```

It just sits there...
Tried gdb as suggested:

```
Reading symbols from /cray/css/u18/paf/shared/rr/flock...done.
Remote debugging using :32227
warning: limiting remote suggested packet size (17073526 bytes) to 16384
Remote connection closed
(gdb) quit
```

Output of rr:

```
rr: /global/homes/k/kfischer/rr/src/GdbConnection.cc:540: std::string rr::read_target_desc(const char*): Assertion `f' failed.
Aborted
```
Note that I am able to execute the 'flock' binary correctly by itself.
@Keno might it be easier for you to create a standalone testcase? Shouldn't be that hard, probably doesn't even need ptrace, just have one process call flock while another process sends it an ignored signal?
I am getting:
```
rr: Saving execution to trace directory `/root/.local/share/rr/flock-0'.
[FATAL /global/homes/k/kfischer/rr/src/PerfCounters.cc:156:get_cpu_microarch() errno: ENOTTY] CPU 0x620 unknown.
Aborted
```
@rocallahan Yes, I had started on that, but I didn't quite manage to reproduce it. I was hoping at this point we'd be robust enough to be able to have people run it ;) - wishful thinking I guess. I'll try again to do a standalone test case.
When you say "ignored signal", do you mean "blocked in the signal mask", or ...?
Just a signal with no handler and SIG_IGN disposition.
But apparently the problem is more complicated than that. Maybe you need ptrace as well.
Huh. I wasn't familiar with SIG_IGN... Lustre (and perhaps GPFS too) does interruptible waits in some places, and (usually) returns -ERESTARTSYS, making the interrupted call restartable. Do you have SA_RESTART set? If not, then you'll get -EINTR on interrupts. (Can you even set that when you're not writing your own handler? I would think so)
To clarify: When SA_RESTART is set, returning -ERESTARTSYS from Lustre causes the kernel to restart the syscall in question. If SA_RESTART is not set, then the kernel translates that to -EINTR and returns it back to userspace. (In certain cases Lustre returns -EINTR directly, causing SA_RESTART to be ignored)
Yes, I would think that if you get any sort of signal during a blocking system call, you might get EINTR and need to handle it one way or another. Though flock is kind of a bad case, since it restarts the whole thing, and with RPCs potentially taking a long time it might never complete, as a result spinning in the retry loop forever (this shouldn't even be Lustre-specific, as long as the wait is interruptible).
SA_RESTART only applies when you set a handler. Ignored signals should not cause a syscall to return EINTR.
Yes, I guess ignored signals really shouldn't, unless they can't be ignored or something. I wonder what signal it is that's actually getting through.
Hm. @rocallahan I'm digging through kernel source to look at the restart handling wrt SIG_IGN. I mean, if you interrupt a syscall, which needs to be possible, then it has to be restarted. We can't just wait uninterruptibly for network things which may not complete. (That's probably a difference between distributed file systems and local ones. I'm betting most local ones don't wait interruptibly.)
So, ignored is not the same thing as blocked. (Why not just block...?)
rr arranges for an ignored SIGSTKFLT to be delivered pretty much every time a syscall blocks in the kernel. This does not cause any other syscalls to return EINTR (if it did, we'd be hitting this here bug all the time) and it doesn't cause EINTR for flock in common filesystems like ext4 or btrfs (or tests would be failing on those filesystems, which are tested often). That's why we suspect filesystem-specific kernel issues here.
We can't block the signal because rr needs to get a blocking notification (via ptrace) that the signal is being delivered. rr responds to that notification by resuming execution of the tracee, letting it complete the syscall (and meanwhile scheduling another tracee).
> This does not cause any other syscalls to return EINTR (if it did, we'd be hitting this here bug all the time)
Correction: it doesn't cause EINTR in the set of syscalls that we have fast-paths for. And we do know that `epoll_wait` causes unwanted EINTRs.
I think the "filesystem-specific" issue here is that we're interruptible, so we're getting interrupted. The syscall is getting interrupted. I suspect that's not true of btrfs or ext4, since they wait uninterruptibly, basically working on the assumption that nothing they're waiting for will fail to return in a reasonable time frame.
Being network file systems, Lustre and GPFS can't do this realistically. (There's a bit more to it, but that's the basic thing.)
Still poking around in the kernel to try to understand stuff here better...
> I suspect that's not true of btrfs or ext4, since they wait uninterruptibly, basically working on the assumption that nothing they're waiting for will fail to return in a reasonable time frame.
That may be true, but other interruptible syscalls (e.g. `read` on a pipe or socket) wait in such a way that they don't return EINTR for ignored signals.
Hm, all right.
My digging around strongly suggests that in the case of SIG_IGN, the signal shouldn't truly be delivered at all. ptrace_signal is called to let it look at the attempted delivery, but after that, we don't call the actual handling code, which is where ERESTARTSYS is handled. (I think... It gets very hairy in here.)
(`do_signal` → `get_signal` → `get_signal_to_deliver`, and then `handle_signal`)
Since we can't reproduce the problem yet, would @Keno be able to try (just for our information) setting sa_flags to SA_RESTART? (Since there is a struct sigaction when you're setting SIG_IGN.)
OK, try the following maybe: https://gist.github.com/Keno/f257142b20d212b94182058ecee363af

Results for me are as follows:

- tmpfs/ext4/btrfs: ok
- GPFS-via-DataWarp: ok
- GPFS: reproduces the failure reported here
- Lustre: hangs
So not quite the same behavior as in the original test case on lustre, but close enough to reproduce the problem?
Well, on Lustre, I'd call that hang "expected". When waiting, Lustre sometimes blocks everything and otherwise it blocks all signals it considers non-fatal. (The hang is the parent signalling the child (while the child is waiting for an fcntl call that will not complete, because the parent holds a conflicting lock) and then waiting for the child to change state. This state change doesn't happen because the signal is blocked.)
If you switch to a signal that Lustre considers fatal and isn't blocking, the child will wake up (and die, since there's no handling set up).
So, a different bit of weirdness in signals and Lustre. Not the same one we see with rr.
Ah, after parsing through the various asserts, etc, it looks like sending SIGINT to the child will get the -1 and errno=EINTR behavior you mentioned.
Setting up a sigaction with SIG_IGN doesn't appear to do anything. So I think I've reproduced the problem(?)...
OK, yes. When we're not ptraced, the child process exits when sent SIGINT, as I'd expect, and hangs when sent '30', also more or less as expected (since it's blocked).
When I set SIG_IGN as the action for SIGINT (on the child), it hangs. No -1 and EINTR returned, syscall doesn't exit. This is when not ptraced.
When ptraced, I get -1 and EINTR (on SIGINT), whether or not I've set up the sigaction. I.e., it seems like the sigaction is getting ignored when we're ptraced.
So the question then, I suppose, is what's different about EXT4 and friends? I suspect it's something in the handling of waiting in TASK_UNINTERRUPTIBLE...
Because they must trigger the state change the parent is waiting for after sending the signal, without actually exiting the syscall. I'm guessing it's "successful delivery" when a signal is given to a process waiting in TASK_UNINTERRUPTIBLE (which doesn't care that you signalled it).
So... Lustre is interruptible. That's by design. It's getting interrupted...
And when ptraced, the sigaction seems to be ignored. That doesn't feel like a Lustre bug.
Thoughts?
Getting EINTR in this testcase is a bug, because the ptracer never delivers the signal to the tracee. I don't know which part of the kernel to blame.
> hangs when sent '30', also more or less as expected (since it's blocked).
Blocked by what? User-space hasn't blocked it.
I don't think we should invest too much in trying to determine whether this is a kernel bug or not. We generally work around kernel bugs anyway, and I think it would be effective to simply detect if certain filesystems are mounted and disable buffering of flocks in those cases, which should have no real downsides given those syscalls are already expensive on these filesystems.
> Blocked by what? User-space hasn't blocked it.
Blocked by Lustre.
And I'm certainly happy to let it go with a workaround, if that works for rr. (My area is Lustre and the kernel, so I'm inclined to dig deep.) I may keep investigating in the interests of fixing the ptrace/Lustre interaction, but I certainly don't have to do it here.
If you're willing to keep looking into the lustre/kernel side of things, I'd certainly be happy to keep helping out any way I can. We should of course still investigate the possibility of a workaround, to fix this for the immediate future (and even if we find a kernel fix to fix it for older versions).
Works fine without rr, as well as with `-n`. Mount options are `rw,flock,lazystatfs` in case it makes a difference.