jacobsa / fuse

A Go package for implementing a FUSE file system.
Apache License 2.0
478 stars 106 forks source link

Add config option to ignore signal interrupts. #153

Open bjornleffler opened 6 months ago

bjornleffler commented 6 months ago

Being able to ignore SIGURG fixes a number of known issues in GCS Fuse, as well as issue #122 for jacobsa/fuse.

stapelberg commented 6 months ago

I’m not convinced that this is the right fix: if you just don’t cancel the contexts, pending operations will pile up over time.

I’m also not sure I fully understand the actual problem yet, as cancelling the operation should be fine, and the next FUSE request after the interruption should just start another operation, no? Do you have a standalone reproducer for this issue? (I asked for one in issue #122 as well and never got one.)

bjornleffler commented 6 months ago

I agree that this this isn't the perfect fix. That would be to stop all SIGURG signals being delivered in the first place, but I couldn't figure out how to do that:

Cancelling the context is not fine, as it causes user visible application failures. A simple and consistent way to reproduce this is to run "git clone " in a fuse managed directory.

I don't fully understand the issue either. This seems like a Golang thing, not interupts coming from the kernel. What's even more surprising is that the kernel delivers a large number of SIGURG signals. About 50-100 signals for a git checkout with 11 files. Only a handful of those signals result in errors, possibly because the operations had already completed?

bjornleffler commented 6 months ago

The SIGURG signals are quite easy to trigger. This code sample triggers SIGURG every so often.

package main

import ( "log" "os" "os/signal" "syscall" "time" )

func hello() { log.Printf("Hello 1") }

func main() { signalChan := make(chan os.Signal, 1) signal.Notify(signalChan, syscall.SIGURG) go func () { sig := <-signalChan log.Printf("Caught signal! %v", sig) } () go hello() time.Sleep(1 * time.Second) log.Printf("Hello 2") }

bjornleffler commented 6 months ago

Given how:

  1. Golang binaries seems to send lots of SIGURG signals to themselves.
  2. This is user facing breaking issue.
  3. The linux kernel says it's fine to ignore signals: https://goo.gl/H55Dnr "The userspace filesystem may ignore the INTERRUPT requests entirely, ..."

I reckon that this is the easiest way to adress a user facing issue. I deliberately left this as an opt-in feature, so we can test it and see if it causes pending operations to pile up over time.

stapelberg commented 6 months ago

That would be to stop all SIGURG signals being delivered in the first place, but I couldn't figure out how to do that:

I think the best way is to run your program with the environment variable GODEBUG="asyncpreemptoff=1"

Cancelling the context is not fine, as it causes user visible application failures.

Okay, but why? That’s the bit I don’t understand. The kernel’s INTERRUPT messages are supposed to signal that the client is no longer asking for the request to be answered. Is that incorrect? If it’s correct, where does the failure come from?

A simple and consistent way to reproduce this is to run "git clone " in a fuse managed directory.

And what are the steps to do that? Please, supply steps I can run on a clean checkout of jacobsa/fuse without anything else. In particular, I don’t want to deal with Google Cloud accounts, projects, setup, etc. — I think narrowing down this problem to a minimal reproducer will be very valuable in understanding this issue.

jacobsa commented 6 months ago

For what it's worth, I agree with @stapelberg here: it seems like this change needs to come with a detailed discussion of the root cause and why this is the appropriate place to address it. I haven't seen anything in these discussions (or the Google-internal ones) that indicates anybody has tried deeply to understand the root cause.

With regard to the quote from the kernel docs ("The userspace filesystem may ignore the INTERRUPT requests entirely"), I believe this is saying "a file system is free to give a poor user experience by allowing operations to be uninterruptable and potentially continue forever even when the user doesn't want them to continue". Doing that to work around this problem seems like throwing out the baby with the bath water, especially when it seems like it should be possible to deal with this another way or at least understand why it must be dealt with this way.

Below is a similar reply I gave on an internal discussion, with some potential avenues for investigation.

This is of course your call, not mine, but I feel like some more root causing is warranted here. It seems like gcsfuse is doing the correct thing according to the kernel documentation and according to posix: an interrupted file system request is returning EINTR to indicate that it has been interrupted. Changing that, especially with a magic tuning parameter, probably should not be done lightly. I.e. not without fully understanding the problem.

So then the remaining mystery is why this comes up with these programs in particular and why with gcsfuse in particular. Personally I would be trying to answer questions like the following:

  • What is the actual root cause? What signal is being delivered, and why?
  • Why doesn't git display this problem with local file system drivers?
  • Do local file system drivers really not return EINTR on interrupt?
  • What do other fuse server implementations that do support interrupts do?
  • Why doesn't gcsfuse display this problem with other applications?
  • What does posix say git is supposed to do with EINTR?
  • What does git actually do with EINTR?
  • If git is not handling EINTR correctly (a common bug), can we just fix git?
  • Is there some out of band way to find the signal number?
bjornleffler commented 6 months ago

I did a bit more digging today and I'm more confused than before. I still don't know why the operations are interrupted. The interrupt problem seems unrelated to the SIGURG signals. Go routine preemptions are a red herring.

What we do know is that the AWS and Azure equivalents to GCSFuse don't intercept signals in the fuse code and therefore don't experience the user facing problems.

Not to self: The SIGURG signals stop if you set GODEBUG="asyncpreemptoff=1" from the shell, but it doesn't work if you try to use os.Setenv("GODEBUG", "asyncpreemptoff=1") from inside the binary.