golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.7k stars 17.62k forks source link

os/exec: Start sometimes creates processes with signals blocked with go1.5 on OSX #13164

Closed gabrielrussell closed 8 years ago

gabrielrussell commented 8 years ago

I can recreate this by running a test of our application in a loop. Eventually a process is started that has many of its signals blocked. This ends up causing a error because we're running a process that expects to be able to receive signals.

My attempts to create a smaller program that reproduces this has only failed. I would assume that some bit of the complexity of the application is triggering the bug.

I'll be happy to provide any information that I can.

Gabriel Russell

ianlancetaylor commented 8 years ago

What does the loop that is running the program look like? Is this a loop written in Go or in shell or something else?

How are you testing whether signals are blocked?

gabrielrussell commented 8 years ago

It's a shell script that runs a "go test ..." in a loop. That said, we don't only see this in testing, we also see it on our production systems.

Its a c++ application doing daemonizing double fork where the grandparent doesn't exit until the grandchild sends a signal to the grandparent indicating that it's ready to do work.

We've tested that we can fix the problem by adding code to our c++ application to unblock signals before doing anything else, but this is not a whole solution for us because the c++ application is already widely distributed.

ianlancetaylor commented 8 years ago

I'm sorry, I don't grasp the details here, and the details matter. I see now that your original issue title refers to exec.Start, implying that you are using the os/exec package. You say that you have a shell script running "go test" in a loop.

Is that Go program using os/exec?

How exactly are you starting the program whose signals are blocked?

What does the C++ daemonizing application have to do with this? Is that the test, or is that the production system?

Which signal(s) exactly are you using?

Do you have a small reproduction case that you could share?

Thanks.

gabrielrussell commented 8 years ago

We have an application written in go who's main responsibility is to run another program that's written in c++.

The c++ program needs to to be executed without signals blocked.

Sometimes the go program executes the c++ program with it's signals blocked.

We have tests for our go program, and some of them execute the c++ program.

If we execute one of those tests over and over, via a bash while loop, we eventually see the c++ program get executed with signals blocked.

We need to send and receive SIGUSR2

Our attempts to create a small reproduction case have failed.

Thanks.

On Thu, Nov 5, 2015 at 5:29 PM, Ian Lance Taylor notifications@github.com wrote:

I'm sorry, I don't grasp the details here, and the details matter. I see now that your original issue title refers to exec.Start, implying that you are using the os/exec package. You say that you have a shell script running "go test" in a loop.

Is that Go program using os/exec?

How exactly are you starting the program whose signals are blocked?

What does the C++ daemonizing application have to do with this? Is that the test, or is that the production system?

Which signal(s) exactly are you using?

Do you have a small reproduction case that you could share?

Thanks.

— Reply to this email directly or view it on GitHub https://github.com/golang/go/issues/13164#issuecomment-154215099.

ianlancetaylor commented 8 years ago

Thanks for the details. Does your Go program, the one that executes the C++ program, use cgo?

ianlancetaylor commented 8 years ago

Also, you said this was Darwin, but what is your GOARCH value?

gabrielrussell commented 8 years ago

Yes, there is some cgo.

GOARCH="amd64"

ianlancetaylor commented 8 years ago

I assume your cgo code never calls sigaction or sigprocmask.

Sorry, I realized that may have misunderstood something. When you say that the signals are blocked, what precisely do you mean? Do you mean that the signal is in the blocked signal mask as set by sigprocmask or pthread_sigmask? Or do you mean that the signal is ignored because sigaction was used to set the signal handler to SIG_IGN?

gabrielrussell commented 8 years ago

I assume that they're blocked, ala sigprocmask as opposed SIG_IGN'd.

The piece of code that when inserted into the c++ code masks the problem is:

sigset_t unblockSignalMask;
sigemptyset(&unblockSignalMask);
sigprocmask(SIG_SETMASK, &unblockSignalMask, NULL);

And, I don't think that unblocking a signal should unignore it, but I may be wrong.

Also, we disabled the cgo function calls and re-ran our tests without any change in outcome.

ianlancetaylor commented 8 years ago

I agree that they must be blocked if calling sigprocmask fixes it.

I asked about cgo because a program that uses cgo uses a different code path when creating a new thread. When creating a new thread it is necessary to temporarily block all signals, so that the new thread does not receive a signal before it has registered an alternate signal stack. However, the code uses pthread_sigmask, which only affects the signal mask for the calling thread. My assumption is that execve uses the signal mask of the thread that calls execve, or the process-wide signal mask. Anyhow, that is probably not the issue.

Does your program use the os/signal package at all?

sielaq commented 8 years ago

I'm having similar problem: Linux/amd64, exec.Run() ( or Start() ) I have created 2 app to recreate this: Go: https://gist.github.com/sielaq/5120a4f06cefc2138ed9 C: https://gist.github.com/sielaq/9867be2d1b65ffec658c

Go app starts C app (that kills others C app currently running). When app compiled with go 1.4 -> all C app receive SIGTERM from each other When app compiled with go 1.5 -> sometimes C app cannot trap SIGTERM

to make it easier to reproduce it: bash -c "while true; do ./app; done" try to break it ^C multiple times -> as an effect you should see that C apps start to cumulate in process list.

ianlancetaylor commented 8 years ago

Thanks for the sample code. I suspect this is an unexpected consequence of https://golang.org/cl/10173.

gabrielrussell commented 8 years ago

An interesting thing about that code is that the bash loop is key. It seams that most instances of app have no problems spawning processes and can run seemingly forever without creating a blocked-signal process, but some of the instances of app spawn off many blocked-signal subprocesses.

You have to run app many times before you get a bad one.

sethvargo commented 8 years ago

Hey @ianlancetaylor

Thank you for working on this :smile:. Would it be best to compile applications with Go 1.4 until this is fixed, or is there something I can do code-wise to mitigate this problem? (https://github.com/hashicorp/consul-template/issues/442)

ianlancetaylor commented 8 years ago

I'm not sure I fully understand the problem, and I don't have any suggested mitigation within your Go code.

Of course you could fix it by changing your C program to call sigprogmask, or by interposing a tiny C program that calls sigprocmask and then invokes the real C program.

sethvargo commented 8 years ago

Hi @ianlancetaylor

Sorry for the lack of context :smile:. I have a program that allows the user to specify an arbitrary command to be run when data from an external system changes. In short, it does this:

// Create an invoke the command
cmd := exec.Command(shell, flag, command)
cmd.Stdout = r.outStream
cmd.Stderr = r.errStream
cmd.Env = cmdEnv
if err := cmd.Start(); err != nil {
  return err
}

done := make(chan error, 1)
go func() {
  done <- cmd.Wait()
}()

// ...

https://github.com/hashicorp/consul-template/blob/8a8a9301631244754792e5bf979d28ec6e3a9900/runner.go#L600-L612

In this case, the user is calling haproxy. This worked fine in Go 1.4, but is broken in Go 1.5. I'm just wondering if there's something I can change in my code to work around this behavior instead of telling users to refactor the arbitrary scripts they run.

ianlancetaylor commented 8 years ago

@sethvargo I'm sorry, I understand the question, but I can only repeat my last response: I don't know, except that you can mitigate by using a tiny C helper.

ianlancetaylor commented 8 years ago

OK, I think I've got an idea as to what is happening.

The call sequence is

schedule: called by locked m
stoplockedm
handoffp
startm
newm: allocates m, calls msigsave to save signal mask into m, then calls
newosproc: starts new thread running
mstart
mstart1
minit: sets signal mask based on mask saved by newm

This call sequence gives us a new m that has a signal mask based on the one in the locked M that called schedule. Normally a locked m has an empty signal mask just like every other M, but there is an exception: the goroutine started by ensureSigM. That goroutine, which is locked to a thread, has a signal mask that blocks everything except the signals passed to signal.Notify. So when that goroutine manages to enter schedule at a time when no other m is available, we can get an m with a non-empty signal mask.

The os/exec package doesn't do anything about this, so the signal mask of the m that calls os/exec.Start will provide the signal mask of the process.

I was able to occasionally recreate the problem with this trivial C program in ./foo5:

#include <stdio.h>
#include <signal.h>

int main() {
    sigset_t mask;

    sigemptyset(&mask);
    if (sigprocmask(SIG_SETMASK, NULL, &mask) < 0) {
        perror("sigprocmask");
        return 1;
    }
    if (sigismember(&mask, SIGINT)) {
        fprintf(stderr, "SIGINT is blocked\n");
        return 1;
    }
    return 0;
}

running this Go program that calls the C program:

package main

import (
    "fmt"
    "os"
    "os/exec"
    "os/signal"
    "syscall"
    "sync"
)

func main() {
    c := make(chan os.Signal, 1)
    signal.Notify(c, syscall.SIGTERM)
    go func() {
        for range c {
            fmt.Println("SIGTERM")
        }
    }()

    var wg sync.WaitGroup
    for i := 0; i < 10; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for j := 0; j < 1000; j++ {
                syscall.Kill(os.Getpid(), syscall.SIGTERM)
                go func(iter int) {
                    cmd := exec.Command("./foo5")
                    cmd.Stdout = os.Stdout
                    cmd.Stderr = os.Stderr
                    if err := cmd.Run(); err != nil {
                        fmt.Printf("iteration %d: %v\n", iter, err)
                        os.Exit(1)
                    }
                }(j)
            }
        }()
    }
    wg.Wait()
}
gopherbot commented 8 years ago

CL https://golang.org/cl/18064 mentions this issue.

sielaq commented 8 years ago

@ianlancetaylor well spotted! :trophy:

viciious commented 8 years ago

@ianlancetaylor This issue isn't resolved in 1.5.3 despite the milestone saying so and it seems like it affects Linux as well. 70c9a81 only hit the master branch.

ianlancetaylor commented 8 years ago

Thanks. I marked it 1.5.3 to mean that it should go into a 1.5.3 release if there was one, but then 1.5.3 became a security release with only security fixes. I've changed the milestone to 1.5.4.

viciious commented 8 years ago

@ianlancetaylor Thanks!

clementino commented 8 years ago

I have a similar issue with linux/go1.5.3:

package main

import (
    "os"
    "os/exec"
    "os/signal"
    "syscall"
)

func main() {
     signal.Notify(make(chan os.Signal, 1), syscall.SIGUSR1)
     signal.Reset()
     cmd := exec.Command("bash")
     cmd.Stdin = os.Stdin
     cmd.Stdout = os.Stdout
     cmd.Stderr = os.Stderr
     cmd.Run()
}

Sometimes the bash process is normal (only ignores signal 17, 18, 23 and 28), but some other times it is set to ignore all this signals: 1-3, 6, 12-15, 17, 18, 20-26 and 28-64. If I just import _ "os/signal" without calling a Notify it doesn't happen, but any Notify seems to make it happen, even if I Reset afterward. The frequency varies between my two machines, about half of the times on one and much less frequently on the other.

ianlancetaylor commented 8 years ago

@clementino This issue is closed. The problem should be fixed in the 1.6 release. If not, please open a new issue. If you want to discuss this problem or ask questions, please see https://golang.org/wiki/Questions . Thanks.