golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.01k stars 17.54k forks source link

os: (*Cmd).Wait hangs on macOS when setctty is used after commit b15c399a3 #61779

Open FiloSottile opened 1 year ago

FiloSottile commented 1 year ago

A test that involves setting the command's TTY with the code below, and then calling (*Cmd).Wait, started timing out on macOS in Go 1.20.

func SetCtty(cmd *exec.Cmd, tty *os.File) {
    cmd.SysProcAttr = &syscall.SysProcAttr{
        Setctty: true,
        Setsid:  true,
        Ctty:    3,
    }
    cmd.ExtraFiles = []*os.File{tty}
}

I bisected it down to b15c399a36a38509ae56dd69670974566f7b0d52.

commit b15c399a36a38509ae56dd69670974566f7b0d52
Author: Yuval Pavel Zholkover <paulzhol@gmail.com>
Date:   Sat Jul 30 20:41:58 2022 +0300

    os: only add file descriptors which are set to non-blocking mode to the netpoller

    Either ones where kind == kindNonBlock or those we've successfully called syscall.SetNonblock() on.
    Restore blocking behavior if we detect an error registering with the netpoller and our flow was
    successful in setting the inital syscall.SetNonblock().

    Update #54100

    Change-Id: I08934e4107c7fb36c15a7ca23ac880490b4df235
    Reviewed-on: https://go-review.googlesource.com/c/go/+/420334
    TryBot-Result: Gopher Robot <gobot@golang.org>
    Reviewed-by: Dmitri Goutnik <dgoutnik@gmail.com>
    Reviewed-by: Ian Lance Taylor <iant@google.com>
    Run-TryBot: Yuval Pavel Zholkover <paulzhol@gmail.com>
    Reviewed-by: Than McIntosh <thanm@google.com>
    Auto-Submit: Ian Lance Taylor <iant@golang.org>

 src/os/file_unix.go | 32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)

You can see the failing tests in the CI of rogpeppe/go-internal#172 (https://github.com/rogpeppe/go-internal/actions/runs/5005273594?pr=172) or in age's tests (https://github.com/FiloSottile/age/actions/runs/5633140339/job/15261730363).

/cc @paulzhol @ianlancetaylor @dmgk #54100

ianlancetaylor commented 1 year ago

I'm sure I'm doing something dumb but I can't find the sources for the TestScripts/pty test. I cloned https://github.com/FiloSottile/go-internal but I don't see that test.

If the cause is https://go.dev/cl/420334 then my first guess would be that Darwin kqueue does not permit adding a pty. But I don't understand what the pty is being used for. It looks like the subcommand stdin, stdout, and stderr will all be pipes.

mvdan commented 1 year ago

@ianlancetaylor that testdata file is added in the PR https://github.com/rogpeppe/go-internal/pull/172/files, so you can find it at the branch https://github.com/FiloSottile/go-internal/tree/filippo/pty.

FiloSottile commented 1 year ago

But I don't understand what the pty is being used for. It looks like the subcommand stdin, stdout, and stderr will all be pipes.

Yes, fds 0, 1, and 2 will be pipes, but the ctty, passed as an extra fd and accessed as /dev/tty by the subcommand, will be a pty.

(This is testing the situation where stdin and stdout/err are redirected, but the command can use the tty to prompt the user.)

ianlancetaylor commented 1 year ago

I don't fully understand what is happening.

It's clear that macOS does not permit opening /dev/tty and passing the resulting descriptor to kevent.

Before CL 420334 we would first try to add the descriptor to kqueue, and if that succeeded put the descriptor into non-blocking mode. After that CL we would first put the descriptor into non-blocking mode, and then try to add it to kqueue. We made that change because of #54100, in which we could sometimes add a descriptor to kqueue and then fail to make it non-blocking, which is bad. So now we first set the descriptor to non-blocking, and then try to add it to kqueue. If we fail to add it to kqueue, we put the descriptor back into blocking mode.

That is the sequence that happens for macOS with /dev/tty. It seems fine. But for some reason that I don't understand, opening /dev/tty, setting it to non-blocking mode, then setting it back to blocking mode, causes an inexplicable problem: the program hangs when calling _exit. Unfortunately I can't run dtruss on the system I am using, so I don't know precisely which system call is hanging. I don't know whether this is a libsystem bug or a kernel bug.

It is also possible that the problem arises because we put /dev/ptmx or the /dev/ttysNNN device into non-blocking mode.

I have not been able to write my own test case for this, although I can see that the test described above fails. I'm not sure what is different, other than that I am using the internal/testpty package from the standard library.

I have a simple fix that fixes this issue. Maybe we should just go with it. I'm not sure.

gopherbot commented 1 year ago

Change https://go.dev/cl/517555 mentions this issue: os: don't add character devices to kqueue on darwin

paulzhol commented 1 year ago

I don't have access to the hardware, but darwin has a test suite which sets the returned openpty() fd's to O_NONBLOCK, adds them to various kqueues and does read/write calls on them.

I'd love to see some system call traces showing whether the pair of opened ptys in the parent fail to be added to kqueue or it is the opening of /dev/tty in the child (which could be not having a controlling terminal anyway) which fails.

FreeBSD man page on tty devices has this section regarding TIOCNOTTY/TIOCSCTTY and /dev/tty:

The current system does not allocate a controlling terminal to a process on an open() call: there is a specific ioctl called TIOCSCTTY to make a terminal the controlling terminal. In addition, a program can fork() and call the setsid() sys- tem call which will place the process into its own session - which has the effect of disassociating it from the control- ling terminal. This is the new and preferred method for pro- grams to lose their controlling terminal.

However, environmental restrictions may prohibit the process from being able to fork() and call the setsid() system call to disassociate it from the controlling terminal. In this case, it must use TIOCNOTTY.

Maybe the child OSX process needs to set TIOCSCTTY on /dev/tty as well?

This does not explain the blocking of _exit of course.

bcmills commented 10 months ago

The next time someone sees a failure mode matching this issue, could you paste a goroutine dump from the timed-out process directly into the GitHub issue? The CI links from the original post seem to have expired.

FiloSottile commented 3 months ago

This is still an issue in Go 1.22. https://github.com/FiloSottile/age/tree/tmp/61779 has tests that fail on macOS.

$ go test ./cmd/age -run TestScript/scrypt -timeout 5s
panic: test timed out after 5s
running tests:
    TestScript/scrypt (5s)

goroutine 51 [running]:
testing.(*M).startAlarm.func1()
    /opt/homebrew/Cellar/go/1.22.0/libexec/src/testing/testing.go:2366 +0x30c
created by time.goFunc
    /opt/homebrew/Cellar/go/1.22.0/libexec/src/time/sleep.go:177 +0x38

goroutine 1 [chan receive]:
testing.(*T).Run(0x1400012e820, {0x1043ed148?, 0x140001697d8?}, 0x1044ba1c0)
    /opt/homebrew/Cellar/go/1.22.0/libexec/src/testing/testing.go:1750 +0x32c
testing.runTests.func1(0x1400012e820)
    /opt/homebrew/Cellar/go/1.22.0/libexec/src/testing/testing.go:2161 +0x40
testing.tRunner(0x1400012e820, 0x140001698f8)
    /opt/homebrew/Cellar/go/1.22.0/libexec/src/testing/testing.go:1689 +0xec
testing.runTests(0x1400011c438, {0x1046009b0, 0x1, 0x1}, {0x140001699b8?, 0x10429fbbc?, 0x10460aee0?})
    /opt/homebrew/Cellar/go/1.22.0/libexec/src/testing/testing.go:2159 +0x3b0
testing.(*M).Run(0x14000120e60)
    /opt/homebrew/Cellar/go/1.22.0/libexec/src/testing/testing.go:2027 +0x5a4
github.com/rogpeppe/go-internal/testscript.RunMain({0x1044bbe58, 0x14000120e60}, 0x14000169e48)
    /Users/filippo/pkg/mod/github.com/rogpeppe/go-internal@v1.12.0/testscript/exe.go:100 +0x5ec
filippo.io/age/cmd/age.TestMain(0x14000120e60)
    /Users/filippo/src/filippo.io/age/cmd/age/age_test.go:17 +0xf4
main.main()
    _testmain.go:49 +0x170

goroutine 35 [chan receive]:
testing.tRunner.func1()
    /opt/homebrew/Cellar/go/1.22.0/libexec/src/testing/testing.go:1650 +0x43c
testing.tRunner(0x1400012e9c0, 0x1044ba1c0)
    /opt/homebrew/Cellar/go/1.22.0/libexec/src/testing/testing.go:1695 +0x128
created by testing.(*T).Run in goroutine 1
    /opt/homebrew/Cellar/go/1.22.0/libexec/src/testing/testing.go:1742 +0x318

goroutine 36 [syscall]:
syscall.syscall6(0x14000165708?, 0x104245cac?, 0x90000001010160?, 0x104976588?, 0x90?, 0x1046a0a68?, 0x90?)
    /opt/homebrew/Cellar/go/1.22.0/libexec/src/runtime/sys_darwin.go:45 +0x68
syscall.wait4(0x14000165758?, 0x1042d1c38?, 0x90?, 0x1044b0c60?)
    /opt/homebrew/Cellar/go/1.22.0/libexec/src/syscall/zsyscall_darwin_arm64.go:44 +0x4c
syscall.Wait4(0x140001657f8?, 0x14000165794, 0x0?, 0x10423d6c8?)
    /opt/homebrew/Cellar/go/1.22.0/libexec/src/syscall/syscall_bsd.go:144 +0x28
os.(*Process).wait(0x14000136750)
    /opt/homebrew/Cellar/go/1.22.0/libexec/src/os/exec_unix.go:43 +0x80
os.(*Process).Wait(...)
    /opt/homebrew/Cellar/go/1.22.0/libexec/src/os/exec.go:134
os/exec.(*Cmd).Wait(0x140001742c0)
    /opt/homebrew/Cellar/go/1.22.0/libexec/src/os/exec/exec.go:897 +0x38
github.com/rogpeppe/go-internal/testscript.waitOrStop({0x1044bd5b0, 0x14000178540}, 0x140001742c0, 0xee6608a)
    /Users/filippo/pkg/mod/github.com/rogpeppe/go-internal@v1.12.0/testscript/testscript.go:1103 +0xc4
github.com/rogpeppe/go-internal/testscript.(*TestScript).exec(0x1400017e488, {0x1043ebb86?, 0x0?}, {0x14000159090?, 0x14000060a68?, 0x104283cc0?})
    /Users/filippo/pkg/mod/github.com/rogpeppe/go-internal@v1.12.0/testscript/testscript.go:997 +0x4cc
github.com/rogpeppe/go-internal/testscript.(*TestScript).cmdExec(0x1400017e488, 0x0, {0x14000159080, 0x4, 0x4})
    /Users/filippo/pkg/mod/github.com/rogpeppe/go-internal@v1.12.0/testscript/cmd.go:258 +0x218
github.com/rogpeppe/go-internal/testscript.RunMain.func2(0x1400017e488, 0x0, {0x14000159050, 0x3, 0xb200000000000005?})
    /Users/filippo/pkg/mod/github.com/rogpeppe/go-internal@v1.12.0/testscript/exe.go:97 +0x168
github.com/rogpeppe/go-internal/testscript.(*TestScript).runLine.func2()
    /Users/filippo/pkg/mod/github.com/rogpeppe/go-internal@v1.12.0/testscript/testscript.go:710 +0x58
github.com/rogpeppe/go-internal/testscript.(*TestScript).callBuiltinCmd(0x104482160?, {0x14000112a50?, 0x140001bc06a?}, 0x3?)
    /Users/filippo/pkg/mod/github.com/rogpeppe/go-internal@v1.12.0/testscript/testscript.go:729 +0x58
github.com/rogpeppe/go-internal/testscript.(*TestScript).runLine(0x1400017e488, {0x140001bc06a, 0x12})
    /Users/filippo/pkg/mod/github.com/rogpeppe/go-internal@v1.12.0/testscript/testscript.go:709 +0x5f4
github.com/rogpeppe/go-internal/testscript.(*TestScript).run(0x1400017e488)
    /Users/filippo/pkg/mod/github.com/rogpeppe/go-internal@v1.12.0/testscript/testscript.go:606 +0x374
github.com/rogpeppe/go-internal/testscript.RunT.func1({0x1044be310, 0x1400012eb60})
    /Users/filippo/pkg/mod/github.com/rogpeppe/go-internal@v1.12.0/testscript/testscript.go:343 +0x24c
github.com/rogpeppe/go-internal/testscript.tshim.Run.func1(0x1400012eb60?)
    /Users/filippo/pkg/mod/github.com/rogpeppe/go-internal@v1.12.0/testscript/testscript.go:234 +0x30
testing.tRunner(0x1400012eb60, 0x140001187f0)
    /opt/homebrew/Cellar/go/1.22.0/libexec/src/testing/testing.go:1689 +0xec
created by testing.(*T).Run in goroutine 35
    /opt/homebrew/Cellar/go/1.22.0/libexec/src/testing/testing.go:1742 +0x318

goroutine 42 [chan send]:
github.com/rogpeppe/go-internal/testscript.waitOrStop.func1()
    /Users/filippo/pkg/mod/github.com/rogpeppe/go-internal@v1.12.0/testscript/testscript.go:1100 +0x1f8
created by github.com/rogpeppe/go-internal/testscript.waitOrStop in goroutine 36
    /Users/filippo/pkg/mod/github.com/rogpeppe/go-internal@v1.12.0/testscript/testscript.go:1057 +0xbc
FAIL    filippo.io/age/cmd/age  5.132s
FAIL
ianlancetaylor commented 3 months ago

Does anyone want to pick up https://go.dev/cl/517555 and complete it?

danielskowronski commented 2 months ago

Has it been tested on various iterations of Apple Silicon? I've observed some rare instances of different behaviour on M1 Max vs M2 Pro with the same macOS version (and stock SSH client), but the code is using several layers of abstraction over this low-level aspect and I don't have direct access to those computers.


I just checked that test, which @FiloSottile posted (go1.22.5 darwin/arm64) on machines I have access to:

This disproves my theory with differences between M1 and M2 as I checked it on same CPUs that were giving me original issues with SSH. However, something interesting came up - Apple changed something XNU which aligned PTY behaviour with Linux. Nothing posted in release notes, but those kind of internals would only be visible once they push code to XNU OSS repo (I believe they do so after RC).

If you have any specific code and dumps - I can help :)