testscript: "signal: killed" exec errors on MacOS 12

mvdan commented 1 year ago

     --- FAIL: TestScript/flags (0.11s)
        testscript.go:429: > exec shfmt -h
            [signal: killed]
            FAIL: testdata/script/flags.txtar:1: unexpected command failure

I've seen this in a number of projects of mine, like:

@rvagg mentions the same crash in https://github.com/ipld/go-car/pull/364, and in the past, others like @mr-joshcrane have mentioned the same error on Slack.

This must be something going wrong with either testscript or Go, because for example, that TestScript/flags test from above was just running exec shfmt -h, showing the help output from a Go program. You can see that the testscript file is rather boring, so it's not doing anything particularly worrying.

Personally, I've worked around this by downgrading from macos-latest on GitHub Actions (which switched to macos-12 late last year) to macos-11, which seems to make the failures go away entirely. But of course that's not a complete fix.

I first hoped that this would be fixed in Go 1.20 with https://go-review.googlesource.com/c/go/+/460476, and that may still be true, given that there are four distinct os/exec bugs for Mac there. But it's just a good guess, I haven't verified this yet - nor do I have a Mac machine to test with. Help would be appreciated.

The only other recent mentions of "signal: killed" upstream for Mac are https://github.com/golang/go/issues/57418 and https://github.com/golang/go/issues/57239, and they both seem to point to processes being OOM-killed by the system. This could be the case for us as well, perhaps either due to the OS version upgrade changing the OOM behavior, or perhaps because the macos-12 GitHub machines have less available memory. But I'd also find it hard to believe, given that testscript doesn't use a particularly high amount of memory.

Filing this issue to track investigation and progress.

bitfield commented 1 year ago

Just as a data point, I've never managed to reproduce this on macOS 10, despite many attempts.

mvdan commented 1 year ago

I've never seen this error on macos-10 nor macos-11 on GitHub Actions despite having used testscript for years, only on its macos-12, so I'm pretty sure that this bug only happens on MacOS 12.

mvdan commented 1 year ago

Just bit me in the last PR as well: https://github.com/rogpeppe/go-internal/actions/runs/4125273301/jobs/7125604762

--- FAIL: TestSimple (0.05s)
    --- FAIL: TestSimple/cover (3.25s)
        testscript.go:524: # The module uses testscript itself.
            # Use the checked out module, based on where the test binary ran. (0.692s)
            # First, a 'go test' run without coverage. (1.616s)
            # Then, a 'go test' run with -coverprofile.
            # The total coverage after merging profiles should end up being 100%.
            # Marking all printlns as covered requires all edge cases to work well.
            # Go 1.20 learned to produce and merge multiple coverage profiles,
            # so versions before then report a shallow 0% coverage. (0.920s)
            > go test -vet=off -coverprofile=cover.out -v
            [stdout]
            === RUN   TestFoo
                foo_test.go:22: failed to determine release tags from go command: signal: killed

            --- FAIL: TestFoo (0.00s)
            FAIL
            coverage: 0.0% of statements
            exit status 1
            FAIL    test    0.147s

            [exit status 1]
            FAIL: testdata/cover.txt:18: unexpected go command failure

FAIL
FAIL    github.com/rogpeppe/go-internal/gotooltest  3.594s

ldemailly commented 1 year ago

I see this as well

bep commented 1 year ago

I see this as well, but on Ubuntu (ubuntu-latest) with Go 1.20. I needed to remove Go 1.20 from the build matrix of one project ... I will investigate later.

https://github.com/gohugoio/hugoreleaser/issues/38

mvdan commented 1 year ago

@bep perhaps I'm missing something, but I can't find a single mention of either "signal" or "killed" in those logs. The failures seem pretty normal - some commands inside your testscripts are failing when the script expects them to succeed.

bep commented 1 year ago

@mvdan you're right about that, but all the tests (which uses the test script package in this repo) fails with FAIL: testscripts/misc/flags_in_env.txt:1: unexpected command failure (which I also see in the first comment in this issue, which is why I thought there would be a connection). Anyhow: For me, all the tests fail for me on Go 1.20 on Ubuntu (on GitHub) for some reason that seem to come from testscript.Run and not my tests. It sometimes fails with the same error message on my local MacBook with Go 1.20, too (MacOS 13), but it's mostly OK. On Go 1.19 it is all green. I will do some more checking later ...

mvdan commented 1 year ago

"unexpected command failure" is one of the relatively normal error messages you will get with testscript when the tests fail. This particular bug only happens on MacOS 12, and noone has been able to reproduce it on Linux that I am aware of, so it seems very unlikely that you're running into the same bug.

bep commented 1 year ago

"unexpected command failure" is one of the relatively normal error messages you will get with testscript when the tests fail.

As I said, I'm pretty sure these unexpected command failure are ... unexpected and that they come from flakiness in the test infrastructure that's become more flaky in Go 1.20. But you're right, It's probably a different flakines issue; I will create a different issue somewhere once I figure out a little better who to blaim.

mvdan commented 1 year ago

@rogpeppe has been seeing the same failures now on macos-11, in PRs to this very repo. In particular, the OS version appears to be 11.7.4, and the failures happen on both Go 1.19.x and 1.20.x.

Perhaps whatever changed in macos-12 to trigger this bug was backported to macos-11 now.

ldemailly commented 1 year ago

I guess (possibly wrong) it's a background version of what happens when you run a new binary:

followed by "Killed: 9" when clicking ok

bitfield commented 1 year ago

@ldemailly, that sounds like a reasonable idea, but if that's the case, how come it doesn't happen every time, and how come it doesn't happen when you run go test? I mean, that is building and executing a binary in the background. All testscript does is re-exec the same binary, so I can't work out why testscript commands appear to fail frequently (but apparently randomly) on all modern macOS versions, while go test never does.

bep commented 1 year ago

This has become a real issue for me, so I decided to take a look at it, and I found that there's a correlation between the size/amount of commands added to RunMain and this issue. I added a patch in the PR below that fixes this on my MacBook, a sleep after the commands gets written to disk (which I assume allow the OS to flush) and before the test execution.

See https://github.com/rogpeppe/go-internal/pull/219

myitcv commented 1 year ago

Drive-by noting another alternative here to hard linking on macOS: copy on write. That probably requires detection of some sort and a specific system call) but it will be more efficient than a full copy (because there will be no write).

Here is the man page for clonefile2:

https://www.unix.com/man-page/mojave/2/clonefile/

mvdan commented 1 year ago

Thanks, @myitcv. @bep reports that symlinks don't have the problem on MacOS that hard links do, which sort of makes sense: we are not creating new executable files, just symbolic links. Unless anyone disagrees, I think giving symlinks a go is worthwhile. We can then see if they cause any unforeseen issues on either Linux or Mac - I hope not. They should also be slightly more versatile than hard links, as hard links generally require both ends to be on the same filesystem.

mvdan commented 1 year ago

Hit enter too soon :) If the symlink approach fails, then I agree that we could try clonefile before falling back to something like full copies.

mvdan commented 1 year ago

@myitcv correctly points out that I wrote this bit of text in the copyBinary godoc, which does explain why I originally tried symlinks and moved away from them:

// It does not attempt to use symlinks for two reasons:
//
// First, some tools like cmd/go's -toolexec will be clever enough to realise
// when they're given a symlink, and they will use the symlink target for
// executing the program. This breaks testscript, as we depend on os.Args[0] to
// know what command to run.
//
// Second, symlinks might not be available on some environments, so we have to
// implement a "full copy" fallback anyway.

Also worth noting that https://pkg.go.dev/os#Executable says:

If a symlink was used to start the process, depending on the operating system, the result might be the symlink or the path it pointed to. If a stable result is needed, path/filepath.EvalSymlinks might help.

The API makes no guarantees, but if any code currently makes assumptions about the executable being a regular file, that might be silently broken with the change to symlinks.

I'll take another look at my usecase with go build -toolexec to see if it still gets confused by symlinks.

As for the API that Paul suggests, it's at https://pkg.go.dev/golang.org/x/sys/unix?GOOS=darwin#Clonefile. @bep does that reproduce the bug? Presumably it doesn't result in higher disk usage. https://eclecticlight.co/2023/04/28/apfs-hard-links-symlinks-aliases-and-clone-files-a-summary/ explains how the system treats hard links and clone files differently. Given that the Finder UI uses clone files and not hard links by default when making copies of files within the same filesystem, it sounds to me like clone files might behave better with executables.

myitcv commented 1 year ago

As for the API that Paul suggests, it's at https://pkg.go.dev/golang.org/x/sys/unix?GOOS=darwin#Clonefile.

Just to check (because I'm naive in these things), is this definitely the syscall for clone2?

mvdan commented 1 year ago

It's clonefile, the one you linked to above: https://cs.opensource.google/go/x/sys/+/refs/tags/v0.8.0:unix/zsyscall_darwin_amd64.go;l=1011

Where do you see clone2? I only see clonefile(2), but that is just the rather confusing man page syntax to say that clonefile is in the category 2 System calls (functions provided by the kernel), per man man.

bep commented 1 year ago

So, this works:

if runtime.GOOS != "windows" {
    if runtime.GOOS == "darwin" {
        if err := unix.Clonefile(from, to, 0); err == nil {
            return nil
        }
    } else {
        if err := os.Link(from, to); err == nil {
            return nil
        }
    }
}

But it seem to create a copy of the binary[^1]. Or, according to the docs, it shouldn't...

[^1]: I don't worry too much about that anymore, but it may be an issue in the wild. I have figured out that I can move most of my utility stuff into testscript.Params.Cmds which I assume gets compiled into the test binary.

mvdan commented 1 year ago

Full copy on Windows, Clonefile on Mac, and hard links on Linux sound good to me. Beware that we likely need build tags now, since unix.Clonefile is only defined for GOOS=darwin. Perhaps we can add a cloneFile func which is os.Link on unix && !darwin, unix.CloneFile on darwin, and return fmt.Errorf("unavailable") on !unix.

testscript.Params.Cmds don't run as separate programs and processes via os/exec, instead running in-process via a function call. That makes them a lot cheaper indeed. Less powerful as well, but for small test helpers and utilities, I find that they are often enough. In medium-to-large projects, I often end up with 1-3 TestMain commands, often to run the "real" main program in different modes, and 5-15 testscript.Params commands.

bep commented 1 year ago

Full copy on Windows, Clonefile on Mac, and hard links on Linux sound good to me. Beware that we likely need build tags now, since unix.Clonefile is only defined for GOOS=darwin. Perhaps we can add a cloneFile func which is os.Link on unix && !darwin, unix.CloneFile on darwin, and return fmt.Errorf("unavailable") on !unix.

I can prepare a PR with something ala the above in a few hours.

myitcv commented 1 year ago

@bep - thanks very much for digging in here to get this fixed.

ldemailly commented 1 year ago

it's fantastic to have a solution, I'm just a bit confused as to why symlink isn't even better/simpler?

mvdan commented 1 year ago

@ldemailly see the previous comments, particularly https://github.com/rogpeppe/go-internal/issues/200#issuecomment-1536127648.

ldemailly commented 1 year ago

oic "[tools like go] will use the symlink target"

that seems like a bug though if they cause the args0 to change as it's fairly common to rely on that (also that mr did pass all tests so presumably it's actually working ?)

what's the size of the directory when using clonefile?

ldemailly commented 1 year ago

oic, neat, from the man page:

 The cloned file dst shares its data blocks with the src file [...]

 Subsequent writes to either the original or cloned file are private to
 the file being modified (copy-on-write).

except for EXDEV possible issue (are we sure we stay on same FS?)

mvdan commented 1 year ago

There is no guarantee that the source and destination are the same filesystem. Which is why there is a fallback with regular file copying. However, they will often be the same filesystem, as go test and testing both place temporary files under os.TempDir.

ldemailly commented 1 year ago

Thanks for the explanation (that there is a fallback already).

So a symlink, in theory, if not for that (possibly outdated?) go build -toolexec issue would be always working (and smallest disk use).

Either way I think the current solution is great and matches the previous hardlink behavior (fs wise), so it's awesome and indeed I don't see "killed" anymore on my mac 🎉

rogpeppe / go-internal

testscript: "signal: killed" exec errors on MacOS 12 #200