rogpeppe / go-internal

Selected Go-internal packages factored out from the standard library
BSD 3-Clause "New" or "Revised" License
823 stars 67 forks source link

testscript: fix "signal: killed" exec errors on MacOS by adding a 200ms sleep on darwin #219

Closed bep closed 1 year ago

bep commented 1 year ago

On MacOS there are lots of reports of unexpected failing tests with output similar to this:

 [signal: killed]
            FAIL: testscripts/myecho.txt:1: unexpected command failure

On CI builds the workaround has been to downgrade to a builder with MacOS <= 11 (e.g. macos-11 on GitHub).

In development on a MacBook, this is not an option.

This commit works around what seem to be a upstream bug in os.Link by adding a small sleep before TestingM.Run() to allow the write of the test command symlinks to be ready.

See #200

uname -a                                                                                                             
Darwin bep-mac14.local 22.3.0 Darwin Kernel Version 22.3.0: Thu Jan  5 20:48:54 PST 2023; root:xnu-8792.81.2~2/RELEASE_ARM64_T6000 arm64
go version                                                                                                           
go version go1.20.3 darwin/arm64

Fixes #200

mvdan commented 1 year ago

Thanks for digging into this! It's great to have someone with a machine that is able to reproduce do this investigation, as I was unable to.

// Make sure all of the io operations above gets some time to complete.

I'm curious, what would make them not finish in time? We wait for the io.Copy and os.File.Close calls to finish, which I assume are enough. I guess we could also call https://pkg.go.dev/os#File.Sync, but at least in my past experience, that has never been needed.

How did you arrive at the 50ms? Is that number long enough to make the failures virtually impossible on your laptop? I'm almost tempted to double or even quadruple that number if so, because CI runners are often many times slower than a modern laptop.

I'd also like to have this sleep behind a runtime.GOOS == "darwin"; the other platforms are entirely unaffected.

Also, interesting that CI still failed with the same error on Mac :)

Finally - it would be good to still figure out a way to make this work well without sleeping. Sleeping is always going to be either too short and flakey, or too long and wasteful. If there really isn't anything we can do but sleep, perhaps we could do something a bit better, such as sleeping in a loop until calling the last of the commands as a no-op (e.g. with -h) results in a normal execution.

mvdan commented 1 year ago

Also note that the failure in CI above is from executing go, so it's in theory entirely unrelated to the programs we set up in $PATH. I've always wondered if the cause was that testscript is parallel by default, and executing so many binaries concurrently might make MacOS unhappy.

bep commented 1 year ago

How did you arrive at the 50ms? Is that number long enough to make the failures virtually impossible on your laptop? I'm almost tempted to double or even quadruple that number if so, because CI runners are often many times slower than a modern laptop.

That's a rather arbitrary number. I did test higher, but not lower numbers. My test case is pretty aggressive and I can easily run that may times without any error -- but this is the setup part of the scripts, so there's no reason not to go for a longer sleep.

Finally - it would be good to still figure out a way to make this work well without sleeping.

Sure, but there's a lot of of file copying and mkdiring in RunMain, so digging into that sounds like ... work. My best guess is file syncing, but in any case there's certainly a data race going on where the binaries gets executed before they're somehow written to disk (or a variation of that theme).

I'd also like to have this sleep behind a runtime.GOOS == "darwin"; the other platforms are entirely unaffected.

I have force pushed (bad habit ...) a version with 200ms sleep + a runtime.GOOS == "darwin" conditional. Let's see if the test passes (the failing test must be unrelated).

ldemailly commented 1 year ago

That's awesome if it works, can you try File.Sync instead of the sleep (if possible) to see if that solves it as well? (I guess the issue is there are a lot of Close() and it's not all in one place)

bep commented 1 year ago

That's awesome if it works, can you try File.Sync instead of the sleep (if possible) to see if that solves it as well? (I guess the issue is there are a lot of Close() and it's not all in one place)

I'll see if I can find some time during the day. So, it fixes my problem, so in that sense it works. I run tests on save in my editor, and before this patch I get this error at about 1/4 of the test runs (depending on the size of the project under test), and have to rerun the test, which breaks my workflow pretty bad. Whether this fixes the CI problem is yet to be proven, but the symptoms are the same, so I would be surprised if it didn't.

bep commented 1 year ago

OK, I tried a variant to add a Sync, but that didn't fix it. Turns out the problem lives here:

https://github.com/rogpeppe/go-internal/blob/master/testscript/exe.go#L126

mvdan commented 1 year ago

Interesting, I wouldn't have guessed that the hard links were to blame. Note that they're hard links though, not symbolic links. I can't quite remember why I chose os.Link rather than os.Symlink, but I imagine it's because hard links are much closer to real files than symbolic links. And symbolic links always have their own set of problems, like breaking if the target is removed.

Disabling symlinks above (like it's done on Windows) would be a worse fix than this patch, so I suggest merging this and then create an issue in Go that can (maybe) do a better fix in the future.

I would actually say the opposite. Copying a binary will generally take much, much less than 50ms or 200ms. A hard link is typically even faster, yes, but here for MacOS we're aiming for avoiding the failures over speed. If disabling hard links on Mac fixes the issue, like we already do on Windows, then I would say that's definitely a better fix than an arbitrary sleep.

bep commented 1 year ago

Note that they're hard links though, not symbolic links. I can't quite remember why I chose os.Link rather than os.Symlink

OK, my mistake.

Copying a binary will generally take much, much less than 50ms or 200ms.

You're duplicating the binary n times (where n is the number of commands in the map, which can be many). If you have many commands/big binaries, I doubt you can beat 200ms, but it doesn't matter much.

I have created this PR:

https://github.com/rogpeppe/go-internal/pull/220

Which also fixes this issue for me. Pick the one you like.

mvdan commented 1 year ago

Both have passed CI now, so they both seem to help. I'm inclined to go with the other PR, just because sleeps are always going to be flakey or slow. I'll leave both PRs up for 24h to see if other Mac users chime in.

bep commented 1 year ago

@mvdan note that you should also bump the macos version here:

https://github.com/rogpeppe/go-internal/blob/master/.github/workflows/test.yml#L19

But I'm pretty sure that's not picked up from a PR from a fork, so you should do that as a follow up. As I said, I'm 100% sure that this fixes my problem, but I'm not 100% sure that it also fixes all CI related problems.

bep commented 1 year ago

@mvdan I had a second look on this, and I think I want to put in a vote for the "sleep fix". I checked the bin folders for a test I have running:

This patch:

41M

The other patch:

493M

The way the binaries are created are very clever, but it means lots of potentially very large files created on each run if your not able to hard link it.

mvdan commented 1 year ago

Yeah, more than happy to bump the macos version back to latest once we're reasonably sure that the flakes are solved.

The way the binaries are created are very clever, but it means lots of potentially very large files created on each run if your not able to hard link it.

That's true, although these are temporary files, so does the size matter that much? Speaking for myself, a 200ms sleep on every test run would bother me more than using another 50-100MiB worth of /tmp per run. Most of the testscript use cases I've seen in the past declare at most a handful of top-level commands.

I'd of course love to find a solution to continue using hard links. Sleeping doesn't seem like a good solution :) If we knew what to poll for, or somehow tell when the binaries are "ready" to be executed, then at least we could incrementally sleep until done.

bep commented 1 year ago

That's true, although these are temporary files, so does the size matter that much? Speaking for myself, a 200ms sleep on every test run would bother me more than using another 50-100MiB worth of /tmp per run.

You understand that my numbers above are 50MiB vs 500MiB, right? I have tests that 1. I run on save in my editor and 2. Typically takes a second or two to finish (so I don't care too much about 200ms extra). For arguments sake, let's say I have a bigger project with more commands so it becomes 1GB on every save in my editor, that's a lot of SSD writes during a year ...

Also, knowing this is a issue with os.Link we can

rogpeppe commented 1 year ago

@bep You've probably tried this already, but given that you can reproduce this problem reasonably reliably, it might be interesting to try stat-ing and reading the contents of the link just after creating it. I'm specifically wondering if it can ever happen that that succeeds when the later exec fails. If not, then this could be used as the basis for a polling loop, assuming the availability is a global property of the linked file.

Another more direct way to implement a polling loop is actually to try to exec the commands immediately after creating the links. We could set some environment variable to make the action of the command a no-op in this case.

(For the record, I'm not really in favour of either the sleep or the copy-the-binary solution, unless those really are the only way forward).

bep commented 1 year ago

@rogpeppe I have tried a few variations of stating and opening and reading some bytes from the file, but none of those work. I guess one could try to read the entire file in a poll loop, but...

However, I have tested this successfully, which is now my current favourite solution:

https://github.com/rogpeppe/go-internal/pull/221

ldemailly commented 1 year ago

Would symlinks instead of hardlinks work better/out of the box? I vaguely remember issues with hardlinks turning to copies when crossing fs and maybe tmpdir is different fs?

mvdan commented 1 year ago

@ldemailly see https://github.com/rogpeppe/go-internal/pull/221.

mvdan commented 1 year ago

@rogpeppe I have tried a few variations of stating and opening and reading some bytes from the file, but none of those work. I guess one could try to read the entire file in a poll loop, but...

Thanks for checking those. If symlinks don't have this race problem, then hopefully that's the easiest, as we don't need any sleeping nor polling.