windows CI failure - Githubissues

tonistiigi commented 3 years ago

Saw this failure on my PR branch. https://github.com/tonistiigi/buildkit/runs/1119868274 @TBBle

2020-09-15T20:44:37.1037618Z --- FAIL: TestCacheMountLockedRefs (1.78s)
2020-09-15T20:44:37.1043819Z     exec_test.go:343: 
2020-09-15T20:44:37.1069539Z            Error Trace:    exec_test.go:343
2020-09-15T20:44:37.1071822Z            Error:          mount did not unlock
2020-09-15T20:44:37.1072621Z            Test:           TestCacheMountLockedRefs
2020-09-15T20:44:37.1073226Z === CONT  TestMkdirMkfile
2020-09-15T20:44:37.1073729Z --- PASS: TestMkdirMkfile (0.00s)
2020-09-15T20:44:37.1074320Z --- PASS: TestCacheMountSharedRefs (1.62s)
2020-09-15T20:44:37.1074819Z FAIL
2020-09-15T20:44:37.1075296Z FAIL   github.com/moby/buildkit/solver/llbsolver/ops   5.465s

tonistiigi commented 3 years ago

another:

2020-09-17T02:40:34.1535028Z --- FAIL: TestProgress (0.11s)
2020-09-17T02:40:34.1536861Z     progress_test.go:38: 
2020-09-17T02:40:34.1538470Z            Error Trace:    progress_test.go:38
2020-09-17T02:40:34.1539457Z            Error:          Should be true
2020-09-17T02:40:34.1540279Z            Test:           TestProgress
2020-09-17T02:40:34.1607878Z FAIL
2020-09-17T02:40:34.1609224Z FAIL   github.com/moby/buildkit/util/progress  0.245s

https://pipelines.actions.githubusercontent.com/VSeSJsGRiRCTiu0PRaYdcZnubUW6TSVddo0DTZpqG1wDqsIbXb/_apis/pipelines/1/runs/205/signedlogcontent/5?urlExpires=2020-09-17T02%3A43%3A37.7440824Z&urlSigningMethod=HMACV1&urlSignature=TcGSBBBowGfAPBbtrLtlkWNMSlBDkUoe1j1N6PW73kg%3D

TBBle commented 3 years ago

The log URL for TestProgress has expired.

TestCacheMountLockedRefs is not failing on master, so it's not something endemically wrong. The failure is a timeout, so I suspect test parallelism might have caused a missed deadline there. I poked back, and I saw a pass for that test in 2.54s, so it's not 100% time-sensitive, just that last unlock, I suspect.

For TestProgress, I can't see how this would go wrong unless somehow the ctx got Done early. If I get time to poke at this, might be worth fixing the assert.True to be assert.Less so we (I expect) actually see the len it's getting, as "5" or "0" would be two very different likely causes. That said, I don't know the progress package API at all, so I may be misreading that test.

crazy-max commented 10 months ago

Still seeing this one: https://github.com/moby/buildkit/actions/runs/5877167737/job/15936737711?pr=4148#step:5:2053

=== Failed
=== FAIL: solver/llbsolver/mounts TestCacheMountLockedRefs (28.95s)
    mount_test.go:308: 
            Error Trace:    D:/a/buildkit/buildkit/solver/llbsolver/mounts/mount_test.go:308
            Error:          mount did not unlock
            Test:           TestCacheMountLockedRefs
panic: Fail in goroutine after TestCacheMountLockedRefs has completed

goroutine 13 [running]:
testing.(*common).Fail(0xc000059a00)
    C:/hostedtoolcache/windows/go/1.20.7/x64/src/testing/testing.go:933 +0xe5
testing.(*common).Errorf(0xc000059a00, {0x1a49072?, 0xf5c6c7?}, {0xc0001a42f0?, 0x18d91c0?, 0x501?})
    C:/hostedtoolcache/windows/go/1.20.7/x64/src/testing/testing.go:1050 +0x65
github.com/stretchr/testify/assert.Fail({0x221e775deb8, 0xc000059a00}, {0xc0000e8c00, 0x3eb}, {0x0, 0x0, 0x0})
    D:/a/buildkit/buildkit/vendor/github.com/stretchr/testify/assert/assertions.go:333 +0x378
github.com/stretchr/testify/assert.NoError({0x221e775deb8, 0xc000059a00}, {0x1b9bf20, 0xc00008fd28}, {0x0, 0x0, 0x0})
    D:/a/buildkit/buildkit/vendor/github.com/stretchr/testify/assert/assertions.go:1495 +0x105
github.com/stretchr/testify/require.NoError({0x1ba0970, 0xc000059a00}, {0x1b9bf20, 0xc00008fd28}, {0x0, 0x0, 0x0})
    D:/a/buildkit/buildkit/vendor/github.com/stretchr/testify/require/require.go:1357 +0x96
github.com/moby/buildkit/solver/llbsolver/mounts.TestCacheMountLockedRefs.func2()
    D:/a/buildkit/buildkit/solver/llbsolver/mounts/mount_test.go:291 +0x98
created by github.com/moby/buildkit/solver/llbsolver/mounts.TestCacheMountLockedRefs
    D:/a/buildkit/buildkit/solver/llbsolver/mounts/mount_test.go:289 +0x75d

@gabriel-samfira if you're interested to take a look :pray:

gabriel-samfira commented 10 months ago

@crazy-max wrapping up something today and will have a look at this tomorrow morning. I assume this is a flake. Will run this test in a loop and reproduce.

TBBle commented 10 months ago

Based on the panic, I think this is a timeout, hiding the real failure.

Note that earlier, the test failed:

    mount_test.go:308: 
            Error Trace:    D:/a/buildkit/buildkit/solver/llbsolver/mounts/mount_test.go:308
            Error:          mount did not unlock
            Test:           TestCacheMountLockedRefs
--- FAIL: TestCacheMountLockedRefs (28.95s)

i.e. it hit the two-second timeout while waiting for the goroutine at https://github.com/moby/buildkit/blob/2827be6197a73a8f6d2af24b9340ddfde0e75de7/solver/llbsolver/mounts/mount_test.go#L305-L309

The panic that comes later is coming from line 291 (the goroutine) at https://github.com/moby/buildkit/blob/2827be6197a73a8f6d2af24b9340ddfde0e75de7/solver/llbsolver/mounts/mount_test.go#L287-L294

panic: Fail in goroutine after TestCacheMountLockedRefs has completed

goroutine 13 [running]:
testing.(*common).Fail(0xc000059a00)
    C:/hostedtoolcache/windows/go/1.20.7/x64/src/testing/testing.go:933 +0xe5
testing.(*common).Errorf(0xc000059a00, {0x1a49072?, 0xf5c6c7?}, {0xc0001a42f0?, 0x18d91c0?, 0x501?})
    C:/hostedtoolcache/windows/go/1.20.7/x64/src/testing/testing.go:1050 +0x65
github.com/stretchr/testify/assert.Fail({0x221e775deb8, 0xc000059a00}, {0xc0000e8c00, 0x3eb}, {0x0, 0x0, 0x0})
    D:/a/buildkit/buildkit/vendor/github.com/stretchr/testify/assert/assertions.go:333 +0x378
github.com/stretchr/testify/assert.NoError({0x221e775deb8, 0xc000059a00}, {0x1b9bf20, 0xc00008fd28}, {0x0, 0x0, 0x0})
    D:/a/buildkit/buildkit/vendor/github.com/stretchr/testify/assert/assertions.go:1495 +0x105
github.com/stretchr/testify/require.NoError({0x1ba0970, 0xc000059a00}, {0x1b9bf20, 0xc00008fd28}, {0x0, 0x0, 0x0})
    D:/a/buildkit/buildkit/vendor/github.com/stretchr/testify/require/require.go:1357 +0x96
github.com/moby/buildkit/solver/llbsolver/mounts.TestCacheMountLockedRefs.func2()
    D:/a/buildkit/buildkit/solver/llbsolver/mounts/mount_test.go:291 +0x98
created by github.com/moby/buildkit/solver/llbsolver/mounts.TestCacheMountLockedRefs
    D:/a/buildkit/buildkit/solver/llbsolver/mounts/mount_test.go:289 +0x75d
FAIL    github.com/moby/buildkit/solver/llbsolver/mounts    44.633s

That's a real failure (assert.NoError would not have called assert.Fail if err was nil) but since it panicked, we never saw the content of that error.

Given that the panic happened more than 15 seconds after the test reported failure, potentially a much higher timeout is required on Windows in order to catch an actual failure. Or maybe Go should actually tell us what the error was in that panic... not that we can change that. (We could locally output the error before calling assert.NoError as that might make it easier to track down, without increasing the timeout; I think an extra 28 seconds in a failed run isn't a lot on CI, but this might work for tight-loop reproduction... although running just this test in a loop might destroy the goroutine before it finishes, hiding the failure? I'm not sure if Go enforces joining on goroutines at process termination.)

Also side-note, it looks like the test runner, when it sees two failures for one test like that, gives a summary (quoted in https://github.com/moby/buildkit/issues/1685#issuecomment-1680330891) which includes the second failure, but gives the run-time of the first failure.

moby / buildkit

windows CI failure #1685