moby / buildkit

concurrent, cache-efficient, and Dockerfile-agnostic builder toolkit
https://github.com/moby/moby/issues/34227
Apache License 2.0
8.09k stars 1.14k forks source link

v0.11.5: interrupting `Solve()` no longer interrupts container processes #3751

Closed vito closed 1 year ago

vito commented 1 year ago

After bumping Dagger to Buildkit v0.11.5 we noticed service containers stay running rather than stopping after we interrupt them.

We tracked it down https://github.com/moby/buildkit/pull/3722 and confirmed reverting it fixes the issue.

Here's a test (for client/client_test.go) that will pass in v0.11.4 but fail in v0.11.5. It runs a container that sleeps and writes to a cache, interrupts it before the sleep finishes, waits for longer than the sleep, and then checks that the file doesn't exist in the cache.

func testInterruptKills(t *testing.T, sb integration.Sandbox) {
    requiresLinux(t)
    c, err := New(sb.Context(), sb.Address())
    require.NoError(t, err)
    defer c.Close()

    busybox := llb.Image("busybox:latest")

    withCache := llb.AddMount(
        "/wd",
        llb.Scratch(),
        llb.AsPersistentCacheDir("mycache1", llb.CacheMountShared),
    )

    writeDef, err := busybox.Run(
        llb.Shlex(`sh -c "sleep 10 && touch grass"`),
        llb.Dir("/wd"),
        withCache,
    ).Marshal(sb.Context())
    require.NoError(t, err)

    interruptCtx, cancel := context.WithTimeout(sb.Context(), time.Second)
    defer cancel()

    _, err = c.Solve(interruptCtx, writeDef, SolveOpt{}, nil)
    require.Error(t, err)
    require.Contains(t, err.Error(), "context deadline exceeded")

    time.Sleep(20 * time.Second)

    readDef, err := busybox.Run(
        llb.Shlex(`cat grass`),
        llb.Dir("/wd"),
        withCache,
    ).Marshal(sb.Context())
    require.NoError(t, err)

    // grass should not exist
    _, err = c.Solve(sb.Context(), readDef, SolveOpt{}, nil)
    require.Error(t, err)
}

There's probably a better way to actually test this, the above was just the first idea that popped into my head; it'd make for a very slow test.

I haven't looked too deeply at the original PR, but I did notice that it seems to switch from running runc kill <id> to sending SIGKILL to the runc process, which seems semantically different, but maybe that's the idea.

cc @coryb @tonistiigi @sipsma

coryb commented 1 year ago

Thanks for the report, I will take a look. Theoretically, the change was to send the sigkill to the process managed by runc (the pid1 in the container), not to runc itself. Clearly something is not matching my expectations.