moby / buildkit

concurrent, cache-efficient, and Dockerfile-agnostic builder toolkit

An intermediate stage gets built multiple times in parallel #2030

Open haines opened 3 years ago

haines commented 3 years ago

Expected behaviour

When building multiple images from a multistage Dockerfile, BuildKit only builds each stage once.

Actual behaviour

When some layers are cached, BuildKit builds one of the intermediate stages multiple times in parallel (once for each of the target stages that depend on it).

Background

We have a multistage Dockerfile with 18 stages, and we use docker buildx bake to produce images from 7 of those stages (the other 11 being intermediate stages required to build the final 7).

One of the intermediate stages, packs, runs webpack, which consumes a lot of memory. Four of the seven final stages depend on the packs stage.

Normally, everything works fine: the packs stage gets built once, and we copy files from it into the dependent stages.
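For context, the setup has roughly the following shape. This is only an illustrative sketch, not our real 18-stage Dockerfile: the stage names other than packs, the base images, and the paths are all made up.

```dockerfile
# syntax=docker/dockerfile:1
# Illustrative sketch; only the "packs" stage name is real.
FROM node:16 AS deps
WORKDIR /app
COPY package.json yarn.lock ./
RUN yarn install --frozen-lockfile

FROM deps AS packs
COPY frontend/ ./frontend/
# the memory-heavy webpack build that several final stages depend on
RUN yarn run build

FROM nginx:alpine AS web
COPY --from=packs /app/dist/ /usr/share/nginx/html/

FROM nginx:alpine AS admin
COPY --from=packs /app/dist/ /usr/share/nginx/html/
```

```hcl
# docker-bake.hcl (illustrative): several bake targets point at different
# final stages of the same Dockerfile, all of which depend on packs.
group "default" {
  targets = ["web", "admin"]
}

target "web" {
  target = "web"
}

target "admin" {
  target = "admin"
}
```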

The problem

On one branch of our repository, though, we've run into a snag: the packs stage is being built separately for each of the final stages that depend on it. Running webpack 4 times in parallel exhausts the memory on the build server, and the process dies with exit code 137.

What's particularly surprising is that the issue only occurs if we have build cache from the main branch on the build server. If I docker rm -f -v buildx_buildkit_default to get back to a clean state, then the problematic branch builds fine, only invoking packs once. However, if I first build the main branch after dropping the Buildkit container, then build the problematic branch, it always invokes packs 4 times in parallel.
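The sequence that reproduces it on our build server looks roughly like this (the branch name below is illustrative; the commands are the ones described above):

```shell
# reset the builder to a clean state (drops all BuildKit cache)
docker rm -f -v buildx_buildkit_default

# build main first so its layers populate the cache
git checkout main
docker buildx bake            # packs runs once, as expected

# build the problematic branch on top of that cache
git checkout restructure-frontend   # branch name is illustrative
docker buildx bake            # packs now runs 4 times in parallel
```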

The difference between the problematic branch and main is limited to some file renames and corresponding path changes in the Dockerfile, which busts the cache on some of the earliest layers in the build.

Reproduction / logs

I haven't been able to extract a minimal reproduction - I've tried with simple Dockerfiles and can't trigger the issue. I have the Dockerfiles and build logs here in case it helps.

Any ideas on what might be triggering this behaviour, or how to debug it, would be hugely appreciated!

tonistiigi commented 3 years ago

Hard to say anything about it without a runnable reproducer, as the files in https://github.com/zencargo/buildkit-multistage-build-issue are quite large. It's also unclear what the difference is between the working and non-working Dockerfile in this case.

I guess it might have something to do with the session not currently being shared across the multiple requests made via bake. That means the build context with local files remains different for each request. But after the checksums are checked, the solver should merge all of them together. So it is expected to see multiple rows in the progress output, but they should be backed by the same process. AFAICS that is not the case based on the output.

Do I understand correctly that this is with regular local cache?

haines commented 3 years ago

Hard to say anything about it without a runnable reproducer

Yeah, understood, I will try again to create a reproduction.

Also unclear what is the difference between working/non-working dockerfile for the issue case?

There's not much difference - basically we are restructuring the application repository by moving about 4000 files from frontend/ to frontend/src/, so the changes in the Dockerfile are pretty much just updating paths. The result is that a lot of the early layers in the build miss the cache.
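To illustrate the kind of change (the exact paths are simplified), it amounts to updating a path prefix in the early COPY instructions:

```dockerfile
# on main
COPY frontend/ ./frontend/

# on the problematic branch: the same ~4000 files under a new prefix,
# so this layer and everything after it misses the cache
COPY frontend/src/ ./frontend/src/
```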

This means that it is expected to have multiple rows in progress but they should be backed by the same process. Afaics that is not the case based on output.

Yeah, when we check on the build server with htop we can see multiple instances of the process running, and the memory usage increasing accordingly.

Do I understand correctly that this is with regular local cache?

Yep, that's right. It seems to be totally deterministic - without cache it always runs 1 process and with cache from main it always runs 4.

akkuman commented 2 years ago

I have the same problem. The log can be seen at https://github.com/akkuman/docker-msoffice2010-python/runs/4322540814?check_suite_focus=true under "Build and push": it builds the downloader stage concurrently many times.

And my workflow file is at https://github.com/akkuman/docker-msoffice2010-python/blob/7b90ab8/.github/workflows/docker.yml

tonistiigi commented 2 years ago

@akkuman That looks like some kind of display issue. All these steps have the same id #13, so it means that in the builder it only ran once. Hard to know why it didn't stop printing after the first one without a runnable reproducer (it looks like your current case requires a specific cache state).