haines opened 3 years ago
Hard to say anything about it without a runnable reproducer, as the files in https://github.com/zencargo/buildkit-multistage-build-issue are quite huge. It's also unclear what the difference is between the working and non-working Dockerfile in the issue case.

I guess it might have something to do with the session not currently being shared between multiple requests via bake. That means the build context with local files remains different per request. But after the checksums are checked, the solver should merge all of them together. This means it is expected to see multiple rows in the progress output, but they should be backed by the same process. As far as I can see, that is not the case based on the output.
Do I understand correctly that this is with regular local cache?
> Hard to say anything about it without a runnable reproducer
Yeah, understood, I will try again to create a reproduction.
> Also unclear what is the difference between working/non-working dockerfile for the issue case?
There's not much difference: basically we are restructuring the application repository by moving about 4000 files from `frontend/` to `frontend/src/`, so the changes in the Dockerfile are pretty much just updating paths. The result is that a lot of the early layers in the build miss the cache.
> This means that it is expected to have multiple rows in progress but they should be backed by the same process. Afaics that is not the case based on output.
Yeah, when we check on the build server with `htop` we can see multiple instances of the process running, and the memory usage increasing accordingly.
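One quick way to confirm the duplication is to count the worker processes, assuming they show up as `webpack` entries in `ps` output. The captured output below is simulated so the sketch is self-contained; on the real build server you would pipe `ps aux` instead:

```shell
# Simulated `ps aux` capture (hypothetical paths) standing in for
# the real output on the build server.
ps_output="node /app/node_modules/.bin/webpack
node /app/node_modules/.bin/webpack
node /app/node_modules/.bin/webpack
node /app/node_modules/.bin/webpack"

# Count how many webpack processes are running; more than one for the
# same build suggests the shared stage is not being deduplicated.
count=$(printf '%s\n' "$ps_output" | grep -c 'webpack')
echo "$count"   # prints 4 for the simulated output
```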
> Do I understand correctly that this is with regular local cache?
Yep, that's right. It seems to be totally deterministic: without cache it always runs 1 process, and with cache from `main` it always runs 4.
I have the same problem; the log can be seen in the "Build and push" step at https://github.com/akkuman/docker-msoffice2010-python/runs/4322540814?check_suite_focus=true. It builds the downloader stage concurrently many times.

And my workflow file is at https://github.com/akkuman/docker-msoffice2010-python/blob/7b90ab8/.github/workflows/docker.yml
@akkuman That looks like some kind of display issue. All these steps have the same id `#13`, so it means that in the builder it only ran once. Hard to know why it didn't stop printing after the first one without a runnable reproducer (it looks like your current case requires a specific cache state).
Expected behaviour
When building multiple images from a multistage Dockerfile, Buildkit only builds each stage once.
Actual behaviour
When some layers are cached, Buildkit builds one of the intermediate stages multiple times in parallel (once for each of the target stages that depend on it).
Background
We have a multistage Dockerfile with 18 stages, and we use `docker buildx bake` to produce images from 7 of those stages (the other 11 being intermediate stages required to build the final 7). One of the intermediate stages, `packs`, runs webpack, which consumes a lot of memory. 4 of the 7 final stages depend on the `packs` stage. Normally, everything works fine: the `packs` stage gets built once, and we copy files from it into the dependent stages.

The problem
On one branch of our repository, though, we've run into a snag: the `packs` stage is being built separately for each of the final stages that depend on it. Running webpack 4 times in parallel exhausts the memory on the build server, and the process dies with exit code 137.

What's particularly surprising is that the issue only occurs if we have build cache from the main branch on the build server. If I run `docker rm -f -v buildx_buildkit_default` to get back to a clean state, then the problematic branch builds fine, only invoking `packs` once. However, if I first build the main branch after dropping the Buildkit container, then build the problematic branch, it always invokes `packs` 4 times in parallel.

The difference between the problematic branch and main is limited to some file renames and corresponding path changes in the Dockerfile, which busts the cache on some of the earliest layers in the build.
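For context, the shape of the Dockerfile is roughly this sketch (all stage names except `packs` are illustrative, not our real ones):

```dockerfile
# syntax=docker/dockerfile:1
FROM node:16 AS base
WORKDIR /app
COPY frontend/ frontend/

# Shared, expensive intermediate stage: runs webpack once.
FROM base AS packs
RUN yarn --cwd frontend install && yarn --cwd frontend run webpack

# Several final stages each copy the compiled assets from `packs`.
FROM base AS web
COPY --from=packs /app/frontend/dist /app/public

FROM base AS worker
COPY --from=packs /app/frontend/dist /app/public
```

With warm cache from main, each final target appears to trigger its own build of `packs` instead of sharing one.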
Reproduction / logs
I haven't been able to extract a minimal reproduction - I've tried with simple Dockerfiles and can't trigger the issue. I have the Dockerfiles and build logs here in case it helps.
Any ideas on what might be triggering this behaviour, or how to debug it, would be hugely appreciated!
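In case it helps, our bake definition has the same shape as this sketch (group and target names here are hypothetical); `docker buildx bake --print` can be used to dump the resolved definition when debugging:

```hcl
# docker-bake.hcl (illustrative; target names are not our real ones)
group "default" {
  targets = ["web", "worker"]
}

target "web" {
  dockerfile = "Dockerfile"
  target     = "web"
  tags       = ["registry.example.com/app/web:latest"]
}

target "worker" {
  dockerfile = "Dockerfile"
  target     = "worker"
  tags       = ["registry.example.com/app/worker:latest"]
}
```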