moby / buildkit

concurrent, cache-efficient, and Dockerfile-agnostic builder toolkit
https://github.com/moby/moby/issues/34227
Apache License 2.0

Improved debugging support #1472

Open tonistiigi opened 4 years ago

tonistiigi commented 4 years ago

addresses #1053 addresses #1470

An issue with the current build environment is that we often assume everyone can write a perfect Dockerfile from scratch without any mistakes. In the real world, there is a lot of trial and error involved in writing a complex Dockerfile. Users get errors, need to understand what is causing them, and react accordingly.

In the legacy builder, one of the methods for dealing with this situation was to use --rm=false, or to look up the image ID of the last image layer from the build output and start a docker run session with it to understand what was wrong. BuildKit does not create intermediate images, nor does it make the containers it runs visible to docker run (both for very good reasons). Therefore this is even more complicated now and usually requires the user to set --target to do a partial build and then debug the output of it.
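
For reference, the legacy-builder workflow described above looks roughly like this (the layer ID is an illustrative placeholder):

DOCKER_BUILDKIT=0 docker build .
# ...
# Step 4/12 : COPY ./package-lock.json .
#  ---> <last-good-layer-id>
# Step 5/12 : RUN npm install            <- this step fails
docker run -it --rm <last-good-layer-id> sh   # poke around from the last good layer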

To improve this, we shouldn't try to bring back --rm=false, which makes all builds significantly slower and makes it impossible to manage storage for the build cache. Instead, we could provide a better solution with a new --debugger flag.

Using --debugger on a build, should that build error, will take the user into a debugger shell similar to the interactive docker run experience. There the user can see the error and use control commands to debug the actual cause.

If the error happened on a RUN command (execop in LLB), the user can use a shell to rerun the command and keep tweaking it. This happens in an environment identical to the one the execop runs in; for example, this means access to secrets, ssh, cache mounts, etc. They can also inspect the environment variables and files in the system that might be causing the issue. Using control commands, a user can switch between the broken state left behind by the failed command and the initial base state for that command. So if they try many possible fixes and end up in a bad state, they can just restore the initial state and start again.

If the error happened on a copy (or another file operation like rm), they can run ls and similar tools to find out why the file path is not correct.

For implementation, this depends on https://github.com/moby/buildkit/issues/749 for support to run processes on build mounts directly without going through the solver. We would start by modifying the Executor and ExecOp so that, instead of releasing the mounts after an error, they return them together with the error. I believe typed errors support https://github.com/moby/buildkit/pull/1454 can be reused for this. They should be returned up to the client Solve method, which can then decide to call llb.Exec with these mounts. If the mounts are left unhandled, they are released with the gateway API release.
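
A hypothetical Go sketch of the shape this could take — none of these names are BuildKit's actual API; they only illustrate a typed error that keeps the failed step's mounts alive so the client's Solve handler can exec into them (or release them if unused):

package debug

import "fmt"

// MountRef stands in for a reference to a build mount (rootfs, cache mount, secret, etc.).
type MountRef interface {
	ID() string
	Release() error
}

// ExecError is what ExecOp could return instead of releasing its mounts on failure.
type ExecError struct {
	Err    error      // the original failure from the RUN step
	Mounts []MountRef // mounts kept alive for debugging
}

func (e *ExecError) Error() string {
	return fmt.Sprintf("exec failed: %v (%d mounts retained)", e.Err, len(e.Mounts))
}

func (e *ExecError) Unwrap() error { return e.Err }

// Release frees the retained mounts if the client never starts a debug exec,
// mirroring the "released with the gateway API release" fallback described above.
func (e *ExecError) Release() error {
	for _, m := range e.Mounts {
		if err := m.Release(); err != nil {
			return err
		}
	}
	return nil
}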

Once debugging has completed and the user has made changes to the source files, it is easy to trigger a restart of the build with exactly the same settings. This is also useful if you think you might be hitting a temporary error. If the retry didn't fix it, the user is brought back to the debugger.

It might make sense to introduce a concept of "debugger image" that is used as a basis of the debugging environment. This would allow avoiding hardcoded logic in an opinionated area.

Later this could be extended with a step-based debugger, and source mapping support could be used to make source code changes directly in the editor or to track dependencies in the build graph.

@hinshun

hinshun commented 4 years ago

Regarding the "debugger image", my colleague @slushie did some interesting work on sharing a mount namespace (partial containers) with an image that has debugging tools: https://github.com/slushie/cdbg

In that repository, there's a prototype of gdb in the debugging image, attaching to the process of a running container.

This may be useful to debug scratch images or minimal images that may not have the basic tools like a shell binary.

fuweid commented 4 years ago

/cc

tonistiigi commented 3 years ago

@coryb Now that Exec support has landed, how big a job do you estimate it to be to return the typed errors from execop/fileop that would allow running exec from the error position and from the start of the op? Wondering if we should target that for v0.8 or not. We could potentially continue working on the client-side UX after v0.8 is out. Already added #1714 to v0.8, which I think is a requirement.

coryb commented 3 years ago

I am working on #1714 now; I am guessing a week or more before I have something viable for that.

I have not really looked into the changes required for this yet. I think @hinshun has some ideas and is generally more familiar with this than I am. I will sync up with him and maybe twist his arm to help out 😄 I think we can try to break down what is remaining for this and come up with some estimates.

ag-TJNII commented 3 years ago

Using --debugger on a build, should that build error, will take the user into a debugger shell similar to the interactive docker run experience. There the user can see the error and use control commands to debug the actual cause.

Interactive shells being the only option is going to leave much to be desired when building in CI pipelines. I often use Docker in CI pipelines where the build command has no terminal to drop to, or is a direct API call; having the only option be "run interactive" is not in line with current automated build best practices. Please consider an option to allow sideband inspection of BuildKit layers, similar to how the legacy docker build works. Thanks.

lyager commented 3 years ago

I've just upgraded Docker for Mac, which uses BuildKit as its default engine. Not feeling very comfortable with the suggested nsenter solution since that project is deprecated (or at least marked 'read-only'). Just wanted to give a +1 for getting this fixed. --debugger sounds like a great solution, maybe even letting it switch directly into an interactive shell when a build step fails.

lyager commented 3 years ago

Just wanted to follow up: changing the backend while building works for me: DOCKER_BUILDKIT=0 docker build . - but I must admit the speed of using BuildKit is nice!

JoelTrain commented 3 years ago

I agree. Having the image of the layer immediately prior to the issue makes it incredibly handy to run an interactive container right before the problem and poke around.

I guess for now I will run DOCKER_BUILDKIT=0 docker build . as a workaround when debugging new Dockerfiles, so that I can get the image IDs in the output again:

Step 2/12 : WORKDIR /usr/src/app
 ---> Running in 14307a565858
Removing intermediate container 14307a565858
 ---> 472b33608107
Step 3/12 : COPY ./package.json .
 ---> 40293e6966f5
Step 4/12 : COPY ./package-lock.json .
 ---> e91be6e9c9c6
Step 5/12 : RUN npm install
 ---> Running in dc762b24b192

$ docker run -it --rm e91be6e9c9c6 sh
/usr/src/app #

gtmtech commented 3 years ago

Is there any solution in this space yet that doesn't involve nsenter or regressing to DOCKER_BUILDKIT=0? I can't quite believe that it's coming up on 2 years since https://github.com/moby/buildkit/issues/1053 was raised and nobody has been able to debug Docker BuildKit builds since - it sounds like about as common a use case as you could get.

I can't find any example of active work to resolve this issue; I might step in and help out if there's nothing in the pipeline.

tonistiigi commented 3 years ago

I don't know what you mean by the nsenter solution, but that is not recommended. What you can do is create a named target at the position in the Dockerfile you want to debug, build that target with --target, and run it with docker run.
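
A minimal sketch of that workaround, assuming the Dockerfile has a named stage (here called "prep") that ends just before the failing instruction:

docker build --target prep -t debug .
docker run -it --rm debug sh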

matt2000 commented 3 years ago

Just chiming in with a user perspective: after being put in a new environment where BuildKit appears to be the default, this is a decidedly worse experience than in the past. Clearly the layers are being cached. I'd guess the simplest solution with a "backward compatible user experience" might be to automatically export the last cached layer to the image store and display its hash whenever there is an error in docker build. Named targets for debugging feel like an awkward misuse of the feature, since the old way was "automatic."

strelga commented 3 years ago

@tonistiigi Do you plan to take this issue into development in the near future? Does it have any blockers now?

itcarroll commented 3 years ago

The --target option is not recognized by docker-compose build (version 1.28.5), so I'm sadly resorting to DOCKER_BUILDKIT=0.

KevOrr commented 3 years ago

The --target option is not recognized by docker-compose build (version 1.28.5), so I'm sadly resorting to DOCKER_BUILDKIT=0.

IIRC, when using Compose, target is a field in the build: subsection of a service definition (a minimal example is sketched below).

edit: https://github.com/compose-spec/compose-spec/blob/master/build.md#target
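
A minimal Compose sketch of that field (service and stage names are illustrative):

services:
  app:
    build:
      context: .
      target: builder   # name of the Dockerfile stage to stop at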

willemm commented 3 years ago

The proposed option mentioned in #1053, where you can specify that it should create the image even on failure, would be very helpful. It would even be helpful if you could just enhance the --output option with a flag so that it also outputs on failure.

emmahyde commented 3 years ago

This would be fantastic. It's the only thing holding me back from moving over to buildkit full time!

NicolasDorier commented 3 years ago

Just want to say that it is VERY painful not to be able to interactively debug intermediate images... It really turns a 5-minute debugging problem into a 2-hour-long process...

cburgard commented 3 years ago

After switching to BuildKit recently because of the secret-mount option, I've just spent about half an hour trying to figure out what magical command I need to show the images in the BuildKit cache, the apparent answer being "it's not possible". I find it hard to believe that this issue still persists...

tonistiigi commented 3 years ago

You can add a multi-stage split anywhere in your Dockerfile and use --target to build the portion you want to turn into a debug image.

hraban commented 3 years ago

A temporary workaround is docker-compose, which (as of writing, v1.29.2) still doesn't use BuildKit when you do docker-compose run. You can create a simple docker-compose file with context: ., then use docker-compose run --rm yourservice, which will try to build it and print hash IDs along the way. But if you use docker-compose build, it already uses BuildKit, so this workaround is most likely on its way out. As is docker-compose itself, IIRC?

Ghoughpteighbteau commented 2 years ago

Just a bit of information for people who are trying to figure out how to enter a debug state. It may be helpful to spell out tonistiigi's workaround! If you're just figuring Docker out, it might not be obvious what they mean. Here's a quick guide:

Let's say you have this Dockerfile:

FROM archlinux:latest

# Initial package load
RUN pacman -Syu --noconfirm
RUN pacman -S --needed --noconfirm git base-devel sudo bash fish

RUN explode

# User
RUN useradd -m user\
 && echo "user ALL=(ALL) NOPASSWD:ALL" >/etc/sudoers.d/user
USER user
WORKDIR /home/user

I run docker buildx build --load -t arch . to build it, but it blows up at RUN explode. I want to debug it.

First, modify the starting FROM like this:

FROM archlinux:latest as working

Then add this right before the breakpoint:

FROM working
RUN explode

Now just run docker buildx build --load --target working -t arch . && docker run -it arch sh

Now you're in, right before the command that blew up. Hope that helps with debugging!

Aposhian commented 2 years ago

Even if it is not yet possible to run containers on intermediate layers in the BuildKit cache, is there a way to extract the cache layers to view as a filesystem diff?

alexanderkjeldaas commented 2 years ago

I added this comment on #1470, as I don't think this issue fully represents the problem identified by #1470. Basically, multi-stage builds where multiple images should be exported are possibly not common, but they are a very useful technique for speeding up CI/CD builds.

This requires not debugging support, but something more akin to a Dockerfile command to explicitly push a stage, support for multiple --target parameters, or similar.

Running multiple docker build invocations with different --target options does not work, as it is not composable.

chrisawad commented 2 years ago

This can give you a look at the point after a successfully completed stage:

DOCKER_BUILDKIT=1 docker build --target <stage> -t test .
docker run --rm -it test bash

But unlike with DOCKER_BUILDKIT=0, I don't think there's a way to see the hash for each layer created in the image, so you can't just jump in right before the error and test at the moment of failure.

Highly unfortunate, and a big deal if you ask me!

kingbuzzman commented 2 years ago

$ docker --version
Docker version 20.10.14

DOCKER_BUILDKIT=0 docker build .. doesn't seem to work anymore. I no longer get the hashes.

ktock commented 2 years ago

FYI:

I've recently implemented an experimental interactive debugger for Dockerfile: buildg https://github.com/ktock/buildg

Also in buildx, discussion is ongoing towards interactive debugger support and UI/UX: https://github.com/docker/buildx/issues/1104

terekcampbell commented 1 year ago

It's been quite some time since there's been movement here. Can we get an update on this?

ptrxyz commented 1 year ago

I fully support the idea of getting the hashes of each layer back. Maybe a good compromise would be to at least display the hash of the layer a failing command was run in?

rfay commented 1 year ago

Hashes of each layer would help so much.

Derekt2 commented 1 year ago

Still using DOCKER_BUILDKIT=0 to get image layer hashes. Why not at least give the hashes when --progress=plain is specified?

TBBle commented 1 year ago

Because it's not simply "give the hashes": those hashes (i.e. what you see in the legacy builder) do not exist until the export stage of the build. Generating them by exporting each layer into an image as it's built would be a non-trivial operation that makes BuildKit slower for everyone, and it would require redesigning the BuildKit build process to know about and use the chosen image exporter much earlier in the build than it does now.

As mentioned earlier, the solution for your actual problem (debugging failed builds in docker buildx) is being worked on over in https://github.com/docker/buildx/issues/1104; PR6 landed last month, and PR7+8 are currently under review.

Given that the BuildKit work to implement debugging was completed almost a year ago (Exec in the gateway API, and resolving and passing up content IDs to the client when a build fails), I'd suggest closing this issue and redirecting people to follow the remaining work in buildx, as it does not seem like there's remaining scope for productive discussion in this ticket.

mmerickel commented 1 year ago

I just want the hash of the last layer built prior to the failure. Don't need the hash of every layer exported.

TBBle commented 1 year ago

That's what https://github.com/moby/buildkit/issues/1472#issuecomment-941628522 does now, by making the "last layer" the final layer, so BuildKit can export an image, since that's all it knows how to do. Anything more would only be workable when BuildKit is being used with Docker directly (and knows it), and buildx exists to contain those cases.

What other use do you have for intermediate image generation and hash output that isn't hand-implementing https://github.com/docker/buildx/issues/1104 and isn't trying to build https://github.com/moby/buildkit/issues/1472#issuecomment-941628522 directly into BuildKit instead of buildx?

willemm commented 1 year ago

My use case is actually to access the test report files after a failed unit test step. At the moment we use a separate target that has the unit test as the last step, with a "|| echo failed" at the end so it always succeeds and we have an image to extract the test report from. But that requires building the Dockerfile twice in each build, and specially tuning all the Dockerfiles to support this. So access from an automated script to the build state/files after a failed build would be very useful.
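
A minimal sketch of that pattern, assuming a hypothetical test stage (stage names, commands, and paths are illustrative):

FROM build AS test
# "|| echo failed" forces the step to succeed so an image containing the reports still exists
RUN npm test || echo failed

The stage is then built separately with docker build --target test -t testimage . and the report files are copied out of the resulting image, which is the second build the comment refers to.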

TBBle commented 1 year ago

Okay, so that's a use case that isn't supported by the legacy builder either, AFAIR; it never created an image out of a failed step.

I hope you'll be pleased to know that PR8 of https://github.com/docker/buildx/issues/1104 is implementing both "execute in container at the start of the failed step" (similar to the legacy-builder "write down the layer ID and docker run it") and "execute in container after the failed step" (new, and the default) in the monitor via the proposed docker buildx build --invoke=on-error, so I expect you can get access to those files through this. It's currently being worked on (and you can see a more detailed usage example) in https://github.com/docker/buildx/pull/1640.
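
If it lands as described, usage might look roughly like this (hedged: the flag below was still a proposal at the time, and gating it behind BUILDX_EXPERIMENTAL=1 is an assumption):

BUILDX_EXPERIMENTAL=1 docker buildx build --invoke=on-error .
# on failure, the monitor opens a shell in the failed step's container,
# where the test report files can be inspected or copied out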

Based on this work, it would probably also be possible to implement in buildx something that can actually export an image from either the start or end of a failed step, since (I think) BuildKit now sends enough information on failure for buildx to request an image export of the container state, and buildx has enough information to tell BuildKit where to send such an image.

I don't immediately see an open feature-request in buildx for that, and I suspect it wouldn't be worked on until https://github.com/docker/buildx/issues/1104 is completed (since the work heavily overlaps).

It's also possible that I'm wrong and the infrastructure that supports https://github.com/docker/buildx/issues/1104 is not sufficient to support buildx exporting either or both of the before and after images of a failed build step.

So yeah, I suggest you open a feature request for your use case on buildx and see what the buildx maintainers think. (I'm not a buildx maintainer; I'm not super familiar with that codebase, and I have no particularly strong prediction on what they'll think of it. I hope they like it; it seems useful to me for, e.g., tests-run-during-container-build workflows.)

willemm commented 1 year ago

True, legacy didn't support that either. I was just throwing it out there as a use-case, and I am indeed pleased to know that information about PR8, thank you ^^

opinionmachine commented 1 year ago

So my use case is to use docker build to run all the package restore, build, and test steps (including coverage, static code analysis, static security analysis, etc.) and finally put the built artifact in a lightweight image. The only issue is that I'd need to access the test output from the intermediate layer to push to the CI system, and that is possible with DOCKER_BUILDKIT=0 but, as far as this discussion goes, not possible with BuildKit. Now I'm all for performance, but I'd love it if it were possible to label and publish an intermediate layer manually for this specific case. Otherwise I need multiple Dockerfiles, like a barbarian.

tonistiigi commented 1 year ago

You can use https://github.com/moby/buildkit/issues/1472#issuecomment-941628522 instead of multiple Dockerfiles. Or you can PR a change that adds an option to stop at a specific Dockerfile line.

opinionmachine commented 1 year ago

You can use #1472 (comment) instead of multiple Dockerfiles. Or you can PR a change that adds an option to stop at a specific Dockerfile line.

I don’t know how you do test coverage and test results, but I’d like to have the output every run, not just when tests break.

tonistiigi commented 1 year ago

If your case is that you want to build multiple things (stages) and push their results to different locations, not only your final build result, then you can look into docker buildx bake https://docs.docker.com/build/bake/reference/ . Define all the points you want to access as separate targets, and a single command will build them all together and push them where needed.
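
A minimal docker-bake.hcl sketch of that setup (stage names, tags, and output destinations are illustrative assumptions):

group "default" {
  targets = ["app", "test-reports"]
}

target "app" {
  context = "."
  target  = "runtime"
  tags    = ["registry.example.com/app:latest"]
}

target "test-reports" {
  context = "."
  target  = "test"
  output  = ["type=local,dest=./test-reports"]
}

Running docker buildx bake (optionally with --push) then builds both targets in one invocation and sends each to its own destination.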

tonistiigi commented 1 year ago

There are some new (experimental for now) debug options in new buildx release candidate: https://github.com/docker/buildx/releases/tag/v0.11.0-rc1

andyneff commented 1 year ago

If your case is that you want to build multiple things (stages) and push their results to different locations, not only your final build result, then you can look into docker buildx bake https://docs.docker.com/build/bake/reference/ . Define all the points you want to access as separate targets, and a single command will build them all together and push them where needed.

I finally needed to use the experimental debug invoke, and I really like how it works! I hope it gets added to the bake command too, eventually. (And this too)

shapirus commented 10 months ago

So, considering all the experimental features, is there now a possibility to run a command (typically a shell) inside a build container?

With the normal builder, I can run docker ps, get the build container's ID from the output, then run docker exec -it <id> sh and get a shell running inside that container to inspect or run whatever I need there.

Does BuildKit support this in any way, other than running an ssh reverse tunnel from inside the container in a RUN build step? It would be nice for it to support this before the normal builder is removed.

TBBle commented 10 months ago

@shapirus Does https://github.com/docker/buildx/blob/v0.11.2/docs/guides/debugging.md do what you want? The BuildKit-side requirements (the low-level bits) are implemented; the buildx side is being built out, was shipped experimentally in buildx 0.11 and hence Docker Desktop 4.22.0, and is looking for feedback at https://github.com/docker/buildx/issues/1104.

I'd suggest trying buildx 0.12.0-rc1 if you're interested in this feature, as the command-line was changed and the relevant docs are now at https://github.com/docker/buildx/blob/v0.12.0-rc1/docs/guides/debugging.md. That way any feedback you give is relative to the current state of development.

jedevc commented 10 months ago

@tonistiigi does it make sense to close this issue, now that we're tracking things in https://github.com/docker/buildx/issues/1104 and via the area/debug tag on buildx?

shapirus commented 10 months ago

Does https://github.com/docker/buildx/blob/v0.11.2/docs/guides/debugging.md do what you want?

Yes, from what I read there, it should solve it, as far as practical use cases are concerned. Thanks for the hint.