Closed: @hausdorff closed this issue 1 year ago
I've put @lukehoban on this and put it in M18, but those should be thought of as suggestions. :)
Adding a bit more context on why this is a (big) issue imho:
I'm working on making pulumi our go-to tool for all kinds of k8s builds and deployments. I opted to create a pulumi stack in each repo / app that should be deployed to k8s. Each stack then contains the docker build AND the k8s manifests, expressed as a single pulumi program. As a result all of our pulumi stacks have at least 1 dockerfile, and one of them actually has 10 dockerfiles.
The issue above causes multiple problems which are detrimental to the pulumi UX, especially in the repo with 10 dockerfiles:

1. `pulumi preview` and `pulumi up` get very slow. This is particularly a problem for `pulumi preview`, which users expect to be reasonably fast. The slowness is caused by the slowness of docker build itself: with `cacheFrom: true`, pulumi always runs a `docker pull <imagename>`, even during preview, which causes docker to attempt to download all layers. It takes a few seconds for docker to conclude that it already has all the layers locally.
2. `docker build` with all layers cached still takes a few milliseconds to seconds per layer, especially if the layers are large.
3. The `diagnostics` section in pulumi produces 3 lines per cached layer to inform the user that the build step was taken from the cache. In the case of our repo with 10 dockerfiles, with ~11 layers each, that means we get more than 330 lines of noisy, not particularly useful diagnostics log output on every preview. The diagnostics output contains hundreds of lines which look like this:
```
docker:image:Image: foo-bar
info: Sending build context to Docker daemon 125.6kB
Step 1/11 : FROM node:8
 ---> 6f62c0cdc461
Step 2/11 : ARG NPM_TOKEN
 ---> Using cache
 ---> c0801f54dc29
Step 3/11 : WORKDIR /app
 ---> Using cache
 ---> c53511307e50
Step 4/11 : COPY package.json package-lock.json ./
 ---> Using cache
 ---> 7313d20117f9
Step 5/11 : RUN echo "//registry.npmjs.org/:_authToken=${NPM_TOKEN}" >> ~/.npmrc && npm install
 ---> Using cache
 ---> 6024564eef28
Step 6/11 : ENV TS_NODE_TRANSPILE_ONLY=true APP__NOCLUSTER=true
 ---> Using cache
 ---> 920a379cc6e3
Step 7/11 : COPY package.json package-lock.json* tsconfig.json* ./
 ---> Using cache
 ---> 2c52eae743bf
Step 8/11 : COPY conf* config* conf/
 ---> Using cache
 ---> b4c2f9ff0437
Step 9/11 : COPY conf* config* config/
 ---> Using cache
 ---> 51f8ada01947
Step 10/11 : COPY src src
 ---> Using cache
 ---> e3056cb71320
Step 11/11 : ENTRYPOINT npm start
 ---> Using cache
 ---> c6bc04e903dc
Successfully built c6bc04e903dc
Successfully tagged gcr.io/foo/bar:latest
```
`pulumi up` is even slower because the `docker push` step also takes a couple of seconds until docker concludes that all layers are already present in the remote registry.

Altogether this single issue has a large negative impact on the pulumi docker / k8s UX and should be prioritized accordingly.
To be clear: I really love pulumi, but personally I consider this particular issue the biggest meh of pulumi compared to other k8s dev tools.
Tools which do a much better job at this (i.e. they avoid rebuilding dockerfiles every time) include: skaffold, forge.sh and possibly (haven't tested myself): draft, devspace. Those tools certainly have a different (and smaller) scope than pulumi, but I'd like to avoid adding additional tools to the mix if pulumi can be tweaked to be fast enough for the inner dev cycle by itself.
Side note: In a perfect world docker itself would be a bit smarter and faster about this, but I don't think that's going to happen anytime soon. Docker 18.06 brings experimental support for BuildKit as a next-gen docker build implementation, which goes in the right direction, but there are still a lot of things to solve (e.g. faster / more intelligent remote caching instead of an explicit docker pull, etc.).
I'm also pasting here an extract of a private conversation I had with @hausdorff a few days ago from which this issue originated:
geekflyer wrote:
While I think pulumi is good in terms of feedback loop, I don't think that this is really pulumi's strength as of now. From my perspective pulumi currently has a reliable deployment model with automatic built-in feedback that is very well suited for automated deployments via CI / CD etc. But for local development / iteration with quick deployment to k8s I think there are better tools out there. To be concrete: skaffold, forge.sh and draft. We currently use forge.sh for the few k8s apps we deploy in production on k8s. forge.sh doesn't have a semantic understanding of k8s objects, but it has a built-in docker build+push > deterministic auto-image-tagging > inject tags into yaml > deploy to k8s workflow. Usually I use it in conjunction with `kail` to see if a pod is up and running and what its logs say. Skaffold is similar to that but even better in some aspects:
- Skaffold has somewhat of an understanding of the k8s artifacts and in turn doesn't necessarily redeploy everything if you just changed a single manifest; it also ensures that pods are starting successfully and shows their logs.
- Skaffold has a built-in file watch mode which auto redeploys on any file change
In general the feedback loop with skaffold is really short, but it’s by far not as flexible as pulumi.
I think some things where pulumi could improve - inspired by skaffold would be:
- some sort of watch mode
- e2e docker build > deploy workflow with automatic image tag management (I can already build and deploy with pulumi in one workflow right now, but the image tag management isn't taken care of automatically yet)
- don’t run docker build at all, if nothing in a dockerfile’s context has changed. Reason: Even if all layers are cached, docker build actually takes a couple of seconds, that’s why skaffold for example avoids repeating that step.
@hausdorff already clarified that 2. is already being partially taken care of by using the `imageName` output of `docker.Image`, which contains the image sha256.
Related (just a subset but interesting): https://github.com/pulumi/pulumi-cloud/issues/183
As someone who frequently demos the Docker build support, often with awkward pauses in the middle, I am a huge fan of pursuing these optimizations :smile: 👍
relates to https://github.com/pulumi/pulumi/issues/2052
Moving out. We don't have a plan for this for M18. One thing we are considering (@hausdorff to fill in more details) is that we may move to a provider model for this, where we can call into the docker APIs directly. As part of that, we're considering an opt-in/opt-out model where we can have a behavior that tries to infer whether a change will happen just by examining the local file system. In other words, if the local file system is unchanged, you will be able to opt into (or out of) a behavior where we assume that means a docker build won't produce anything different.
This has to be something under user control though as this is simply a weak approximation. A docker build may always end up producing a new image, even if nothing on disk changed.
With this, a user could then decide between saying "i always want docker to run, to get the most accurate results" vs "i'm ok assuming that running docker will not change anything if all my files on disk are ok".
With this approach, the cost then just comes down to checking for changes on disk. In general, this is something we already do a good enough job with, since we have to do it for any sort of update; it's how we determine, for example, what to upload for an AWS Lambda. This should be less work overall than what docker does, resulting in a boost for this scenario.
Giving to @hausdorff as he felt he had the best understanding of what to do here, esp. with his existing understanding of how to create a custom provider.
We're going to take on the docker work in m20.
Even the ability to use the `cacheFrom` arg in the docker.Image class would be helpful. It seems like we currently have to choose between the more idiomatic docker.Image resource and the docker.buildAndPushImage function.
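For reference, a minimal sketch of what the requested option could look like on the resource. Later versions of `@pulumi/docker` did grow a `build.cacheFrom` option, but treat the exact shape here as an assumption and check the provider version you're on:

```typescript
import * as docker from "@pulumi/docker";

// Hypothetical sketch, assuming build.cacheFrom is available on
// docker.Image in your @pulumi/docker version.
const image = new docker.Image("foo-bar", {
    imageName: "gcr.io/foo/bar:latest",
    build: {
        context: "./app",
        // Pull the previously pushed image and reuse its layers as cache.
        cacheFrom: true,
    },
});

export const imageName = image.imageName;
```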
Improvement on this front would be really cool :+1: I keep circling back to try Pulumi and every time I do I run into this issue, and remember why I gave up on Pulumi before :smile: Hopefully this improves at some point.
Any updates on this? Building images is painfully slow, while the rest of the functionality we are evaluating is performing well enough. Building a certain image that consists of a number of large layers that rarely change takes about 2 minutes using the docker command line:
In Pulumi it consistently takes more than 12!:
Also a real problem for us - it's generally taking about 15-20 minutes for pulumi up to build our Docker images locally, where just doing a docker build takes 2-3.
I forgot to cross-post in here, but one option is to use the `RegistryImage` resource (see https://github.com/pulumi/pulumi-docker/issues/132#issuecomment-812234110 for more details on using it as a replacement for `Image`).
There's also a use case where it'd be great to pull an image to populate the cache from a different registry than where it's ending up. As an example:
From what I can tell, this is not currently possible? `cacheFrom` only seems to work with the destination registry.
@geekflyer et al., this issue is resolved with the new implementation of the Docker Image resource in v4! See our blog post for more info: https://www.pulumi.com/blog/build-images-50x-faster-docker-v4/
From a user: