moby / buildkit

concurrent, cache-efficient, and Dockerfile-agnostic builder toolkit

S3 cache does not work in long time build #3903

Open breezewish opened 1 year ago

breezewish commented 1 year ago

Hi,

I'm using AWS CodeBuild with docker/buildx to build images, with S3 caches.

I discovered that, when the Dockerfile takes a long time to build (for example, more than 1 hour), the S3 cache export fails:

#14 exporting cache to s3
#14 preparing build cache for export
#14 preparing build cache for export 1.1s done
#14 ERROR: failed to check file presence in cache: operation error S3: HeadObject, https response error StatusCode: 400, RequestID: 4TA5ZWBP87VVWRV6, HostID: 5w0wITkTuGBfiilslW3FLFSpK1vpJL4+SND7JJfSEqW5hWV+jBiDQ28OPsdVPPyz61U8COsQ+ak=, api error BadRequest: Bad Request
------
> exporting cache to s3:
------
ERROR: failed to solve: failed to check file presence in cache: operation error S3: HeadObject, https response error StatusCode: 400, RequestID: 4TA5ZWBP87VVWRV6, HostID: 5w0wITkTuGBfiilslW3FLFSpK1vpJL4+SND7JJfSEqW5hWV+jBiDQ28OPsdVPPyz61U8COsQ+ak=, api error BadRequest: Bad Request

When the Dockerfile takes a shorter time to build (for example, 20 minutes), the S3 cache export succeeds:

#14 exporting cache to s3
#14 preparing build cache for export
#14 writing layer sha256:66eb4459daf389acf01507afdc8386ac4963bfc1dd5d19adb352cd6324daf3b8
#14 writing layer sha256:66eb4459daf389acf01507afdc8386ac4963bfc1dd5d19adb352cd6324daf3b8 0.4s done
#14 writing layer sha256:96d61c37949b0e8155d6f6198bd17bdd1168d2d97cb04ad081c2bf08dfd5278d
#14 writing layer sha256:96d61c37949b0e8155d6f6198bd17bdd1168d2d97cb04ad081c2bf08dfd5278d 0.3s done
#14 preparing build cache for export 5.7s done
#14 DONE 5.7s

I suspect it might be caused by the default 1-hour session duration; however, I cannot find a way to extend it. Using an access key and secret key (AK/SK) would not help in my case, because our security policy disallows AK/SK and we must use passwordless authentication.
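For context, CodeBuild hands the build container its role credentials through the standard container credentials endpoint, so the session expiry can be checked directly (assuming AWS_CONTAINER_CREDENTIALS_RELATIVE_URI is set, as it normally is in CodeBuild):

# prints AccessKeyId, SecretAccessKey, Token and an Expiration timestamp for the injected session
curl -s "http://169.254.170.2$AWS_CONTAINER_CREDENTIALS_RELATIVE_URI"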

Thanks!

Example reproduction:

Dockerfile:

# syntax=docker/dockerfile:1

FROM centos:7 AS centos-base

RUN --mount=type=cache,target=/var/cache/yum,sharing=locked \
    yum install -y epel-release centos-release-scl

# long-running step to push the total build time past the ~1h credential lifetime
RUN sleep 65m

RUN --mount=type=cache,target=/var/cache/yum,sharing=locked \
    yum install -y curl wget \
    && yum update -y ca-certificates

CodeBuild spec (buildspec.yml):

version: 0.2

phases:
  install:
    runtime-versions:
      docker: 20
    commands:
      - docker version
      - curl -JLO https://github.com/docker/buildx/releases/download/v0.10.4/buildx-v0.10.4.linux-amd64
      - mkdir -p ~/.docker/cli-plugins
      - mv buildx-v0.10.4.linux-amd64 ~/.docker/cli-plugins/docker-buildx
      - chmod a+rx ~/.docker/cli-plugins/docker-buildx
  build:
    commands:
      - docker buildx create --use --driver=docker-container
      - |
        docker buildx build ./test-timeout \
          --cache-from type=s3,bucket=...,region=us-east-1,name=codebuild-exp \
          --cache-to type=s3,bucket=...,region=us-east-1,name=codebuild-exp,mode=max
janekmichalik commented 1 year ago

@breezewish are you using IAM policies? I got this error when the auth session/credentials had expired, and I bet it is exactly the same issue going on here. Find a way to extend the auth session/credential duration.

breezewish commented 1 year ago

> @breezewish are you using IAM policies? I got this error when the auth session/credentials had expired, and I bet it is exactly the same issue going on here. Find a way to extend the auth session/credential duration.

Yes, it should be the same issue. However, due to IAM role chaining, it does not seem possible to simply open a new session with a longer duration for Docker inside the CodeBuild environment.
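To illustrate the constraint (a sketch with a hypothetical role ARN): the build's own credentials are already an assumed role, so any further assume-role call from inside CodeBuild counts as role chaining and is capped at one hour.

aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/longer-cache-session \
  --role-session-name buildkit-s3-cache \
  --duration-seconds 7200
# rejected with a ValidationError along the lines of:
# "The requested DurationSeconds exceeds the 1 hour session limit for roles assumed by role chaining"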

itsmonktastic commented 1 year ago

I've been experiencing this too today, doing something like this:

docker buildx create --driver=docker-container --use
export AWS_PROFILE=<profile>
docker buildx build --cache-to 'type=s3,...'

The profile backs onto a credential_process that obtains short-lived credentials valid for 15 minutes. This was working great, but I started bumping into the expiry time of the credentials issued to our internal CI system and received:

#109 exporting cache to s3
#109 preparing build cache for export
#109 preparing build cache for export 53.8s done
#109 ERROR: failed to check file presence in cache: operation error S3: HeadObject, https response error StatusCode: 400, RequestID: AYJYEKEBKVJ72ARX, HostID: AumhGqp2ttKYlyJJjMvybDZJM1AGVvLaR6a64utX4qo0Cfz7HqRmvMst8fnbErBpnuBdDESASY323cPlWfUpoA==, api error BadRequest: Bad Request

I am not a seasoned reader of Go code or the moby ecosystem, but I think the issue might be that buildkitd is the piece doing the S3 export, relying on credentials (key ID, secret, session token) passed to it, generated one time by buildx here: https://github.com/docker/buildx/blob/687feca9e8dcd1534ac4c026bc4db5a49de0dd6e/util/buildflags/cache.go#L102
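If I read that right, the effect is the same as resolving the credential chain once on the client and passing static values as cache attributes, roughly like this (sketch only; access_key_id, secret_access_key and session_token are the documented S3 cache backend parameters):

docker buildx build . \
  --cache-to type=s3,bucket=...,region=...,name=...,mode=max,access_key_id=...,secret_access_key=...,session_token=...

Once that session token expires mid-build, buildkitd has no way to refresh it.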

I think buildkit itself is technically not to blame here: reading the buildkitd code, it makes fairly straightforward use of the AWS SDK, which should generate and refresh credentials as needed. I think it is the buildx -> buildkitd interaction that has the problem.

I wonder what might be a fix for this? Perhaps there is some other approach that would let me use buildkitd without this issue? It seems that either buildx and buildkitd need some back and forth to refresh credentials, or it would be necessary to mount the AWS shared credentials file (and the tool used by credential_process) to the buildkitd container, so it can generate its own credentials dynamically instead