Open Rongronggg9 opened 2 years ago
@Rongronggg9 #1044 It seems that the cache of the later push overwrites the cache of the previous platform's push, so only the cache of the last platform can be saved.
@mytting I've already checked that issue before, but I believe this issue differs from it:
In fact, I doubt that there are some undocumented limitations on the registry cache preventing all caches from being saved. ~If I do some tricks to shrink the cache size, the problem will probably be solved.~ Some of my repositories have similar workflows but never face this issue, while the repository mentioned in the issue description does.
I've figured out a workflow to reproduce this issue easily. Please check https://github.com/Rongronggg9/RSSHub/tree/reproduce-buildkit-cache-issue
Run attempt 1: https://github.com/Rongronggg9/RSSHub/actions/runs/2248798559/attempts/1
| cache type | builder cache | exported cache [^1] | build time | rebuild time | cache-miss platform(s) |
|---|---|---|---|---|---|
| gha | 3.563G | 208M | 12m 58s | 10m 8s | `linux/arm64` |
| registry | 3.563G | 295M | 21m 44s | 6m 1s | / |
| registry (uncompressed) | 5.177G | 748M | 14m 34s | 12m 21s | `linux/amd64`, `linux/arm64`, `linux/arm/v7` (`ERROR: failed to authorize: failed to fetch oauth token: Post "https://auth.docker.io/token": EOF`) |
| local | 3.563G | 475M | 12m 42s | 6m 1s | / |
| local (uncompressed) | 5.177G | 1.5G (350M pushed) | 12m 29s | 9m 41s | `linux/arm64` |
Run attempt 2: https://github.com/Rongronggg9/RSSHub/actions/runs/2248798559/attempts/2
| cache type | builder cache | exported cache [^1] | build time | rebuild time | cache-miss platform(s) |
|---|---|---|---|---|---|
| gha | 3.563G | 294M | 14m 41s | 5m 56s | / |
| registry | 3.563G | 294M | 14m 46s | 6m 5s | / |
| registry (uncompressed) | 5.177G | 1295M | 13m 37s | 10m 29s | `linux/arm64` |
| local | 3.563G | 303M | 20m 52s | N/A [^2] | N/A |
| local (uncompressed) | 5.177G | 1.5G | 12m 20s | N/A [^2] | N/A |
The cache issue for the registry seems solved, right? That's probably because I've applied some tricks to shrink the cache size. What if we make it huge again? Please check: https://github.com/Rongronggg9/RSSHub/tree/reproduce-buildkit-cache-issue-huge
Run attempt 1: https://github.com/Rongronggg9/RSSHub/actions/runs/2248914216/attempts/1
| cache type | builder cache | exported cache [^1] | build time | rebuild time | cache-miss platform(s) |
|---|---|---|---|---|---|
| gha | 8.605G | 921M | 18m 36s | 11m 22s | `linux/arm64` |
| registry | 8.605G | 917M | 18m 14s | 11m 58s | `linux/arm64` |
| registry (uncompressed) | 12.25G | 3303M | 17m 45s | 11m 14s | `linux/arm64` |
| local | 8.605G | 1.1G | 18m 28s | 10m 31s | `linux/arm/v7` |
| local (uncompressed) | 12.25G | 1.9G (615M pushed) | 16m 57s | 13m 16s | `linux/arm64`, `linux/arm/v7` |
Run attempt 2: https://github.com/Rongronggg9/RSSHub/actions/runs/2248914216/attempts/2
| cache type | builder cache | exported cache [^1] | build time | rebuild time | cache-miss platform(s) |
|---|---|---|---|---|---|
| gha | 8.605G | 483M | 21m 37s | 12m 39s | `linux/arm64`, `linux/arm/v7` |
| registry | 8.605G | 917M | 19m 24s | 11m 31s | `linux/arm64` |
| registry (uncompressed) | 12.25G | 3319M | 26m 13s | 9m 58s | `linux/arm/v7` |
| local | 8.605G | 659M | 18m 31s | N/A [^2] | N/A |
| local (uncompressed) | 12.25G | 3.4G | 16m 40s | N/A [^2] | N/A |
What randomness!
It seems that some indeterminacy randomly prevents BuildKit (or maybe it's the fault of buildx?) from exporting the build caches of some platforms. Is the indeterminacy relevant to the cache type, or not? Probably not.
The strangest thing is that it also occurs with the `local` cache type. In theory, nothing should prevent BuildKit from exporting the build cache to a local path.
I believe that as long as the rebuild job is run immediately after the build job, the GitHub Actions cache has not been shrunk yet (besides, the build caches within each run attempt total less than 10GB, so even if shrinking happened, it would be the old entries that suffer, not these).
[^1]: For `local`, it is the exported cache size; for `gha` and `registry`, it is the total TX bytes on `eth0`.
[^2]: This does not have any meaning since the commit hash had not changed and the new cache was not pushed.
@Rongronggg9 The cache-to is inline, but the cache-from is registry. How does the cache get pushed to the registry? How does this part of the logic work? I don't understand. I was trying to solve the multi-builder cache problem with this:
```yaml
- name: Build and push Docker image (Chromium-bundled version)
  uses: docker/build-push-action@v2
  with:
    context: .
    build-args: PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=0
    push: true
    tags: ${{ steps.meta-chromium-bundled.outputs.tags }}
    labels: ${{ steps.meta-chromium-bundled.outputs.labels }}
    platforms: linux/amd64,linux/arm/v7,linux/arm64
    cache-from: |
      type=registry,ref=${{ secrets.DOCKER_USERNAME }}/rsshub:chromium-bundled
      # type=gha,scope=docker-release # not needed, Docker automatically uses local cache from the builder
      # type=registry,ref=${{ secrets.DOCKER_USERNAME }}/rsshub:buildcache
    cache-to: type=inline,ref=${{ secrets.DOCKER_USERNAME }}/rsshub:chromium-bundled # inline cache is enough
```
@mytting
Inline cache embeds cache metadata into the image config. The layers in the image are left untouched compared to an image with no cache information.
For more details, you need to know the cache hit/miss judging mechanism. Basically speaking, each step of each stage has its own metadata used by the mechanism. It is like a hash. For a `RUN` statement and the like, it is a change of the statement itself that determines cache hit or miss. For a `COPY` statement and the like, it is changes of the copied files that determine cache hit or miss. If a step in a stage faces a cache miss, all following steps in that stage are forced to miss as well.
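The chaining behaviour described above can be sketched in a toy model. This is only an illustration of how a miss propagates, not BuildKit's actual implementation; the hashing scheme here is made up:

```python
import hashlib

def step_key(parent_key: str, step_content: str) -> str:
    # A step's cache key depends on its own content AND its parent's key,
    # so a miss at any step invalidates every step after it in the stage.
    return hashlib.sha256((parent_key + step_content).encode()).hexdigest()

def stage_keys(steps):
    # steps: for a RUN, the statement itself; for a COPY,
    # a digest of the copied files (placeholders below).
    keys, parent = [], ""
    for s in steps:
        parent = step_key(parent, s)
        keys.append(parent)
    return keys

a = stage_keys(["RUN apt-get update", "COPY <src-digest-1>", "RUN npm run build"])
b = stage_keys(["RUN apt-get update", "COPY <src-digest-2>", "RUN npm run build"])
assert a[0] == b[0]  # first step unchanged: cache hit
assert a[1] != b[1]  # copied files changed: cache miss
assert a[2] != b[2]  # ...which forces a miss on every later step
```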
What inline cache does is push this metadata (the "hash") along with the image to the registry. However, since inline cache is incompatible with `max` cache mode, only the metadata of the last stage is pushed. Why? The built image only contains layers from the last stage. Without layers from previous stages, even if the cache hits, no layers can be reused since they simply do not exist. With `max` mode enabled, by contrast, every layer (that is, the result of every step of every stage in the `Dockerfile`), along with its metadata, is exported and pushed.
There are two use cases of inline cache.
If you are interested in https://github.com/DIYgod/RSSHub/blob/eb79456f402b268d8aa5a4f25060b7bc8b6d10f6/.github/workflows/docker-release.yml#L84-L97, let me tell you why non-last stages have their caches stored locally.
In the previous build step of the workflow job, all stages have been built and cached both locally and remotely (you may inspect local caches by executing `docker buildx du --verbose`). The build step you mentioned only changes the result of the last two stages in the `Dockerfile`.
Since the last stage copies files from the penultimate stage, a little metadata from the penultimate stage is somehow inline-cached. As a result, the penultimate stage is able to hit the cache. The caches of the remaining stages come from the local caches of the buildx builder, written in the previous build step of the workflow job (no need to specify `cache-from`; they are reused automatically). Thus, the build step you mentioned is able to have everything cache-hit.
If you are trying to work around the cache issue, inline cache may not be a good choice if you use GHA to build your image, since everything is cleared after the job finishes. Of course, unless your `Dockerfile` is single-staged. Just a reminder: https://github.com/DIYgod/RSSHub/blob/eb79456f402b268d8aa5a4f25060b7bc8b6d10f6/.github/workflows/docker-release.yml has not had any workaround applied, and this issue still occurs there from time to time. The workaround I've mentioned in https://github.com/docker/buildx/issues/1044#issuecomment-1120312230 deserves a try if you really need it.
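If you do need a cache that survives GHA job teardown with a multi-stage `Dockerfile`, a common alternative to inline cache is exporting a dedicated `max`-mode cache image to the registry alongside the real image. A sketch only; the `buildcache` tag name is illustrative:

```yaml
cache-from: type=registry,ref=${{ secrets.DOCKER_USERNAME }}/rsshub:buildcache
cache-to: type=registry,ref=${{ secrets.DOCKER_USERNAME }}/rsshub:buildcache,mode=max
```

Unlike inline cache, this keeps the metadata and layers of every stage, at the cost of pushing a separate (and larger) cache artifact.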
Investigating a related issue with buildx, I have found that manifest content for multi-platform images (amd64, arm64) randomly changes order. This comment is just intended as a possible pointer; it could be completely unrelated. Attached is a .diff of the two manifests:
```diff
--- /tmp/meta-538b4.json	2022-06-20 22:39:33.302897680 -0600
+++ /tmp/meta-80e8a.json	2022-06-20 22:39:57.467873367 -0600
@@ -3,24 +3,24 @@
   "manifest": {
     "schemaVersion": 2,
     "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
-    "digest": "sha256:538b4667e072b437a5ea1e0cd97c2b35d264fd887ef686879b0a20c777940c02",
+    "digest": "sha256:80e8a68eb9363d64eabdeaceb1226ae8b1794e39dd5f06b700bae9d8b1f356d5",
     "size": 743,
     "manifests": [
       {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
-        "digest": "sha256:cef1b67558700a59f4a0e616d314e05dc8c88074c4c1076fbbfd18cc52e6607b",
+        "digest": "sha256:2bc150cfc0d4b6522738b592205d16130f2f4cde8742cd5434f7c81d8d1b2908",
         "size": 1367,
         "platform": {
-          "architecture": "arm64",
+          "architecture": "amd64",
           "os": "linux"
         }
       },
       {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
-        "digest": "sha256:2bc150cfc0d4b6522738b592205d16130f2f4cde8742cd5434f7c81d8d1b2908",
+        "digest": "sha256:cef1b67558700a59f4a0e616d314e05dc8c88074c4c1076fbbfd18cc52e6607b",
         "size": 1367,
         "platform": {
-          "architecture": "amd64",
+          "architecture": "arm64",
           "os": "linux"
         }
       }
```
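To check whether two manifest lists like these differ only in entry ordering, one can compare them after sorting the per-platform entries. A small sketch, assuming the JSON layout shown in the diff above:

```python
def normalized_manifests(meta: dict) -> dict:
    """Return a copy of meta["manifest"] that is order-insensitive."""
    m = dict(meta["manifest"])
    # The top-level digest changes whenever the serialized order changes,
    # even if the set of per-platform manifests is identical, so drop it.
    m.pop("digest", None)
    # Sort entries by platform so ordering no longer matters.
    m["manifests"] = sorted(
        m["manifests"],
        key=lambda e: (e["platform"].get("os", ""),
                       e["platform"]["architecture"],
                       e["platform"].get("variant", "")),
    )
    return m

def same_except_order(meta_a: dict, meta_b: dict) -> bool:
    return normalized_manifests(meta_a) == normalized_manifests(meta_b)
```

Applied to the two JSON files from the diff, this would report them equal, confirming that only the ordering (and hence the list digest) changed.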
Hi, I've just observed somewhat similar behaviour here: I'm building different platforms in a matrix (because of https://github.com/docker/build-push-action/issues/826, probably due to BuildKit too), and the build that finishes last overwrites the cache of the previous ones.
My config:
```yaml
jobs:
  build-image:
    strategy:
      matrix:
        arch:
          - amd64
          - arm64
    runs-on: ubuntu-latest
    steps:
      [...]
      - name: Build image
        uses: docker/build-push-action@v4
        with:
          cache-from: type=gha
          cache-to: type=gha,mode=max
          context: .
          load: true
          platforms: linux/${{ matrix.arch }}
          tags: ${{ steps.login-ecr.outputs.registry }}/main/frontend:${{ github.sha }}-${{ matrix.arch }}
```
What I can observe across 3 runs with identical repo contents:

| run | amd | arm | notes |
|---|---|---|---|
| 1 | ++ | +++++ | empty caches |
| 2 | ++ | cached | arm is now cached |
| 3 | cached | +++++ | amd is now cached |
It would be very useful if the restore key preserved the arch so the caches would live separately (not that different arches should be mixed anyway, imho).
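Until something like that lands, a possible workaround (untested in this exact matrix setup) is the `scope` attribute of the `gha` cache backend, which gives each matrix job its own cache namespace; the `build-` prefix is illustrative:

```yaml
cache-from: type=gha,scope=build-${{ matrix.arch }}
cache-to: type=gha,scope=build-${{ matrix.arch }},mode=max
```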
Thank you
It seems that it sometimes works for registry cache and sometimes fails.
Works: https://github.com/renovatebot/docker-renovate/actions/runs/5530222002/jobs/10089343401
Fails: https://github.com/renovatebot/docker-renovate/actions/runs/5530341846/jobs/10090968895
It's arm64 most of the time (at least when I noticed it).
TL;DR
https://github.com/moby/buildkit/issues/2822#issuecomment-1113920626
In https://github.com/moby/buildkit/issues/2758#issuecomment-1088288408, @tonistiigi said that:

> If someone has something similar with a single node then create a separate issue with a reproducer

so here it is. I am facing almost the same issue, but what I use is GitHub Actions, which should only have a single-node builder.
In my case, I need to build for 3 platforms (`linux/amd64`, `linux/arm/v7`, `linux/arm64`). Each time, there are randomly 1 or 2 platforms entirely unable to fetch their cache, and they must start over.

**Build settings**

(I added linebreaks to make it readable.)

**The node**
**How to reproduce**

(In the reproducing run, only `linux/amd64` got its cache.) FYI:

**Workflow YAML** / **Dockerfile**
This issue seems to exist when the cache type is `gha`, but I am not sure whether this is because of the cache size limit of GitHub Actions.