ocurrent / docker-base-images

Generate various Docker ocaml images
https://images.ci.ocaml.org
MIT License

Retry on known flakey errors #211

Open tmcgilchrist opened 1 year ago

tmcgilchrist commented 1 year ago

Base image builder regularly errors on this transient issue:

failed to solve with frontend dockerfile.v0: failed to solve with frontend gateway.v0: rpc error: code = Unknown desc = ocurrent/opam-staging@sha256:90f036ba79b70d23d08aad99dcaaf9594bc953c9b9a16dd0aa7cee4894939512: failed to do request: Head "https://registry-1.docker.io/v2/ocurrent/opam-staging/manifests/sha256:90f036ba79b70d23d08aad99dcaaf9594bc953c9b9a16dd0aa7cee4894939512": dial tcp: lookup registry-1.docker.io: Temporary failure in name resolution
docker-build failed with exit-code 1
2023-03-02 15:27.29: Job failed: Failed: Build failed

It would be useful to immediately retry on known flakey errors.
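
A minimal sketch of that idea, written in Python for illustration only (the builder itself is OCaml, and none of these names come from its code base): run a step and retry it only when its output matches a known transient pattern. `run_with_retry`, `KNOWN_FLAKY`, and the retry limit are hypothetical; the single pattern is copied from the log above.

```python
import re
import subprocess
import time

# Hypothetical list of transient error signatures; this one is copied
# from the docker-build log quoted above.
KNOWN_FLAKY = [re.compile(r"Temporary failure in name resolution")]


def is_flaky(output: str) -> bool:
    """Return True if the output matches any known transient error."""
    return any(p.search(output) for p in KNOWN_FLAKY)


def run_with_retry(cmd, max_attempts=3, delay=0.0, run=None):
    """Run cmd, retrying only when the failure looks transient.

    `run` is injectable for testing; by default it executes cmd and
    returns (exit_code, combined stdout and stderr).
    Returns (exit_code, output, attempts_used).
    """
    if run is None:
        def run(c):
            proc = subprocess.run(
                c, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
            )
            return proc.returncode, proc.stdout

    attempt = 0
    while True:
        attempt += 1
        code, output = run(cmd)
        # Stop on success, on a failure that is not a known flake,
        # or once the attempt budget is used up.
        if code == 0 or attempt >= max_attempts or not is_flaky(output):
            return code, output, attempt
        time.sleep(delay)  # delay=0.0 means retry immediately
```

Failures that match no pattern are returned after the first attempt, so genuine build errors would still surface without added delay.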

Prerequisite

Known flakey errors

Flakey errors on docker-build:

Flakey errors on docker-push:

Flakey errors on docker authentication:

shonfeder commented 3 months ago

The most frequent category of error I've seen in my brief time monitoring this stuff so far is:

#66 exporting to image
#66 sha256:e8c613e07b0b7ff33893b694f7759a10d42e180f2b4dc349fb57dc6b71dcab00
#66 exporting layers
#66 exporting layers 21.8s done
#66 writing image sha256:03412a710050776f1862e9fd56b91e966e2858a71dc9f1dd821ffaf2aacc48f5 done
#66 DONE 21.8s
Pushing "sha256:184fd11abc04659d6ab0071aa1737ca3bcacee1f6b612807a6e6d5c937ece74b" to "ocurrent/opam-staging:ubuntu-20.04-opam-amd64" as user "ocurrentbuilder"
Login Succeeded
The push refers to repository [docker.io/ocurrent/opam-staging]
f3ef22358981: Preparing
error parsing HTTP 400 response body: invalid character '<' looking for beginning of value: "<html><body><h1>400 Bad request</h1>\nYour browser sent an invalid request.\n</body></html>\n\n"
docker-push failed with exit-code 1
2024-07-10 22:04.15: Job failed: Failed: Build failed
2024-07-10 22:04.15: Log analysis:
2024-07-10 22:04.15: >>> docker-push failed (score = 20)
2024-07-10 22:04.15: docker-push failed

shonfeder commented 3 months ago

2024-07-10 22:06.27: Will push staging image to ocurrent/opam-staging:debian-11-ocaml-4.03-i386
...
2024-07-10 22:06.27: Using cache hint "4.03.0-i386-ocurrent/opam-staging@sha256:0d421a01a2b832eaedec31c05dd0a87c337f036465a21e2b2e8af3f119b7578f"
2024-07-10 22:06.27: Waiting for resource in pool OCluster
2024-07-10 22:06.27: Waiting for worker…
2024-07-10 22:31.06: Got resource from pool OCluster
Building on x86-bm-c19.sw.ocaml.org
#2 [internal] load .dockerignore
#2 sha256:76716ffcb3cd99c3c374f52e5a45d9687189bdc321ad01196ed7d303fd040a64
#2 transferring context: 2B done
#2 DONE 0.4s

#1 [internal] load build definition from Dockerfile
#1 sha256:d1bbe7c7ab4dfa90070df180f90f841aeea20b486293a65facddf4ce6a55344f
#1 transferring dockerfile: 615B done
#1 DONE 0.3s

#3 resolve image config for docker.io/docker/dockerfile:1
#3 sha256:ac072d521901222eeef550f52282877f196e16b0247844be9ceb1ccc1eac391d
#3 DONE 1.7s

#4 docker-image://docker.io/docker/dockerfile:1@sha256:e87caa74dcb7d46cd820352bfea12591f3dba3ddc4285e19c7dcd13359f7cefd
#4 sha256:971261c9ec3d04b863c2e7e2301e85e136e954ddc12cdaba999b549fa96d15de
#4 CACHED
failed to solve with frontend dockerfile.v0: failed to solve with frontend gateway.v0: frontend grpc server closed unexpectedly
docker-build failed with exit-code 1
2024-07-10 22:32.14: Job failed: Failed: Build failed

shonfeder commented 3 months ago

Network issues when fetching sources are another source of flakey failures. See https://github.com/tarides/infrastructure/issues/338#issuecomment-2229229672

This error happens during execution of opam. E.g.,

#9 [2/5] RUN opam switch create 4.12 --packages=ocaml-variants.4.12.1+options,ocaml-options-only-fp
#9 sha256:274702d28af2649859867b3e2c572ebe7f008e65afcd884b035e062145beeafa
#9 7.654 
#9 7.654 <><> Gathering sources ><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
#9 8.320 [ocaml-config.2/gen_ocaml_config.ml.in] downloaded from https://raw.githubusercontent.com/ocaml/opam-source-archives/main/patches/ocaml-config/gen_ocaml_config.ml.in.2
#9 25.79 [ocaml-variants.4.12.1+options] downloaded from https://github.com/ocaml/ocaml/archive/4.12.1.tar.gz
#9 27.11 [ocaml-variants.4.12.1+options/alt-signal-stack.patch] downloaded from https://github.com/ocaml/ocaml/commit/1eeb0e7fe595f5f9e1ea1edbdf785ff3b49feeeb.patch?full_index=1
#9 27.32 [ocaml-variants.4.12.1+options/ocaml-variants.install] downloaded from https://raw.githubusercontent.com/ocaml/opam-source-archives/main/patches/ocaml-variants/ocaml-variants.install
#9 27.32 Switch initialisation failed: clean up? ('n' will leave the switch partially installed) [Y/n] y
#9 27.33 [ERROR] The sources of the following couldn't be obtained, aborting:
#9 27.33           - ocaml-config.2: Curl failed
#9 27.33 
#9 ERROR: executor failed running [/bin/sh -c opam switch create 4.12 --packages=ocaml-variants.4.12.1+options,ocaml-options-only-fp]: exit code: 40
------
 > [2/5] RUN opam switch create 4.12 --packages=ocaml-variants.4.12.1+options,ocaml-options-only-fp:
------
executor failed running [/bin/sh -c opam switch create 4.12 --packages=ocaml-variants.4.12.1+options,ocaml-options-only-fp]: exit code: 40
docker-build failed with exit-code 1
2024-07-15 15:24.03: Job failed: Failed: Build failed
2024-07-15 15:24.03: Log analysis:
2024-07-15 15:24.03: >>> The sources of the following couldn't be obtained, aborting:
#9 27.33           - ocaml-config.2: Curl failed (score = 50)
2024-07-15 15:24.03: Source download failed for ocaml-config.2: Curl failed

shonfeder commented 3 months ago

Notes from a discussion with @mtelvers today:

So our next step here is to open an issue upstream to discuss and choose between those two options.

shonfeder commented 1 month ago

Going by this week's builds, the most frequent case of this we have been coping with has been solved: afaik, they all completed without any need for restarts or intervention, save for the known issues on ocaml <4.08 for some distros.

I'm going to let this fall back into the backlog, then, until we are troubled by new problems.

shonfeder commented 2 weeks ago

Authentication errors due to networking issues or transient server-side problems are another class of failure that would benefit from retries (see https://github.com/tarides/infrastructure/issues/397).

Sep 25 00:27:15 x86-bm-c8.sw.ocaml.org dockerd[1862]: time="2024-09-25T00:27:15.646462710Z" level=info msg="Attempting next endpoint for pull after error: errors:\nunauthorized: authentication required\nunauthorized: authent>
Sep 25 00:27:15 x86-bm-c8.sw.ocaml.org dockerd[1862]: time="2024-09-25T00:27:15.646505824Z" level=info msg="Ignoring extra error returned from registry: unauthorized: authentication required"
Sep 25 00:27:15 x86-bm-c8.sw.ocaml.org dockerd[1862]: time="2024-09-25T00:27:15.648370147Z" level=error msg="Handler for POST /v1.41/images/create returned error: unauthorized: authentication required"
Sep 25 00:27:17 x86-bm-c8.sw.ocaml.org dockerd[1862]: time="2024-09-25T00:27:17.787262578Z" level=info msg="Attempting next endpoint for pull after error: errors:\nunauthorized: authentication required\nunauthorized: authent>
Sep 25 00:27:17 x86-bm-c8.sw.ocaml.org dockerd[1862]: time="2024-09-25T00:27:17.787333895Z" level=info msg="Ignoring extra error returned from registry: unauthorized: authentication required"
Sep 25 00:27:17 x86-bm-c8.sw.ocaml.org dockerd[1862]: time="2024-09-25T00:27:17.790197382Z" level=error msg="Handler for POST /v1.41/images/create returned error: unauthorized: authentication required"
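
The failures quoted in this thread could be grouped by the step they affect, which is one way a retry policy might decide what counts as a known flake. The following is a rough Python sketch for illustration only (the project is OCaml, and `FLAKY_PATTERNS` and `classify` are hypothetical names). The patterns are substrings copied from the logs above; treating every match as transient is an assumption, in particular for `unauthorized: authentication required`, which can also indicate a real credentials problem.

```python
import re
from typing import Optional

# (category, pattern) pairs; each pattern is a substring taken from a
# log quoted in this thread. The category names mirror the headings in
# the issue description plus the opam source-download case.
FLAKY_PATTERNS = [
    ("docker-build", re.compile(r"Temporary failure in name resolution")),
    ("docker-build", re.compile(r"frontend grpc server closed unexpectedly")),
    ("docker-push", re.compile(r"error parsing HTTP 400 response body")),
    ("source-download", re.compile(r"Curl failed")),
    ("docker-authentication", re.compile(r"unauthorized: authentication required")),
]


def classify(log: str) -> Optional[str]:
    """Return the category of the first known flaky pattern found in
    the log, or None if no pattern matches."""
    for category, pattern in FLAKY_PATTERNS:
        if pattern.search(log):
            return category
    return None
```

A log that matches none of the patterns yields `None`, so unknown failures would not be retried by default.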