Open tmcgilchrist opened 1 year ago
The most frequent category of error I've seen in my brief time monitoring this so far is a failed docker-push:
#66 exporting to image
#66 sha256:e8c613e07b0b7ff33893b694f7759a10d42e180f2b4dc349fb57dc6b71dcab00
#66 exporting layers
#66 exporting layers 21.8s done
#66 writing image sha256:03412a710050776f1862e9fd56b91e966e2858a71dc9f1dd821ffaf2aacc48f5 done
#66 DONE 21.8s
Pushing "sha256:184fd11abc04659d6ab0071aa1737ca3bcacee1f6b612807a6e6d5c937ece74b" to "ocurrent/opam-staging:ubuntu-20.04-opam-amd64" as user "ocurrentbuilder"
Login Succeeded
The push refers to repository [docker.io/ocurrent/opam-staging]
f3ef22358981: Preparing
error parsing HTTP 400 response body: invalid character '<' looking for beginning of value: "<html><body><h1>400 Bad request</h1>\nYour browser sent an invalid request.\n</body></html>\n\n"
docker-push failed with exit-code 1
2024-07-10 22:04.15: Job failed: Failed: Build failed
2024-07-10 22:04.15: Log analysis:
2024-07-10 22:04.15: >>> docker-push failed (score = 20)
2024-07-10 22:04.15: docker-push failed
2024-07-10 22:06.27: Will push staging image to ocurrent/opam-staging:debian-11-ocaml-4.03-i386
...
2024-07-10 22:06.27: Using cache hint "4.03.0-i386-ocurrent/opam-staging@sha256:0d421a01a2b832eaedec31c05dd0a87c337f036465a21e2b2e8af3f119b7578f"
2024-07-10 22:06.27: Waiting for resource in pool OCluster
2024-07-10 22:06.27: Waiting for worker…
2024-07-10 22:31.06: Got resource from pool OCluster
Building on x86-bm-c19.sw.ocaml.org
#2 [internal] load .dockerignore
#2 sha256:76716ffcb3cd99c3c374f52e5a45d9687189bdc321ad01196ed7d303fd040a64
#2 transferring context: 2B done
#2 DONE 0.4s
#1 [internal] load build definition from Dockerfile
#1 sha256:d1bbe7c7ab4dfa90070df180f90f841aeea20b486293a65facddf4ce6a55344f
#1 transferring dockerfile: 615B done
#1 DONE 0.3s
#3 resolve image config for docker.io/docker/dockerfile:1
#3 sha256:ac072d521901222eeef550f52282877f196e16b0247844be9ceb1ccc1eac391d
#3 DONE 1.7s
#4 docker-image://docker.io/docker/dockerfile:1@sha256:e87caa74dcb7d46cd820352bfea12591f3dba3ddc4285e19c7dcd13359f7cefd
#4 sha256:971261c9ec3d04b863c2e7e2301e85e136e954ddc12cdaba999b549fa96d15de
#4 CACHED
failed to solve with frontend dockerfile.v0: failed to solve with frontend gateway.v0: frontend grpc server closed unexpectedly
docker-build failed with exit-code 1
2024-07-10 22:32.14: Job failed: Failed: Build failed
Network issues when fetching sources are another source of flakey failures. See https://github.com/tarides/infrastructure/issues/338#issuecomment-2229229672
This error happens while opam is running, e.g.:
#9 [2/5] RUN opam switch create 4.12 --packages=ocaml-variants.4.12.1+options,ocaml-options-only-fp
#9 sha256:274702d28af2649859867b3e2c572ebe7f008e65afcd884b035e062145beeafa
#9 7.654
#9 7.654 <><> Gathering sources ><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
#9 8.320 [ocaml-config.2/gen_ocaml_config.ml.in] downloaded from https://raw.githubusercontent.com/ocaml/opam-source-archives/main/patches/ocaml-config/gen_ocaml_config.ml.in.2
#9 25.79 [ocaml-variants.4.12.1+options] downloaded from https://github.com/ocaml/ocaml/archive/4.12.1.tar.gz
#9 27.11 [ocaml-variants.4.12.1+options/alt-signal-stack.patch] downloaded from https://github.com/ocaml/ocaml/commit/1eeb0e7fe595f5f9e1ea1edbdf785ff3b49feeeb.patch?full_index=1
#9 27.32 [ocaml-variants.4.12.1+options/ocaml-variants.install] downloaded from https://raw.githubusercontent.com/ocaml/opam-source-archives/main/patches/ocaml-variants/ocaml-variants.install
#9 27.32 Switch initialisation failed: clean up? ('n' will leave the switch partially installed) [Y/n] y
#9 27.33 [ERROR] The sources of the following couldn't be obtained, aborting:
#9 27.33 - ocaml-config.2: Curl failed
#9 27.33
#9 ERROR: executor failed running [/bin/sh -c opam switch create 4.12 --packages=ocaml-variants.4.12.1+options,ocaml-options-only-fp]: exit code: 40
------
> [2/5] RUN opam switch create 4.12 --packages=ocaml-variants.4.12.1+options,ocaml-options-only-fp:
------
executor failed running [/bin/sh -c opam switch create 4.12 --packages=ocaml-variants.4.12.1+options,ocaml-options-only-fp]: exit code: 40
docker-build failed with exit-code 1
2024-07-15 15:24.03: Job failed: Failed: Build failed
2024-07-15 15:24.03: Log analysis:
2024-07-15 15:24.03: >>> The sources of the following couldn't be obtained, aborting:
#9 27.33 - ocaml-config.2: Curl failed (score = 50)
2024-07-15 15:24.03: Source download failed for ocaml-config.2: Curl failed
Notes from a discussion with @mtelvers today:
Lwt
and would only help us for the currents we are implementing. However, our current failures here are happening in currents provided by OCurrent.
So our next step here is to open an issue upstream to discuss and evaluate those two options.
The most frequent case of this we have been coping with has been solved, going by this week's builds, which, afaik, all completed without any need for restarts or intervention, save for the known issues on ocaml <4.08 for some distros.
I'm going to let this fall back in the backlog then until we are troubled by new problems.
Authentication errors due to networking issues or transient server-side problems are another class of failure that would benefit from retries (see https://github.com/tarides/infrastructure/issues/397).
Sep 25 00:27:15 x86-bm-c8.sw.ocaml.org dockerd[1862]: time="2024-09-25T00:27:15.646462710Z" level=info msg="Attempting next endpoint for pull after error: errors:\nunauthorized: authentication required\nunauthorized: authent>
Sep 25 00:27:15 x86-bm-c8.sw.ocaml.org dockerd[1862]: time="2024-09-25T00:27:15.646505824Z" level=info msg="Ignoring extra error returned from registry: unauthorized: authentication required"
Sep 25 00:27:15 x86-bm-c8.sw.ocaml.org dockerd[1862]: time="2024-09-25T00:27:15.648370147Z" level=error msg="Handler for POST /v1.41/images/create returned error: unauthorized: authentication required"
Sep 25 00:27:17 x86-bm-c8.sw.ocaml.org dockerd[1862]: time="2024-09-25T00:27:17.787262578Z" level=info msg="Attempting next endpoint for pull after error: errors:\nunauthorized: authentication required\nunauthorized: authent>
Sep 25 00:27:17 x86-bm-c8.sw.ocaml.org dockerd[1862]: time="2024-09-25T00:27:17.787333895Z" level=info msg="Ignoring extra error returned from registry: unauthorized: authentication required"
Sep 25 00:27:17 x86-bm-c8.sw.ocaml.org dockerd[1862]: time="2024-09-25T00:27:17.790197382Z" level=error msg="Handler for POST /v1.41/images/create returned error: unauthorized: authentication required"
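To make the retry idea concrete, here is a minimal sketch in Python of retrying a step with exponential backoff, but only when its output looks transient. It is an illustration under assumptions, not how OCurrent or docker-base-images handle this today: `run_with_retries`, `is_transient_auth_error`, the attempt count and the delays are all hypothetical, and treating the `unauthorized: authentication required` message from the dockerd log above as transient is itself an assumption.

```python
import time
from typing import Callable, Tuple

# Substring copied from the dockerd log above. Treating it as a transient
# failure (rather than a real credentials problem) is an assumption.
TRANSIENT_AUTH_MARKER = "unauthorized: authentication required"


def is_transient_auth_error(output: str) -> bool:
    """Return True if the step output contains the auth error seen in the logs."""
    return TRANSIENT_AUTH_MARKER in output


def run_with_retries(
    step: Callable[[], Tuple[int, str]],
    is_transient: Callable[[str], bool] = is_transient_auth_error,
    max_attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str, int]:
    """Run `step` until it succeeds, fails non-transiently, or attempts run out.

    `step` is a hypothetical callable returning (exit_code, combined_output).
    The step always runs at least once.
    Returns (exit_code, output, attempts_used) for the last attempt.
    """
    attempt = 0
    while True:
        attempt += 1
        exit_code, output = step()
        if exit_code == 0:
            return exit_code, output, attempt
        # Give up on errors we do not recognise, or once the budget is spent.
        if attempt >= max_attempts or not is_transient(output):
            return exit_code, output, attempt
        # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
        sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable so the backoff can be exercised in tests without waiting. A genuine credentials problem would still fail, just after `max_attempts` tries, so the attempt budget should stay small.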
Base image builder regularly errors on this transient issue:
It would be useful to immediately retry on known flakey errors.
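As a rough illustration of what "known flakey errors" could mean in practice, the sketch below matches a job log against the messages collected under "Known flakey errors" in this issue, grouped by the stage that failed. It is written in Python purely for illustration; the table name `KNOWN_FLAKEY_ERRORS` and the function `match_flakey_error` are hypothetical and do not exist in docker-base-images or OCurrent.

```python
import re
from typing import Dict, List, Optional, Pattern

# Patterns taken from the messages listed in this issue, keyed by failing stage.
# Literal messages are escaped; only the opam download pattern is a real regex.
KNOWN_FLAKEY_ERRORS: Dict[str, List[Pattern[str]]] = {
    "docker-build": [
        re.compile(re.escape(
            "dial tcp: lookup registry-1.docker.io: "
            "Temporary failure in name resolution"
        )),
        re.compile(re.escape(
            "failed to solve with frontend dockerfile.v0: "
            "failed to solve with frontend gateway.v0: "
            "frontend grpc server closed unexpectedly"
        )),
        re.compile(r"Source download failed for (.*): Curl failed"),
    ],
    "docker-push": [
        re.compile(re.escape("error parsing HTTP 400 response body")),
    ],
}


def match_flakey_error(stage: str, log: str) -> Optional[str]:
    """Return the matched text of the first known flakey error for `stage`.

    Returns None when the stage is unknown or nothing matches, in which case
    the failure should be reported as usual rather than retried.
    """
    for pattern in KNOWN_FLAKEY_ERRORS.get(stage, []):
        found = pattern.search(log)
        if found:
            return found.group(0)
    return None
```

A job runner could call `match_flakey_error("docker-push", log_text)` after a failed push and schedule an immediate retry only when the result is not `None`. Keeping the patterns in one table means a new flakey message can be added without touching the retry logic.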
Prerequisite
Known flakey errors
Flakey errors on docker-build:
- dial tcp: lookup registry-1.docker.io: Temporary failure in name resolution @ https://github.com/ocurrent/docker-base-images/issues/211#issue-1607610301
- failed to solve with frontend dockerfile.v0: failed to solve with frontend gateway.v0: frontend grpc server closed unexpectedly @ https://github.com/ocurrent/docker-base-images/issues/211#issuecomment-2221716687
- Source download failed for (.*): Curl failed @ https://github.com/ocurrent/docker-base-images/issues/211#issuecomment-2229370077
Flakey errors on docker-push:
- error parsing HTTP 400 response body: invalid character '<' looking for beginning of value: "<html><body><h1>400 Bad request</h1>\nYour browser sent an invalid request.\n</body></html>\n\n" @ https://github.com/ocurrent/docker-base-images/issues/211#issuecomment-2222768988
Flakey errors on docker authentication: