Closed mseri closed 3 months ago
These machines are geographically distributed, so there is no obvious common networking factor. Testing curl on a random sample of failing machines has also worked.
I'm gonna close this for now, since it seems to have been due to a transient networking issue. Let's reopen it if the problem recurs.
It happens to opam-repo-ci quite a lot. I'm curious: since you also run opam.ocaml.org with all the archives (but use opam-repository as a git repository in your CI images), why don't you put the line 'archive-mirrors: "https://opam.ocaml.org/cache"' into the ~/.opam/config file?
Since opam 2.1.5 this is respected, and opam will then use the opam.ocaml.org host when requesting archives (instead of going to GitHub or some other overloaded host)... That said, I'm not sure which opam versions your images have, or why.
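For anyone who wants to try this in an image build, here is a minimal sketch of adding that line idempotently. The field text is the one quoted above; the helper name `ensure_archive_mirror` and the "append only if no such field exists" policy are my own assumptions, not how the CI images are actually built.

```python
from pathlib import Path

# The line suggested in this thread, copied verbatim.
MIRROR_LINE = 'archive-mirrors: "https://opam.ocaml.org/cache"'


def ensure_archive_mirror(config_path: Path) -> bool:
    """Append the archive-mirrors line unless the file already has that field.

    Hypothetical helper for illustration. Returns True if the file changed.
    """
    text = config_path.read_text() if config_path.exists() else ""
    # Crude check: an existing archive-mirrors field is left untouched,
    # whatever its value.
    if any(line.lstrip().startswith("archive-mirrors:") for line in text.splitlines()):
        return False
    if text and not text.endswith("\n"):
        text += "\n"
    config_path.write_text(text + MIRROR_LINE + "\n")
    return True
```

Calling it as `ensure_archive_mirror(Path.home() / ".opam" / "config")` would target the file mentioned above; running it twice leaves the file unchanged the second time.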
Ah, I didn't realize this was chronic. That sounds like a good idea to me. I would reopen the issue but I don't have the permissions needed to do so. Are you able to, @mtelvers?
Well, I'm not sure about "chronic". What I see, e.g. at https://opam.ci.ocaml.org/github/ocaml/opam-repository/commit/c2ffca9c419985ae9191e0c272f33654d46a8eac, is various failures, including 57 curl failures.
I've rerun these failing jobs a few times (separated by several days, in hope that the network problem would get fixed), and they're still failing: https://opam.ci.ocaml.org/github/ocaml/opam-repository/commit/b69de340a94bb5bd475e25d32c407d590e87d37d
I took one of the tests and reproduced it locally (cut-and-paste the docker script, run it), and it worked fine: https://opam.ci.ocaml.org/github/ocaml/opam-repository/commit/b69de340a94bb5bd475e25d32c407d590e87d37d/variant/compilers,4.13,pa_ppx.0.15,tests Sadly, it takes 1008 sec to run on my not-weakling AMD Radeon box. Sigh.
So it does appear to be a problem with the CI infra.
I suspect that one could craft a custom opam package that would elicit the bug, and if that would be useful, I can do that. But I'd only do it if it were useful because, given the long waits for jobs to get scheduled, I suspect it would take a good number of weeks to narrow down to a minimal test that elicits the problem.
I suspect that this problem is caused by a rate limit from the source websites.
obuilder has a local cache on each worker that prevents repeated fetches of the same file that opam needs to download. This can typically be seen in action on the retrieved lines, as they include (cached).
<><> Processing actions <><><><><><><><><><><><><><><><><><><><><><><><><><><><>
-> retrieved angstrom.0.16.0 (cached)
-> retrieved astring.0.8.5 (cached)
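As a rough way to check how much of a job was served from the worker cache, the retrieved lines above can be split into cached and non-cached. This is only a sketch: the function name is made up, and it assumes every retrieved line ends with a parenthesised source, as in the excerpt.

```python
import re

# Matches lines shaped like "-> retrieved <pkg> (<source>)", as in the excerpt.
RETRIEVED = re.compile(r"^\s*->\s+retrieved\s+(\S+)\s+\((.*)\)\s*$")


def split_retrieved(log_lines):
    """Return (cached, fetched) package lists from an iterable of log lines."""
    cached, fetched = [], []
    for line in log_lines:
        m = RETRIEVED.match(line)
        if not m:
            continue  # not a "retrieved" line
        pkg, source = m.groups()
        (cached if source == "cached" else fetched).append(pkg)
    return cached, fetched
```

Anything that ends up in the second list was fetched from its source rather than from the local cache.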
The logs show we are recompiling OCaml. This is very curious to me.
The following actions will be performed:
=== recompile 4 packages
- recompile ocaml 4.14.2 [upstream or system changes]
- recompile ocaml-base-compiler 4.14.2 (pinned) [upstream or system changes]
The failed curl commands come from the build Makefile. These are not cached as they are straight invocations of curl.
Looking back in the old logs using grep for recompile ocaml-base-compiler shows that this behaviour began on 10th May. There are a number of commits to the base compiler packages on that day which may be the source of this problem.
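The grep described above amounts to a substring search over job logs. A minimal sketch, assuming the logs have already been loaded into a dict keyed by job name (the dict layout and the function name are illustrative, not how the CI stores its logs):

```python
def jobs_recompiling(logs, package="ocaml-base-compiler"):
    """Return the sorted names of jobs whose log contains 'recompile <package>'.

    `logs` maps a job name to that job's full log text (hypothetical layout).
    """
    needle = f"recompile {package}"
    return sorted(
        name
        for name, text in logs.items()
        if any(needle in line for line in text.splitlines())
    )
```

Sorting the matching job names by date would then show when the behaviour first appears.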
This issue occurs when the opam file for the ocaml-base-compiler differs from the one included in the Docker base image.
It has stopped and started several times as PRs are merged and base images have been rebuilt.
I have created a PR https://github.com/ocaml/opam/pull/6032 to work around ocaml/ocaml#13237 and while I wait for it to be merged, I have hacked up a commit on ocurrent/ocaml-dockerfile and used that to rebuild the base images using my own instance.
I am pleased to report that this is having a beneficial result. Between midnight and 6am, opam-repo-ci rebuilt the compiler 32,000 times. In the following 6 hours, that figure has dropped to 147.
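To put those two figures side by side (both cover a six-hour window, so they can be compared directly), a quick calculation using only the numbers quoted above:

```python
before = 32_000  # compiler rebuilds between midnight and 6am
after = 147      # compiler rebuilds in the following 6 hours

# Fraction of rebuilds eliminated between the two windows.
reduction = 1 - after / before
print(f"{reduction:.1%}")  # prints "99.5%"
```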
I'm not sure if this is supposed to have solved the problem, but I figured I'd report back that the problem persists. I reran this test just now, and it still fails: https://opam.ci.ocaml.org/github/ocaml/opam-repository/commit/b69de340a94bb5bd475e25d32c407d590e87d37d/variant/compilers,4.13,pa_ppx.0.15,tests With (of course) the expected long list of curl failures.
Thanks for your report. Unfortunately, I must have missed some of the base images and/or docker peek jobs, as a small percentage of jobs, such as the one you linked, are still rebuilding the compiler and generating the curl failures. Since my last update, the opam PR has been merged, and the base image builder is running (https://images.ci.ocaml.org). I'll check on the progress in the morning.
Thanks for replying so quickly! Can you update this thread when they're all finished (I clicked through to that link, but don't know how to interpret what's shown)? Then I'll rerun the CI jobs and report back on what happens.
@chetmurthy I've rebuilt the failed jobs, so we now only have a single curl failure. This remaining one is because the Debian 10 base images won't rebuild, as Debian has dropped their ppc64le and s390x mirror ahead of deprecating Debian 10 at the end of the month. I've created PR https://github.com/ocurrent/ocaml-dockerfile/pull/209 to remove these Debian 10 variants.
Kudos to @mtelvers for his work on recovering from this, and on driving forward fixes for the root cause. Followups (including ways to catch this kind of thing earlier) are being tracked in other issues.
Thanks for the report @mseri and @chetmurthy :pray:
They look like
See e.g. https://github.com/ocaml/opam-repository/pull/26044