ocaml / infrastructure

Wiki to hold the information about the machine resources available to OCaml.org

curl failures all over the place #128

Closed (mseri closed this issue 3 months ago)

mseri commented 3 months ago

They look like

#=== ERROR while fetching sources for yojson.2.2.1 ============================#
OpamSolution.Fetch_fail("https://github.com/ocaml-community/yojson/releases/download/2.2.1/yojson-2.2.1.tbz (Curl failed: \"/usr/bin/curl --write-out %{http_code}\\\\n --retry 3 --retry-delay 2 --user-agent opam/2.2.0~beta3~dev -L -o /home/opam/.opam/4.13/.opam-switch/sources/yojson.2.2.1/yojson-2.2.1.tbz.part -- https://github.com/ocaml-community/yojson/releases/download/2.2.1/yojson-2.2.1.tbz\" exited with code 6)")

#=== ERROR while fetching sources for uint.2.0.1 ==============================#
OpamSolution.Fetch_fail("https://github.com/andrenth/ocaml-uint/archive/2.0.1.tar.gz (Curl failed: \"/usr/bin/curl --write-out %{http_code}\\\\n --retry 3 --retry-delay 2 --user-agent opam/2.2.0~beta3~dev -L -o /home/opam/.opam/4.13/.opam-switch/sources/uint.2.0.1/2.0.1.tar.gz.part -- https://github.com/andrenth/ocaml-uint/archive/2.0.1.tar.gz\" exited with code 6)")

#=== ERROR while fetching sources for topkg.1.0.7 =============================#
OpamSolution.Fetch_fail("https://erratique.ch/software/topkg/releases/topkg-1.0.7.tbz (Curl failed: \"/usr/bin/curl --write-out %{http_code}\\\\n --retry 3 --retry-delay 2 --user-agent opam/2.2.0~beta3~dev -L -o /home/opam/.opam/4.13/.opam-switch/sources/topkg.1.0.7/topkg-1.0.7.tbz.part -- https://erratique.ch/software/topkg/releases/topkg-1.0.7.tbz\" exited with code 6)")

#=== ERROR while fetching sources for stdlib-shims.0.3.0 ======================#
OpamSolution.Fetch_fail("https://github.com/ocaml/stdlib-shims/releases/download/0.3.0/stdlib-shims-0.3.0.tbz (Curl failed: \"/usr/bin/curl --write-out %{http_code}\\\\n --retry 3 --retry-delay 2 --user-agent opam/2.2.0~beta3~dev -L -o /home/opam/.opam/4.13/.opam-switch/sources/stdlib-shims.0.3.0/stdlib-shims-0.3.0.tbz.part -- https://github.com/ocaml/stdlib-shims/releases/download/0.3.0/stdlib-shims-0.3.0.tbz\" exited with code 6)")

#=== ERROR while fetching sources for stdint.0.7.2 ============================#
OpamSolution.Fetch_fail("https://github.com/andrenth/ocaml-stdint/releases/download/0.7.2/stdint-0.7.2.tbz (Curl failed: \"/usr/bin/curl --write-out %{http_code}\\\\n --retry 3 --retry-delay 2 --user-agent opam/2.2.0~beta3~dev -L -o /home/opam/.opam/4.13/.opam-switch/sources/stdint.0.7.2/stdint-0.7.2.tbz.part -- https://github.com/andrenth/ocaml-stdint/releases/download/0.7.2/stdint-0.7.2.tbz\" exited with code 6)")

#=== ERROR while fetching sources for seq.base ================================#
OpamSolution.Fetch_fail("https://raw.githubusercontent.com/ocaml/opam-source-archives/main/patches/seq/META.seq (Curl failed: \"/usr/bin/curl --write-out %{http_code}\\\\n --retry 3 --retry-delay 2 --user-agent opam/2.2.0~beta3~dev -L -o /tmp/opam-7-cb0ac2/META.seq.part -- https://raw.githubusercontent.com/ocaml/opam-source-archives/main/patches/seq/META.seq\" exited with code 6)")

#=== ERROR while fetching sources for rresult.0.7.0 ===========================#
OpamSolution.Fetch_fail("https://erratique.ch/software/rresult/releases/rresult-0.7.0.tbz (Curl failed: \"/usr/bin/curl --write-out %{http_code}\\\\n --retry 3 --retry-delay 2 --user-agent opam/2.2.0~beta3~dev -L -o /home/opam/.opam/4.13/.opam-switch/sources/rresult.0.7.0/rresult-0.7.0.tbz.part -- https://erratique.ch/software/rresult/releases/rresult-0.7.0.tbz\" exited with code 6)")

#=== ERROR while fetching sources for result.1.5 ==============================#
OpamSolution.Fetch_fail("https://github.com/janestreet/result/releases/download/1.5/result-1.5.tbz (Curl failed: \"/usr/bin/curl --write-out %{http_code}\\\\n --retry 3 --retry-delay 2 --user-agent opam/2.2.0~beta3~dev -L -o /home/opam/.opam/4.13/.opam-switch/sources/result.1.5/result-1.5.tbz.part -- https://github.com/janestreet/result/releases/download/1.5/result-1.5.tbz\" exited with code 6)")

#=== ERROR while fetching sources for re.1.11.0 ===============================#
OpamSolution.Fetch_fail("https://github.com/ocaml/ocaml-re/releases/download/1.11.0/re-1.11.0.tbz (Curl failed: \"/usr/bin/curl --write-out %{http_code}\\\\n --retry 3 --retry-delay 2 --user-agent opam/2.2.0~beta3~dev -L -o /home/opam/.opam/4.13/.opam-switch/sources/re.1.11.0/re-1.11.0.tbz.part -- https://github.com/ocaml/ocaml-re/releases/download/1.11.0/re-1.11.0.tbz\" exited with code 6)")

#=== ERROR while fetching sources for pcre2.7.5.2 =============================#
OpamSolution.Fetch_fail("https://github.com/camlp5/pcre2-ocaml/releases/download/7.5.2/pcre2-7.5.2.tbz (Curl failed: \"/usr/bin/curl --write-out %{http_code}\\\\n --retry 3 --retry-delay 2 --user-agent opam/2.2.0~beta3~dev -L -o /home/opam/.opam/4.13/.opam-switch/sources/pcre2.7.5.2/pcre2-7.5.2.tbz.part -- https://github.com/camlp5/pcre2-ocaml/releases/download/7.5.2/pcre2-7.5.2.tbz\" exited with code 6)")

#=== ERROR while fetching sources for ounit.2.2.7 and ounit2.2.2.7 ============#
OpamSolution.Fetch_fail("https://github.com/gildor478/ounit/releases/download/v2.2.7/ounit-2.2.7.tbz (Curl failed: \"/usr/bin/curl --write-out %{http_code}\\\\n --retry 3 --retry-delay 2 --user-agent opam/2.2.0~beta3~dev -L -o /tmp/opam-7-c68912/ounit-2.2.7.tbz.part -- https://github.com/gildor478/ounit/releases/download/v2.2.7/ounit-2.2.7.tbz\" exited with code 6)")

#=== ERROR while fetching sources for ocamlgraph.2.1.0 ========================#
OpamSolution.Fetch_fail("https://github.com/backtracking/ocamlgraph/releases/download/2.1.0/ocamlgraph-2.1.0.tbz (Curl failed: \"/usr/bin/curl --write-out %{http_code}\\\\n --retry 3 --retry-delay 2 --user-agent opam/2.2.0~beta3~dev -L -o /home/opam/.opam/4.13/.opam-switch/sources/ocamlgraph.2.1.0/ocamlgraph-2.1.0.tbz.part -- https://github.com/backtracking/ocamlgraph/releases/download/2.1.0/ocamlgraph-2.1.0.tbz\" exited with code 6)")

#=== ERROR while fetching sources for ocamlfind.1.9.6 =========================#
OpamSolution.Fetch_fail("http://download.camlcity.org/download/findlib-1.9.6.tar.gz (Curl failed: \"/usr/bin/curl --write-out %{http_code}\\\\n --retry 3 --retry-delay 2 --user-agent opam/2.2.0~beta3~dev -L -o /home/opam/.opam/4.13/.opam-switch/sources/ocamlfind.1.9.6/findlib-1.9.6.tar.gz.part -- http://download.camlcity.org/download/findlib-1.9.6.tar.gz\" exited with code 6)")

#=== ERROR while fetching sources for ocamlbuild.0.14.3 =======================#
OpamSolution.Fetch_fail("https://github.com/ocaml/ocamlbuild/archive/refs/tags/0.14.3.tar.gz (Curl failed: \"/usr/bin/curl --write-out %{http_code}\\\\n --retry 3 --retry-delay 2 --user-agent opam/2.2.0~beta3~dev -L -o /home/opam/.opam/4.13/.opam-switch/sources/ocamlbuild.0.14.3/0.14.3.tar.gz.part -- https://github.com/ocaml/ocamlbuild/archive/refs/tags/0.14.3.tar.gz\" exited with code 6)")

#=== ERROR while fetching sources for not-ocamlfind.0.13 ======================#
OpamSolution.Fetch_fail("https://github.com/chetmurthy/not-ocamlfind/archive/refs/tags/0.13.tar.gz (Curl failed: \"/usr/bin/curl --write-out %{http_code}\\\\n --retry 3 --retry-delay 2 --user-agent opam/2.2.0~beta3~dev -L -o /home/opam/.opam/4.13/.opam-switch/sources/not-ocamlfind.0.13/0.13.tar.gz.part -- https://github.com/chetmurthy/not-ocamlfind/archive/refs/tags/0.13.tar.gz\" exited with code 6)")

#=== ERROR while fetching sources for logs.0.7.0 ==============================#
OpamSolution.Fetch_fail("https://erratique.ch/software/logs/releases/logs-0.7.0.tbz (Curl failed: \"/usr/bin/curl --write-out %{http_code}\\\\n --retry 3 --retry-delay 2 --user-agent opam/2.2.0~beta3~dev -L -o /home/opam/.opam/4.13/.opam-switch/sources/logs.0.7.0/logs-0.7.0.tbz.part -- https://erratique.ch/software/logs/releases/logs-0.7.0.tbz\" exited with code 6)")

#=== ERROR while fetching sources for fpath.0.7.3 =============================#
OpamSolution.Fetch_fail("https://erratique.ch/software/fpath/releases/fpath-0.7.3.tbz (Curl failed: \"/usr/bin/curl --write-out %{http_code}\\\\n --retry 3 --retry-delay 2 --user-agent opam/2.2.0~beta3~dev -L -o /home/opam/.opam/4.13/.opam-switch/sources/fpath.0.7.3/fpath-0.7.3.tbz.part -- https://erratique.ch/software/fpath/releases/fpath-0.7.3.tbz\" exited with code 6)")

#=== ERROR while fetching sources for fmt.0.9.0 ===============================#
OpamSolution.Fetch_fail("https://erratique.ch/software/fmt/releases/fmt-0.9.0.tbz (Curl failed: \"/usr/bin/curl --write-out %{http_code}\\\\n --retry 3 --retry-delay 2 --user-agent opam/2.2.0~beta3~dev -L -o /home/opam/.opam/4.13/.opam-switch/sources/fmt.0.9.0/fmt-0.9.0.tbz.part -- https://erratique.ch/software/fmt/releases/fmt-0.9.0.tbz\" exited with code 6)")

#=== ERROR while fetching sources for dune.3.15.3 and dune-configurator.3.15.3 #
OpamSolution.Fetch_fail("https://github.com/ocaml/dune/releases/download/3.15.3/dune-3.15.3.tbz (Curl failed: \"/usr/bin/curl --write-out %{http_code}\\\\n --retry 3 --retry-delay 2 --user-agent opam/2.2.0~beta3~dev -L -o /tmp/opam-7-819505/dune-3.15.3.tbz.part -- https://github.com/ocaml/dune/releases/download/3.15.3/dune-3.15.3.tbz\" exited with code 6)")

#=== ERROR while fetching sources for csexp.1.5.2 =============================#
OpamSolution.Fetch_fail("https://github.com/ocaml-dune/csexp/releases/download/1.5.2/csexp-1.5.2.tbz (Curl failed: \"/usr/bin/curl --write-out %{http_code}\\\\n --retry 3 --retry-delay 2 --user-agent opam/2.2.0~beta3~dev -L -o /home/opam/.opam/4.13/.opam-switch/sources/csexp.1.5.2/csexp-1.5.2.tbz.part -- https://github.com/ocaml-dune/csexp/releases/download/1.5.2/csexp-1.5.2.tbz\" exited with code 6)")

#=== ERROR while fetching sources for cppo.1.6.9 ==============================#
OpamSolution.Fetch_fail("https://github.com/ocaml-community/cppo/archive/v1.6.9.tar.gz (Curl failed: \"/usr/bin/curl --write-out %{http_code}\\\\n --retry 3 --retry-delay 2 --user-agent opam/2.2.0~beta3~dev -L -o /home/opam/.opam/4.13/.opam-switch/sources/cppo.1.6.9/v1.6.9.tar.gz.part -- https://github.com/ocaml-community/cppo/archive/v1.6.9.tar.gz\" exited with code 6)")

#=== ERROR while fetching sources for camlp5-buildscripts.0.03 ================#
OpamSolution.Fetch_fail("https://github.com/camlp5/camlp5-buildscripts/archive/refs/tags/0.03.tar.gz (Curl failed: \"/usr/bin/curl --write-out %{http_code}\\\\n --retry 3 --retry-delay 2 --user-agent opam/2.2.0~beta3~dev -L -o /home/opam/.opam/4.13/.opam-switch/sources/camlp5-buildscripts.0.03/0.03.tar.gz.part -- https://github.com/camlp5/camlp5-buildscripts/archive/refs/tags/0.03.tar.gz\" exited with code 6)")

#=== ERROR while fetching sources for camlp5.8.03.00 ==========================#
OpamSolution.Fetch_fail("https://github.com/camlp5/camlp5/archive/refs/tags/8.03.00.tar.gz (Curl failed: \"/usr/bin/curl --write-out %{http_code}\\\\n --retry 3 --retry-delay 2 --user-agent opam/2.2.0~beta3~dev -L -o /home/opam/.opam/4.13/.opam-switch/sources/camlp5.8.03.00/8.03.00.tar.gz.part -- https://github.com/camlp5/camlp5/archive/refs/tags/8.03.00.tar.gz\" exited with code 6)")

#=== ERROR while fetching sources for camlp-streams.5.0.1 =====================#
OpamSolution.Fetch_fail("https://github.com/ocaml/camlp-streams/archive/v5.0.1.tar.gz (Curl failed: \"/usr/bin/curl --write-out %{http_code}\\\\n --retry 3 --retry-delay 2 --user-agent opam/2.2.0~beta3~dev -L -o /home/opam/.opam/4.13/.opam-switch/sources/camlp-streams.5.0.1/v5.0.1.tar.gz.part -- https://github.com/ocaml/camlp-streams/archive/v5.0.1.tar.gz\" exited with code 6)")

#=== ERROR while fetching sources for bos.0.2.1 ===============================#
OpamSolution.Fetch_fail("https://erratique.ch/software/bos/releases/bos-0.2.1.tbz (Curl failed: \"/usr/bin/curl --write-out %{http_code}\\\\n --retry 3 --retry-delay 2 --user-agent opam/2.2.0~beta3~dev -L -o /home/opam/.opam/4.13/.opam-switch/sources/bos.0.2.1/bos-0.2.1.tbz.part -- https://erratique.ch/software/bos/releases/bos-0.2.1.tbz\" exited with code 6)")

#=== ERROR while fetching sources for astring.0.8.5 ===========================#
OpamSolution.Fetch_fail("https://erratique.ch/software/astring/releases/astring-0.8.5.tbz (Curl failed: \"/usr/bin/curl --write-out %{http_code}\\\\n --retry 3 --retry-delay 2 --user-agent opam/2.2.0~beta3~dev -L -o /home/opam/.opam/4.13/.opam-switch/sources/astring.0.8.5/astring-0.8.5.tbz.part -- https://erratique.ch/software/astring/releases/astring-0.8.5.tbz\" exited with code 6)")

See e.g. https://github.com/ocaml/opam-repository/pull/26044
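
Every failure above ends with `exited with code 6`, which curl documents as "could not resolve host". To see whether one upstream dominates, a log like this can be grouped by host and exit code. The following is a minimal sketch, assuming the log is available as plain text; the function name `summarize_fetch_failures` and the regular expression are illustrative and not part of opam or the CI tooling.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Captures the source URL at the start of a Fetch_fail message and the
# curl exit code at its end. The pattern is an assumption based on the
# log lines quoted in this issue.
FETCH_FAIL = re.compile(
    r'Fetch_fail\("(?P<url>\S+) \(Curl failed:.*exited with code (?P<code>\d+)\)"\)'
)

def summarize_fetch_failures(log_text):
    """Count failures per (host, curl exit code) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FETCH_FAIL.search(line)
        if match:
            host = urlparse(match.group("url")).netloc
            counts[(host, int(match.group("code")))] += 1
    return counts
```

Calling `summarize_fetch_failures(log_text).most_common()` lists the hosts with the most failures first.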

mtelvers commented 3 months ago

These machines are geographically distributed, so there is no obvious common networking factor. Testing curl on a random sample of failing machines has also worked.

shonfeder commented 3 months ago

I'm gonna close this for now, since it seems to have been due to a transient networking issue. Let's reopen it if the problem recurs.

hannesm commented 3 months ago

It happens to opam-repo-ci quite a lot. I'm curious: since you also run opam.ocaml.org with all the archives (but use opam-repository as a git repository in your CI images), why don't you put a line 'archive-mirrors: "https://opam.ocaml.org/cache"' into the ~/.opam/config file?

Since opam 2.1.5 this is respected and will then use the opam.ocaml.org host for requesting archives (instead of going to github or some other overloaded host)... Now I'm not sure anymore which opam versions your images have and why.
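
A minimal sketch of how an image build script might add that line, assuming the config file is edited as plain text. The helper name `ensure_archive_mirror` is hypothetical; only the `archive-mirrors` line itself comes from the suggestion above. It leaves the file untouched when the field is already set, so the field is never duplicated.

```python
def ensure_archive_mirror(config_text, url="https://opam.ocaml.org/cache"):
    """Return config_text with an archive-mirrors field added if missing."""
    for line in config_text.splitlines():
        if line.strip().startswith("archive-mirrors:"):
            # Already configured: keep the existing value as-is.
            return config_text
    # Make sure the new field starts on its own line.
    if config_text and not config_text.endswith("\n"):
        config_text += "\n"
    return config_text + f'archive-mirrors: "{url}"\n'
```

The result would then be written back to ~/.opam/config before any `opam install` step runs in the image.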

shonfeder commented 3 months ago

Ah, I didn't realize this was chronic. That sounds like a good idea to me. I would reopen the issue but I don't have the permissions needed to do so. Are you able to @mtelvers ?

hannesm commented 3 months ago

Well, I'm not sure about "chronic". What I see at e.g. https://opam.ci.ocaml.org/github/ocaml/opam-repository/commit/c2ffca9c419985ae9191e0c272f33654d46a8eac is a variety of failures, including 57 curl failures.

chetmurthy commented 3 months ago

I've rerun these failing jobs a few times (separated by several days, in hope that the network problem would get fixed), and they're still failing: https://opam.ci.ocaml.org/github/ocaml/opam-repository/commit/b69de340a94bb5bd475e25d32c407d590e87d37d

I took one of the tests and repro-ed locally (cut-and-paste docker script, run it), and it worked fine: https://opam.ci.ocaml.org/github/ocaml/opam-repository/commit/b69de340a94bb5bd475e25d32c407d590e87d37d/variant/compilers,4.13,pa_ppx.0.15,tests Sadly, it takes 1008sec to run on my not-weakling AMD Radeon box. sigh.

So it does appear to be a problem with the CI infra.

I suspect that one could craft a custom OPAM package that would elicit the bug, and if that would be useful, I can do that. But I'd only do that if it were useful, b/c ..... given the long waits for jobs to get scheduled, I suspect that it'd take a good number of weeks to narrow down to a minimal test that elicits the problem.

mtelvers commented 3 months ago

I suspect that this problem is caused by a rate limit from the source websites.

obuilder has a local cache on each worker that prevents repeated fetches of the same file that opam needs to download. This can typically be seen in the "retrieved" lines of a job log, which include (cached):

<><> Processing actions <><><><><><><><><><><><><><><><><><><><><><><><><><><><>
-> retrieved angstrom.0.16.0  (cached)
-> retrieved astring.0.8.5  (cached)

The logs show we are recompiling OCaml. This is very curious to me.

The following actions will be performed:
=== recompile 4 packages
  - recompile ocaml               4.14.2          [upstream or system changes]
  - recompile ocaml-base-compiler 4.14.2 (pinned) [upstream or system changes]

The failed curl commands come from the build Makefile. These are not cached as they are straight invocations of curl.

Looking back in the old logs using grep for recompile ocaml-base-compiler shows that this behaviour began on 10th May. There are a number of commits to the base compiler packages on that day which may be the source of this problem.
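
The same search over saved logs can be sketched as follows, assuming the logs have been loaded into a dictionary of name to text. The function name `logs_recompiling_compiler` and the input layout are hypothetical; the pattern follows the "recompile" action lines quoted above.

```python
import re

# Matches an action line such as
#   "  - recompile ocaml-base-compiler 4.14.2 (pinned) [upstream or system changes]"
RECOMPILE = re.compile(r"^\s*-\s+recompile\s+ocaml-base-compiler\b", re.MULTILINE)

def logs_recompiling_compiler(logs):
    """Given {log_name: log_text}, return sorted names of logs that recompile the base compiler."""
    return sorted(name for name, text in logs.items() if RECOMPILE.search(text))
```

If the log names carry a date, sorting them shows when the behaviour first appeared.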

mtelvers commented 3 months ago

This issue occurs when the opam file for the ocaml-base-compiler differs from the one included in the Docker base image.

It has stopped and started several times as PRs are merged and base images have been rebuilt.

mtelvers commented 3 months ago

I have created a PR https://github.com/ocaml/opam/pull/6032 to work around ocaml/ocaml#13237 and while I wait for it to be merged, I have hacked up a commit on ocurrent/ocaml-dockerfile and used that to rebuild the base images using my own instance.

I am pleased to report that this is having a beneficial result. Between midnight and 6am, opam-repo-ci rebuilt the compiler 32,000 times. In the following 6 hours, that figure has dropped to 147.

chetmurthy commented 3 months ago

I'm not sure if this is supposed to have solved the problem, but I figured I'd report back that the problem persists. I reran this test just now, and it still fails: https://opam.ci.ocaml.org/github/ocaml/opam-repository/commit/b69de340a94bb5bd475e25d32c407d590e87d37d/variant/compilers,4.13,pa_ppx.0.15,tests With (of course) the expected long list of curl failures.

mtelvers commented 3 months ago

Thanks for your report. Unfortunately, I must have missed some of the base images and/or docker peek jobs, as a small percentage of jobs, such as the one you linked, are still rebuilding the compiler and generating the curl failures. Since my last update, the opam PR has been merged, and the base image builder is running (https://images.ci.ocaml.org). I'll check on the progress in the morning.

chetmurthy commented 3 months ago

Thanks for replying so quickly! Can you update this thread when they're all finished (I clicked-thru to that link, but don't know how to interpret what's shown) and I'll rerun the CI jobs and report back on what happens ?

mtelvers commented 3 months ago

@chetmurthy I've rebuilt the failed jobs, so we now only have a single curl failure. This remaining one is because the Debian 10 base images won't rebuild, as Debian has dropped their ppc64le and s390x mirror ahead of deprecating Debian 10 at the end of the month. I've created PR https://github.com/ocurrent/ocaml-dockerfile/pull/209 to remove these Debian 10 variants.

shonfeder commented 3 months ago

Kudos to @mtelvers for his work on recovering from this, and on driving forward fixes for the root cause. Followups (including ways to catch this kind of thing earlier) are being tracked in other issues.

Thanks for the report @mseri and @chetmurthy :pray: