ocaml / infrastructure

Wiki to hold the information about the machine resources available to OCaml.org

opam.ocaml.org does not get updated (since ~2 days) #48

Closed hannesm closed 6 months ago

hannesm commented 1 year ago

Dear Madam or Sir,

first of all thanks for running opam.ocaml.org as a community service. :)

I noticed from opam update that the opam.ocaml.org hosts have not been updated since Sunday June 4th 19:04:41 2023 +0100 (commit 9681b042, according to the repo file of the opam.ocaml.org hosts).

I'm curious how to move forward here. Is the infrastructure and its setup/deployment perhaps a bit too involved (in terms of complexity)? It requires GitHub, some machines to produce artifacts (Docker images), Docker Hub for upload and download, and some other machines to execute things -- especially given the recent issues in this area: the IPv6 outage, and the failure to update some of the machines that serve the repository (missing ssh key).

Another question is whether you have monitoring of the opam.ocaml.org service (covering the key things: is it online, does it reply to HTTP requests, does it serve an up-to-date archive), and if yes, is that monitoring available online somewhere? (I suggest setting up a "status.opam.ocaml.org" with some information, and maybe post-mortems about the issues that happened in recent months.)
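A freshness probe of the kind suggested here could be sketched in POSIX shell. Everything in this sketch is an illustrative assumption, not existing infrastructure: the function names, the 24-hour threshold, and the index.tar.gz URL mentioned in the comment.

```shell
#!/bin/sh
# Hypothetical freshness check for an opam mirror. All names and the
# 24-hour threshold are illustrative assumptions, not a real monitor.

# Hours elapsed between two epoch timestamps.
staleness_hours() {
    echo $(( ($2 - $1) / 3600 ))
}

# Report OK/STALE given the archive's last-update time and "now".
# A real probe would obtain the first argument from the server, e.g.
# by parsing the Last-Modified header of
#   curl -sI https://opam.ocaml.org/index.tar.gz
check_freshness() {
    hours=$(staleness_hours "$1" "$2")
    if [ "$hours" -gt 24 ]; then
        echo "STALE: $hours hours behind"
    else
        echo "OK: $hours hours behind"
    fi
}
```

For example, `check_freshness 0 90000` (timestamps 25 hours apart) prints `STALE: 25 hours behind`.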

I hope this is the right repository to report this issue to. In case you have any questions or want to discuss this topic further, don't hesitate to reach out to me.

avsm commented 1 year ago

The failure here is due to a build error, not due to an infrastructure error. If anything, we need a bit more infrastructure to alert us when the opam.ocaml.org pushes fail (a Matrix channel would be ideal).

See deploy.ci.ocaml.org, and after clicking on the logs tab and looking through the table, I spotted https://deploy.ci.ocaml.org/job/2023-06-06/160459-ocluster-build-dcd1d2, which in turn shows that opam2web is failing:

 checking for OCaml findlib package unix... found
#37 3.289 checking for OCaml findlib package bigarray... found
#37 3.291 checking for OCaml findlib package re 1.9.0 or later... found 1.10.4
#37 3.297 checking for OCaml findlib package base64 3.1.0 or later... no
#37 3.298 checking for OCaml findlib package cmdliner... found
#37 3.300 checking for OCaml findlib package ocamlgraph... not found
#37 3.302 checking for OCaml findlib package cudf 0.7 or later... no
#37 3.303 checking for OCaml findlib package dose3.common 6.1 or later... no
#37 3.305 checking for OCaml findlib package dose3.algo 6.1 or later... no
#37 3.307 checking for OCaml findlib package opam-file-format 2.1.4 or later... no
#37 3.309 checking for OCaml findlib package spdx_licenses... not found
#37 3.311 checking for OCaml findlib package opam-0install-cudf 0.4 or later... no
#37 3.312 checking for OCaml findlib package jsonm... not found
#37 3.314 checking for OCaml findlib package uutf... found
#37 3.315 checking for OCaml findlib package sha... not found
#37 3.316 checking for OCaml findlib package swhid_core... not found
#37 3.318 checking for OCaml findlib package mccs 1.1+9 or later... no
#37 3.320 
#37 3.320 configure: error: Dependencies missing. Use --with-vendored-deps or --disable-checks
#37 ERROR: executor failed running [/bin/sh -c opam exec -- ./configure --without-mccs && opam exec -- make lib-ext && opam exec -- make]: exit code: 1

I've not root caused this further yet... /cc @mtelvers @tmcgilchrist

avsm commented 1 year ago

It looks like the deployer is building all branches of opam2web for some reason. That seems like it could be tightened down to just the live and staging branches.

Regarding this question by @hannesm:

Another question is whether you have monitoring of the opam.ocaml.org service (covering the key things: is it online, does it reply to HTTP requests, does it serve an up-to-date archive), and if yes, is that monitoring available online somewhere? (I suggest setting up a "status.opam.ocaml.org" with some information, and maybe post-mortems about the issues that happened in recent months.)

... the tracking issue is #31

tmcgilchrist commented 1 year ago

It looks like the deployer is building all branches of opam2web for some reason. That seems like it could be tightened down to just the live and staging branches.

That is happening by design, to check that any PRs are deployable before merging to live or staging. We could remove that behaviour, but I would advise against it. There are many stale branches on opam2web; could they be cleaned up to just what is required?

The failure here is due to a build error, not due to an infrastructure error. If anything, we need a bit more infrastructure to alert us when the opam.ocaml.org pushes fail (a Matrix channel would be ideal).

Tracking issue is https://github.com/ocurrent/ocurrent-deployer/issues/111. Is there a Matrix channel / server available for posting messages to? The current plan was to post to the Slack channel for opam-maintainers. Usually the issue is the large size of the Docker image created and the build timing out or getting rate-limited by Docker Hub.

The longer term fix for the opam2web size issue is to move the documentation into the new ocaml.org website and have opam2web just build the index file.

avsm commented 1 year ago

Thanks @tmcgilchrist, the deployability checks do indeed make sense. I think the real blocker to debugging what's going on is the lack of historical build information, which I've posted about at https://github.com/ocurrent/ocurrent-deployer/issues/190. Without that, there's not much point in having the web interface for the deployer, as it only ever shows the current (and long-running) build.

I've set up a simple Matrix room on #ocaml-infra:recoil.org which we can use for notifications. Once it's working, we can alias to another homeserver (for redundancy) and then add it to the OCaml space.

mtelvers commented 1 year ago

How about removing the arm64 build, as we only deploy the x86_64 version? Both builds happen in parallel, so removing one wouldn't make anything quicker, but both builds must succeed to proceed to the next stage of the pipeline, so there would be one fewer dependency. Should save a bit of carbon too!

hannesm commented 1 year ago

First of all, thanks for fixing the update process (in case you did something, at least there was an update of the opam repository on opam.ocaml.org).

Second, I'll close this issue. I have the feeling that you're convinced that the current system and its complexity are necessary for getting any commit to the opam-repository deployed, and that shoveling huge Docker images across the Internet for deployment is unavoidable. My approach would be radically different: I'd try to find the minimal thing which needs to be done for an update (including building and packaging opam2web binaries), with the grand goal of saving resources (computation / network). But since you're convinced of the technology and stack in use, I won't argue against it.

avsm commented 1 year ago

@hannesm, removing Docker Hub from the equation is entirely in scope, especially as it will save resources and energy. It's a matter of a smooth transition of the infrastructure, and of time; OCurrent can easily wrap any dataflow. I'd welcome a simpler future infrastructure than the existing one.

hannesm commented 1 year ago

Again, it is two days behind. While scrolling through https://deploy.ci.ocaml.org/?repo=ocaml-opam/opam2web, I can find two "jobs" (please excuse me if you use other terminology) -- one being "ocurrent/opam.ocaml.org: live", the other "ocurrent/opam.ocaml.org: staging".

Somehow, one gets the "live" branch of opam2web, the other the "live-staging" branch -- both point to similar commits and diverge from the master branch (is this intended?).

Now, in their "log output" there's a lot of stuff, but I'm curious that the logs contain these lines:

Pushing "sha256:ccc1b6aa4f224fd9ee2dc4ce4140863e87d6c743e8e25c5f8b5b2e9612a2982c" to "ocurrentbuilder/staging:live-ocurrent-opam.ocaml.org-linux-x86_64" as user "ocurrentbuilder"
Pushing "sha256:24c3d50de483b79c3e24cd54c981059b4437b971b651bc3be51a5714b7984f90" to "ocurrentbuilder/staging:live-ocurrent-opam.ocaml.org-linux-x86_64" as user "ocurrentbuilder"

To me, as someone who doesn't know anything about Docker and Docker Hub, it looks like they're racing to push to the same tag remotely. Is this correct? I haven't had any luck figuring out what these "jobs" are actually supposed to do (apart from the graphical output, which lacks all the details).

Might it be possible, given the current pace of development of opam2web, to restrict these two "jobs" to a single one? I also have a hard time understanding where / what is getting deployed if both push the same tag and the only host in mind is "opam.ocaml.org" -- is there a "live" and a "staging" subdomain? Is it worth it?

Is it possible for you to hand out an executable POSIX shell script that condenses the steps taken when "there is a new commit to opam-repository"? I'd love to take a look at what is involved, to get a clearer picture of the carbon footprint. With "ocurrent" and some Docker scripting, I'm sure you can extract that. If not, a (single!) Dockerfile could be helpful as well.

Thanks for reading.

tmcgilchrist commented 1 year ago

@hannesm the build instructions are documented at https://github.com/ocaml-opam/opam2web#docker. What ocurrent is doing in this process is running that Docker build with the latest git versions of opam-repository and ocaml/platform-blog, and then deploying the result.

If you want to run it locally, use this command (note that each build argument needs its own --build-arg flag):

```shell
DOCKER_BUILDKIT=1 \
docker build -t opam2web -f Dockerfile . \
  --build-arg OPAM_GIT_SHA=42b392e634b2f2fc7e027070ccae412e55eba41b \
  --build-arg BLOG_GIT_SHA=356e7d2ea63d5945828b9c5421a007db125f1710
```

The build generates a large Docker image containing all the package documentation, which is what takes so long to build and triggers the timeouts you are seeing. The plan is to move everything to the ocaml.org documentation; then we can stop building that and just generate the opam index file, which will be much faster. That work is being done under https://github.com/ocaml/infrastructure/issues/26 cc @tmattio

In the meantime, the Docker layers in that Dockerfile could be optimised to avoid rebuilding, by making better use of cached layers. If you have some time and want to help with that, it would be appreciated.

Finally I've restarted the build and will keep an eye on it today.

hannesm commented 1 year ago

Thanks for the pointer. Unfortunately, there's no docker available on my operating system. I'm still confused by the Dockerfile you pointed to (so many FROM lines), and by how much it calls (including the bin/opam-web.sh script, which does yet another set of git clones and executes various other things).

So, good luck with that. From your message

with all the package documentation

do you mean the package index, as in https://opam.ocaml.org/packages/awa/, or is there other (API) documentation being built? Certainly I understand that the platform-blog and the opam documentation are put there.

Btw, do you have an idea why the following lines occur in the log output of both deployer jobs (as I mentioned above) -- and do both live and staging race for the same tag (do these contain the same data?)?

Pushing "sha256:ccc1b6aa4f224fd9ee2dc4ce4140863e87d6c743e8e25c5f8b5b2e9612a2982c" to "ocurrentbuilder/staging:live-ocurrent-opam.ocaml.org-linux-x86_64" as user "ocurrentbuilder"
Pushing "sha256:24c3d50de483b79c3e24cd54c981059b4437b971b651bc3be51a5714b7984f90" to "ocurrentbuilder/staging:live-ocurrent-opam.ocaml.org-linux-x86_64" as user "ocurrentbuilder"

tmcgilchrist commented 1 year ago

So, good luck with that.

Yeah, what can I say: it isn't optimal, and it was only supposed to be in place for a short time while a better solution was being developed.

Do you mean the package index, as in https://opam.ocaml.org/packages/awa/, or is there other (API) documentation being built? Certainly I understand that the platform-blog and the opam documentation is put there.

Yes, that is right: it builds all of https://opam.ocaml.org/packages/* for every package, plus the platform-blog and the opam documentation, as per your response. This will be resolved by https://github.com/ocaml/infrastructure/issues/26, which shouldn't be far away.

Do both live and staging race for the same tag (do these contain the same data?)?

They will be using different tags, so there is no race, but most of the data will be the same. This isn't worth fixing, since this whole setup will be replaced soon.

Briefly, on the deployment:

The extra docker pushes you're pointing to go to a staging Docker registry hosted locally on the machine for caching. Before you point out the obvious waste in pushing images around: the services deployed using docker service .. require the images to be on Docker Hub (why? for entirely non-technical reasons, from what I can determine). This docker service limitation should also be disappearing soon, as part of fixing the IPv6 accessibility of the OCaml infrastructure. More to come on that soon.
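The push-then-deploy flow described here could be sketched as a dry-run shell script. The image name, tag, and service name below are invented for illustration; by default the script only echoes the docker commands instead of running them.

```shell
#!/bin/sh
# Dry-run sketch of a push-then-deploy flow. All names are hypothetical.
# Set DOCKER=docker to actually execute; by default commands are echoed.
DOCKER="${DOCKER:-echo docker}"

IMAGE=opam2web
TAG=ocurrentbuilder/staging:live-ocurrent-opam.ocaml.org-linux-x86_64

$DOCKER build -t "$IMAGE" .     # build the (large) image locally
$DOCKER tag "$IMAGE" "$TAG"
$DOCKER push "$TAG"             # push to a registry: "docker service"
                                # can only deploy registry-hosted images
$DOCKER service update --image "$TAG" opam-live   # roll the running service
```

Running it as-is prints each docker command, which makes the shape of the pipeline (and the extra network round-trips) visible without touching any real service.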

hannesm commented 6 months ago

Thanks for your instructions. Still, I don't have any "docker" executable on my Unix operating system, so I'm out of luck trying anything in this regard. I still don't understand the setup and why it is so complex (and which bits are pushed around, and for what).

In any case, it seems like your solution is "wait until ocaml.org hosts the package stuff"; I don't have anything to contribute there. For what it's worth, there's still a huge delay from "someone merged a PR" to "it shows up on opam.ocaml.org" (> 20 hours). But the accumulated technical debt in your deployment seems set to be superseded (soon, or at least in a planned future) by some other piece of technology, which with luck may result in quicker updates -- though the ocaml.org package index and opam.ocaml.org/index.tgz may then be out of sync (but maybe that is not relevant for those maintaining "ocaml.org").