ocurrent / ocaml-ci

A CI for OCaml projects
https://ocaml.ci.dev
MIT License

Ruling out `pthread_cond_signal` failure to wake up `pthread_cond_wait` #963

Open polytypic opened 2 months ago

polytypic commented 2 months ago

There is a known glibc bug that can cause a `pthread_cond_signal` to fail to wake up a `pthread_cond_wait`. The OCaml runtime and the libraries that ship with OCaml use these functions and are known to be affected by this bug (search for "OCaml" in the issue).

I occasionally observe some multicore OCaml code I'm developing locking up (a test hangs and is then killed after an hour) on some of the machines (e.g. Debian 12 ARM with OCaml 4.14). This kind of symptom could be explained by the `pthread_cond_signal` bug, but it could also point to an issue in my code. Knowing that it cannot be the `pthread_cond_signal` bug would help a lot.

It would be great if we could make sure that all the OCaml CI machines have this bug fixed/patched. This way people working on multicore OCaml stuff could perhaps sleep a little better. :sweat_smile:

talex5 commented 2 months ago

I've seen that bug locally, but not in CI as far as I can recall: https://github.com/ocaml-multicore/eio/issues/700

https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1899800/comments/5 says:

This bug was fixed in the package glibc - 2.32-0ubuntu5

polytypic commented 2 months ago

> https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1899800/comments/5 says:
>
> This bug was fixed in the package glibc - 2.32-0ubuntu5

IIUC, that applies only to glibc on Ubuntu (?) and the version of glibc on Debian, for example, does not (necessarily) have the fix.

The lockups I've observed have mostly happened on Debian, and mostly on the ARM machines with OCaml 4.14, but I do recall seeing lockups on other Debian machines. Assuming that the cause is the pthread_cond_signal bug and that it is not fixed on Debian would match my (recollection of my) observations. Unfortunately, I have not kept a record of all such lockups.

For background, I had some tests that used systhreads extensively in OCaml 4 and those locked up occasionally. I have since then reduced the use of systhreads on OCaml 4 and the lockups seem to happen less frequently, but I still see them.
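
For keeping a record of such lockups, here is a minimal sketch (not part of ocaml-ci; the helper name `count_hangs` and the chosen timeout values are made up for illustration). It reruns a test command with a short timeout and counts the runs that had to be killed, instead of waiting for the one-hour limit:

```python
import subprocess
import sys


def count_hangs(cmd, runs, timeout_s):
    """Run `cmd` `runs` times and return how many runs exceeded `timeout_s` seconds.

    Hypothetical helper: a timeout only says the process did not finish in time,
    not *why* it hung, so a hit still has to be investigated separately.
    """
    hangs = 0
    for _ in range(runs):
        try:
            subprocess.run(
                cmd,
                timeout=timeout_s,
                check=False,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child process before raising.
            hangs += 1
    return hangs


if __name__ == "__main__":
    # Demo with a command that is certain to outlive the timeout.
    demo = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(count_hangs(demo, runs=1, timeout_s=0.5))  # prints 1
```

In practice the demo command would be replaced by the real test executable, and the count logged together with the distro, architecture, and OCaml version of the machine.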

edwintorok commented 2 months ago

I think the only way to apply the fix is to rebuild libc (which sort of defeats the purpose of testing code on a given distro if we then fundamentally change it), and it appears that Ubuntu is the only distro that patched it.

That is a bit disappointing, given that the issue has been open since 2020 and the fix even got a TLA+ proof in 2023 (although that is not the fix that Ubuntu applied; I think Ubuntu only applied the one-line workaround, not the more complicated fix).

Although if your distro is old enough to have a libc older than 2.27 then you're not affected.

Maybe we could convince Debian to take the same patch that Ubuntu has, at least until upstream glibc gets around to reviewing and applying the patches?

edwintorok commented 2 months ago

Could we perhaps switch the non-x86_64 builders to Ubuntu though? That should give us more coverage on other architectures when looking for bugs in the OCaml runtime or multicore libraries, while avoiding the known libc bug.

mtelvers commented 2 months ago

The workers themselves are all running Ubuntu. When you see something like debian-12-4.14_s390x_opam-2.2 this is a runc container running Debian 12 running on top of an Ubuntu s390x machine. We could test ubuntu-24.04-* instead of debian-12-*. However, Debian supports more architectures generally, e.g. arm32v7, i386... except for RISCV64, which Ubuntu has.

edwintorok commented 2 months ago

I think for the bug what matters is the version of glibc, i.e. the version of the container, not the host OS.

Although there are quite a few Ubuntu docker images for other architectures: https://hub.docker.com/r/i386/ubuntu https://hub.docker.com/u/arm32v7 ... (you can find them more easily at https://doi-janky.infosiftr.net/job/multiarch/, or following the links from https://github.com/docker-library/official-images?tab=readme-ov-file#architectures-other-than-amd64)
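
Since what matters is the glibc inside the container, a quick check can be run there. The following is a rough sketch only: the 2.27 threshold is the one mentioned earlier in this thread, the function names are made up for illustration, and an upstream version number cannot reveal whether a distro has backported a patch (as Ubuntu did in 2.32-0ubuntu5).

```python
import platform
import re

# Per this thread: glibc older than 2.27 is not affected.
FIRST_AFFECTED = (2, 27)


def parse_glibc_version(text):
    """Return (major, minor) from a string such as '2.36' or 'glibc 2.36', else None."""
    match = re.search(r"(\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def may_be_affected(version):
    """True if `version` is in the range this thread describes as affected.

    This is only a hint: it cannot see distro-specific patches.
    """
    return version is not None and version >= FIRST_AFFECTED


if __name__ == "__main__":
    # platform.libc_ver() returns e.g. ('glibc', '2.36'), or empty strings
    # when it cannot tell (for example on non-glibc systems).
    lib, ver = platform.libc_ver()
    parsed = parse_glibc_version(ver) if lib == "glibc" else None
    print(lib or "unknown libc", ver or "?", "possibly affected:", may_be_affected(parsed))
```

Run inside a `debian-12-*` container and inside an `ubuntu-24.04-*` container, this reports the container's libc rather than the host's.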

shonfeder commented 2 months ago

This seems right to me:

> which sort of defeats the purpose of testing code on a given distro if we then fundamentally change it

so, iiuc, the upshot is that there isn't a fix we can reasonably do in ocaml-ci for this. But could it be that some of the multicore-specific CIs may want tweaks to provide a testing environment for your purposes that doesn't produce noise from flawed standard dependencies @polytypic?

polytypic commented 2 months ago

It is an interesting situation.

I read through the thread here and, IIUC, there is a mention that the "mitigation" patch used in Ubuntu still has issues.

So, to put it a bit provocatively, at the moment, Linux and OCaml are incompatible.

I don't think that using a version of glibc with this one bug fixed would completely defeat the purpose of testing on a given Linux distribution: pthread_cond_signal is an infinitesimal detail of a Linux distribution. I would agree that it would defeat the purpose if the bug were never going to be fixed; in that case OCaml itself should be changed not to use the broken pthread_cond_signal. But I would hope and assume that it will eventually be fixed, and after that we wouldn't need a patch.

edwintorok commented 2 months ago

Would musl libc be a useful option? It wouldn't require manually patching or rebuilding packages, and it would still be an "official" package from a distribution, e.g. Debian has one. This assumes that musl doesn't have the same bug as glibc, so I'll have to do some tests. I was able to reproduce the glibc bug a while ago on my machine; I'll have to see whether I still can, and then run the equivalent test on musl too.