ocaml-multicore / domainslib

Parallel Programming over Domains
ISC License
172 stars 30 forks source link

Adjust PBTs based on recommended_domain_count #112

Closed jmid closed 1 year ago

jmid commented 1 year ago

I noticed the CI run of #110 had 32-bit Failure("failed to allocate domain") in test/task_one_dep.ml

I believe this is because the test does not take into account the available cores on the target machine. This PR therefore adjusts the test to do so.

I hope this is sufficient to please the (32-bit) CI so that we can avoid disabling certain platforms like #111 proposes :crossed_fingers:

jmid commented 1 year ago

On the positive side, this roughly halves the Windows CI time. On the negative side there was still a "failed to allocate domain" error... In principle we could just adjust the tested property to make this acceptable behaviour.

I believe the 5.0 x86_32-bit ocaml-ci error is due to using (modes native) - which will not be supported. Since dune should choose the "best mode available" simply deleting these lines should be sufficient to avoid the error. However I'm wondering whether the native lines were added to disable the tests on non-native platforms? :thinking:

jmid commented 1 year ago

I believe the 5.0 x86_32-bit ocaml-ci error is due to using (modes native) - which will not be supported.

I now see 5.0-ppc64 and 5.0-s390x are going to fail for the same reason. Perhaps the proper fix would be to disable the tests (not the package) on bytecode-only architectures?

Sudha247 commented 1 year ago

Thanks for taking a look at this @jmid!

However I'm wondering whether the native lines were added to disable the tests on non-native platforms?

I think native modes were added for performance reasons (@kayceesrk can correct me if I'm wrong) - it should be okay to let dune decide what to run. Would that fix the failures on bytecode-only platforms?

kayceesrk commented 1 year ago

I think native modes were added for performance reasons (@kayceesrk can correct me if I'm wrong) - it should be okay to let dune decide what to run. Would that fix the failures on bytecode-only platforms?

I don't think it is useful to run performance-sensitive workloads in bytecode mode. And, domainslib is for performance. I can't imagine bytecode runs for anything using domainslib being useful.

kayceesrk commented 1 year ago

That said, if you just want compatibility, you can remove to (modes native).

jmid commented 1 year ago

Would that fix the failures on bytecode-only platforms?

It would at least fix the Build failed errors due to Error: No rule found for test/off_by_one.exe such as the ones in https://ocaml.ci.dev/github/ocaml-multicore/domainslib/commit/480029b010decd3e913cfcb3c9c2669327a4b9b3/variant/debian-11-5.0_s390x_opam-2.1

Now, with the s390x native backend just merged yesterday, that will save us from having to roll a new Domainslib release to support this reestablished native platform when it get released with 5.1 or 5.2.

It will not fix the Failed to allocate domain error that we are still seeing. I suspect this might be caused by dune runtest running parallel tests in parallel :thinking:

There is something odd to me with all tests except the QCheck ones disabled in bytecode mode. Disabling all the tests (not the package) on bytecode-only architectures strikes me as more consistent and future proof. It should be possible to do so in the opam-file.

Sudha247 commented 1 year ago

It will not fix the Failed to allocate domain error that we are still seeing. I suspect this might be caused by dune runtest running parallel tests in parallel

Hm, possibly. What you mean is running parallel tests in parallel ends up spawning more than the limit - wondering if this could happen in native runs too?

Disabling all the tests (not the package) on bytecode-only architectures strikes me as more consistent and future proof.

I'm not sure why we should support architectures where we know the testsuite doesn't go through. Is it for the sake of completeness or am I missing something?

kayceesrk commented 1 year ago

If there is an easy way to disable 32-bit builds, ppc64, s390x in the ocaml-ci, we should do it. s390x will land in 5.1. ppc64 sometime in the future.

Sudha247 commented 1 year ago

If there is an easy way to disable 32-bit builds, ppc64, s390x in the ocaml-ci, we should do it.

111 does it - but in a restrictive way by making the package unavailable in all those platforms. I'm open to considering a less restrictive way to do it, if that's beneficial.

jmid commented 1 year ago

Great, I think fewer restrictions is key to enable broader adoption :+1:

I've made 3 changes:

Let's see how this fares on ocaml-ci... :popcorn:

jmid commented 1 year ago

CI summary (while I wait for the ppc64 run to complete):

jmid commented 1 year ago

The ppc64 run ended up failing with a time out, so 0637fb1 also disables the two long running PBTs on ppc64 (64 bit bytecode for the moment).

07ca1ff adjusts test/off_by_one to accept a Failure "failed to allocate domain".

This leaves the backtrace option failure and the decomposition assertion failure.

jmid commented 1 year ago

After confirming locally I've now opened an issue for the LU_decomposition assertion failure on arm32 here: https://github.com/ocaml/ocaml/issues/12267

jmid commented 1 year ago

I would welcome input on the backtrace (bytecode) failure, where it seems Printexc.get_raw_backtrace returns None.

Sudha247 commented 1 year ago

We don't know yet if the backtrace test failure is a compiler bug. We have an issue for the LU decomposition failure. Anyways, I think we can disable backtrace (and open an issue if needed) and LU_decompositon tests on bytecode-only platforms and merge this PR.

jmid commented 1 year ago

I've now dug a bit further and it seems to be a dune issue on bytecode only switches: https://github.com/ocaml/dune/issues/7845 I've also disabled the backtrace test on bytecode platforms in b10075a.

Let's see how this fares on the CI :popcorn:

Sudha247 commented 1 year ago

The CI is happy finally! Thanks for the work, Jan. Merging this.