Open kratman opened 4 weeks ago
cc @jsbrittain who could help us with this. I had a macOS device running macOS 13, but I couldn't reproduce the crash when running the relevant tests from the test suite locally (ran them a few times). I updated it to macOS 14 today, so I don't know if we can reliably reproduce this outside of a CI runner.
@agriyakhetarpal just a very quick comment for the time being: as noted above, there are multiple components that need to be upgraded together, but this has to be done carefully. I actually took a look at this a couple of months back in an effort to get rid of the demotion code (IREE only runs at 32-bit precision at the present). Notes from that investigation are here and may prove helpful: https://github.com/jsbrittain/PyBaMM/issues/4
Some follow-up exploratory notes. I tested upgrading to the latest Jax/Jaxlib (0.4.34) and corresponding iree-compiler (1036) with relaxed version constraints on everything apart from Windows. I don't have a macOS-13 machine but using gh-runners produced a slightly different set of results to your previous run (see https://github.com/jsbrittain/PyBaMM/actions/runs/11391403398/job/31695386575?pr=5):
ubuntu 3.9 jax incompatible 3.10-3.12 operations cancelled
macos-13 3.9 jax incompatible 3.10 iree incompatible (only 3.11, 3.12 build available) 3.11 worker crashes 3.12 worker crashes
macos-14 3.9 jax incompatible 3.10 iree incompatible (only 3.11, 3.12 build available) 3.11 worker crashes 3.12 worker crashes
windows 3.9 jax incompatible 3.10-3.12 pass
Looking into the available wheels, here are the current offerings: jax (0.4.34): manylinux,macos[10+,11+],win (all for python 3.10-3.13) iree-compiler (1036): manylinux(py3.9-3.12), macos(13+)(py3.11-3.12), win(py3.11-3.12)
Re the macos worker crashes (all of which seem to occur on TestIDAKLUSolver), there are a few blogs suggesting that pytest-xdist
(parallel testing) is associated with the 'node down: Not properly terminated' error that we are seeing when run with certain libraries, although its not clear exactly what is causing it. Curiously, my run crashed workers on both macos-13 and macos-14, so it seems to impact both, but may be less frequent on macos-14, or simply fragile to particular jax/iree/runner combinations. I have seen local/inline imports mentioned as problematic (as are used in places in pybamm). Some context here: https://github.com/pytest-dev/pytest/issues/ 3216.
On a slightly more positive note the jax/iree combination tested compiles and successfully solves locally (which is much more promising than my experience a few months back: https://github.com/jsbrittain/PyBaMM/issues/4). So, this may principally be a test problem.
I had a quick chat to @martinjrobins and - given the dependency issues - it may be an option to simply disable IREE for the time being. It is only enabled by building with specific user-configurable flags at present (PYBAMM_IDAKLU_EXPR_IREE=OFF by default) as it is still slow compared to casadi; in-fact, it is only being enabled in tests to maintain the functionality until it becomes more useable, but is not built into the wheels at all.
Thanks for the details, @jsbrittain! Yes, that sounds like a good plan, and IREE has not been enabled yet for https://github.com/pybamm-team/pybammsolvers either. Though, another option could be to keep testing the IREE code, and instruct pytest-xdist
to run the related tests in "serial mode" for now; that way, we still prevent the failures from creeping in and still have it in CI in case something is modified later – this could help make the task of updating to the JAX 0.4.34 + iree-compiler 1036 combo easier later on, what do you think?
Adding a decorator @pytest.mark.xdist_group(name="iree")
will run all tests decorated with it on the same worker.
@jsbrittain, @agriyakhetarpal I think it might be better to remove IREE functionality for now instead of just disabling it and bring it back in the future when things improve. At the very least it should just be disabled for MacOS since allowing it for only a single python version makes it difficult to use. Python 3.9 reaches end of life about 1 year from now, so that part of the Jax-IREE complexity will go away when we remove support for 3.9
We can always bring back the IREE interface in the future or dig into if it is possible to build a "bring your own solver" plugin sort of architecture
@jsbrittain Of course I don't want to hinder any research that requires IREE functionality. Is there anything critical that depends on it?
@jsbrittain Of course I don't want to hinder any research that requires IREE functionality. Is there anything critical that depends on it?
@kratman None that I am aware of. The MLIR/IREE approach has the potential to offer fast and flexibilty expression evaluation, capable of lowering onto numerous architectures (including GPU), but it has not achieved that potential yet; at present we are working around problems (e.g. precision and compatibility) that may be resolved with more mature/stable releases.
Disabling IREE on MacOS was done in #4528, but it is still available on Linux
While removing the deprecated MacOS-12 runners for GitHub actions, I ran into issues with IREE. The comments suggest that it should work with MacOS version 13, but the tests fail. It seems to work fine one MacOS-14 and Linux.
As a temporary fix, I suggested changed the version in PR #4520
A few other things I noticed while looking into the failures:
utils.py
andpyproject.toml
noxfile.py
andpyproject.toml