pybamm-team / PyBaMM

Fast and flexible physics-based battery models in Python
https://www.pybamm.org/
BSD 3-Clause "New" or "Revised" License
1.11k stars 547 forks source link

Jax-IREE not working on MacOS-13 #4521

Open kratman opened 4 weeks ago

kratman commented 4 weeks ago

While removing the deprecated MacOS-12 runners for GitHub actions, I ran into issues with IREE. The comments suggest that it should work with MacOS version 13, but the tests fail. It seems to work fine one MacOS-14 and Linux.

As a temporary fix, I suggested changed the version in PR #4520

            if (not sys.version_info[:2] == (3, 11)) or mac_ver < 14:
                warnings.warn(
                    (
                        "IREE is only supported on MacOS 13 (or higher) and Python"
                        "version 3.11. Setting PYBAMM_IDAKLU_EXPR_IREE=OFF."
                    ),
                    stacklevel=2,
                )
                return "OFF"

A few other things I noticed while looking into the failures:

agriyakhetarpal commented 4 weeks ago

cc @jsbrittain who could help us with this. I had a macOS device running macOS 13, but I couldn't reproduce the crash when running the relevant tests from the test suite locally (ran them a few times). I updated it to macOS 14 today, so I don't know if we can reliably reproduce this outside of a CI runner.

jsbrittain commented 4 weeks ago

@agriyakhetarpal just a very quick comment for the time being: as noted above, there are multiple components that need to be upgraded together, but this has to be done carefully. I actually took a look at this a couple of months back in an effort to get rid of the demotion code (IREE only runs at 32-bit precision at the present). Notes from that investigation are here and may prove helpful: https://github.com/jsbrittain/PyBaMM/issues/4

jsbrittain commented 3 weeks ago

Some follow-up exploratory notes. I tested upgrading to the latest Jax/Jaxlib (0.4.34) and corresponding iree-compiler (1036) with relaxed version constraints on everything apart from Windows. I don't have a macOS-13 machine but using gh-runners produced a slightly different set of results to your previous run (see https://github.com/jsbrittain/PyBaMM/actions/runs/11391403398/job/31695386575?pr=5):

ubuntu 3.9 jax incompatible 3.10-3.12 operations cancelled

macos-13 3.9 jax incompatible 3.10 iree incompatible (only 3.11, 3.12 build available) 3.11 worker crashes 3.12 worker crashes

macos-14 3.9 jax incompatible 3.10 iree incompatible (only 3.11, 3.12 build available) 3.11 worker crashes 3.12 worker crashes

windows 3.9 jax incompatible 3.10-3.12 pass

Looking into the available wheels, here are the current offerings: jax (0.4.34): manylinux,macos[10+,11+],win (all for python 3.10-3.13) iree-compiler (1036): manylinux(py3.9-3.12), macos(13+)(py3.11-3.12), win(py3.11-3.12)

Re the macos worker crashes (all of which seem to occur on TestIDAKLUSolver), there are a few blogs suggesting that pytest-xdist (parallel testing) is associated with the 'node down: Not properly terminated' error that we are seeing when run with certain libraries, although its not clear exactly what is causing it. Curiously, my run crashed workers on both macos-13 and macos-14, so it seems to impact both, but may be less frequent on macos-14, or simply fragile to particular jax/iree/runner combinations. I have seen local/inline imports mentioned as problematic (as are used in places in pybamm). Some context here: https://github.com/pytest-dev/pytest/issues/ 3216.

On a slightly more positive note the jax/iree combination tested compiles and successfully solves locally (which is much more promising than my experience a few months back: https://github.com/jsbrittain/PyBaMM/issues/4). So, this may principally be a test problem.

I had a quick chat to @martinjrobins and - given the dependency issues - it may be an option to simply disable IREE for the time being. It is only enabled by building with specific user-configurable flags at present (PYBAMM_IDAKLU_EXPR_IREE=OFF by default) as it is still slow compared to casadi; in-fact, it is only being enabled in tests to maintain the functionality until it becomes more useable, but is not built into the wheels at all.

agriyakhetarpal commented 3 weeks ago

Thanks for the details, @jsbrittain! Yes, that sounds like a good plan, and IREE has not been enabled yet for https://github.com/pybamm-team/pybammsolvers either. Though, another option could be to keep testing the IREE code, and instruct pytest-xdist to run the related tests in "serial mode" for now; that way, we still prevent the failures from creeping in and still have it in CI in case something is modified later – this could help make the task of updating to the JAX 0.4.34 + iree-compiler 1036 combo easier later on, what do you think?

Adding a decorator @pytest.mark.xdist_group(name="iree") will run all tests decorated with it on the same worker.

kratman commented 3 weeks ago

@jsbrittain, @agriyakhetarpal I think it might be better to remove IREE functionality for now instead of just disabling it and bring it back in the future when things improve. At the very least it should just be disabled for MacOS since allowing it for only a single python version makes it difficult to use. Python 3.9 reaches end of life about 1 year from now, so that part of the Jax-IREE complexity will go away when we remove support for 3.9

Reasons for removal:

We can always bring back the IREE interface in the future or dig into if it is possible to build a "bring your own solver" plugin sort of architecture

kratman commented 3 weeks ago

@jsbrittain Of course I don't want to hinder any research that requires IREE functionality. Is there anything critical that depends on it?

jsbrittain commented 3 weeks ago

@jsbrittain Of course I don't want to hinder any research that requires IREE functionality. Is there anything critical that depends on it?

@kratman None that I am aware of. The MLIR/IREE approach has the potential to offer fast and flexibilty expression evaluation, capable of lowering onto numerous architectures (including GPU), but it has not achieved that potential yet; at present we are working around problems (e.g. precision and compatibility) that may be resolved with more mature/stable releases.

kratman commented 3 weeks ago

Disabling IREE on MacOS was done in #4528, but it is still available on Linux