pybamm-team / PyBaMM

Fast and flexible physics-based battery models in Python
https://www.pybamm.org/
BSD 3-Clause "New" or "Revised" License
1.12k stars 547 forks source link

Fix IDAKLU solver crash when running in multiple threads + when telemetry is prompted for #4583

Open agriyakhetarpal opened 1 week ago

agriyakhetarpal commented 1 week ago

Description

This came up as an unrelated issue in #4582 when testing the wheels and apparently stems from #4441 where a thread was being used to check for the input. This caused the Linux wheel tests to fail: https://github.com/agriyakhetarpal/PyBaMM/actions/runs/11805727556/job/32890103930. This PR redoes that with a multi-platform approach and avoids the use of threads altogether by checking for different environments: Windows via msvcrt and Linux/macOS via termios.

~A possible corner case for Jupyter Notebooks (running in VS Code versus Jupyter Notebook/JupyterLab) has also been fixed.~ I've resorted to a more rudimentary fix using the previous method for now, since this fix was erroneous and did not work intermittently.

Type of change

Please add a line in the relevant section of CHANGELOG.md to document the change (include PR #) - note reverse order of PR #s. If necessary, also add to the list of breaking changes.

Key checklist:

You can run integration tests, unit tests, and doctests together at once, using $ python run-tests.py --quick (or $ nox -s quick).

Further checks:

agriyakhetarpal commented 1 week ago

This approach might seem overkill but this is what came to mind first and seems like the most robust way to me, so I'd like to test this a bit more – I've converted this to a draft for now. I triggered a wheel build here, let's see if it passes: https://github.com/agriyakhetarpal/PyBaMM/actions/runs/11808256997

I've confirmed the fix locally with a fresh PyBaMM installation (i.e., with the pybamm/config.yml file not present in my Application Support folder) every time before importing PyBaMM:

across the following scenarios:

codecov[bot] commented 1 week ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 99.26%. Comparing base (4223be8) to head (5495814). Report is 1 commits behind head on develop.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## develop #4583 +/- ## =========================================== - Coverage 99.26% 99.26% -0.01% =========================================== Files 302 302 Lines 22889 22866 -23 =========================================== - Hits 22721 22698 -23 Misses 168 168 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

agriyakhetarpal commented 1 week ago

The buttons aren't working in the notebooks, actually – I'm quite sure they were working when I was testing earlier, so I guess I made some additional changes which made them break. I'll debug later in the day, and I'll consider dropping the buttons in favour of a simple input in case I don't get them to work.

agriyakhetarpal commented 1 week ago

I switched to a more basic implementation for Jupyter notebooks, and I've seen bug reports about how inputs don't work properly with JupyterLab <4, but I don't think (too) many people are using version 3 now, so we should be good.

agriyakhetarpal commented 1 week ago

Additionally, I tested on Google Colab by installing PyBaMM in editable mode from my branch – we should be ready to go with this. Suggestions on cleaning up the code are welcome. The only place I haven't been able to test is Windows, because I don't have a Windows machine.

agriyakhetarpal commented 1 week ago

Linux wheel builds passing: https://github.com/agriyakhetarpal/PyBaMM/actions/runs/11819747790/job/32930388878

kratman commented 1 week ago

Ok I will take a close look at this in an hour or so

agriyakhetarpal commented 1 week ago

Continuing from my question above: though I pushed e4859dcff274d584ce31b31cb33e3217f302799f which fixes the failing test, it also changes the behaviour on pressing "Enter" to enable telemetry – which, to me, does not seem like good UX because it can cause an inadvertent "yes" from a user. It might be better to be explicit rather than implicit: either allow only a "yes" or a "no" in the response and ask them to enter their response again if it's empty, or let "Enter" disable telemetry rather than enabling it). I understand that it would be good for us to get a reasonable amount of telemetry data from as many users as we can, but we should also be conscientious with our approach. @valentinsulzer, do you have any thoughts? I might have missed this when reviewing #4441. That said, if prompts in other applications do regard pressing Enter/an empty input as a "yes", then this approach is completely fine with me.

kratman commented 3 days ago

Crashes should be resolved. I could not reproduce any of the notebook issues

agriyakhetarpal commented 3 days ago

I'd say that this PR could still be useful because it avoids using threading altogether for telemetry. Could you please re-review it? I was under the assumption that #4591 would go in after this PR went in.

kratman commented 2 days ago

@agriyakhetarpal I built the wheel with develop and did some testing on Windows.

A few things I noticed:

  1. The tests in CI and wheel build all pass: https://github.com/kratman/PyBaMM/pull/5
  2. Some tests fail if I try to run the tests locally with the wheel, but the same tests fail when I repeat the process for the v24.9.0 release. The failing tests all appear to be from Jax
  3. Forcing posthog to be active during tests does show posthog in some of the tracebacks, but they appear to be some of the same tests that I saw failing in (2).
  4. When I forced posthog to run, I also disabled the pybamm.config.generate() function, so any remaining problems are likely in the internals of posthog, not from the use of threading while generating the config.
  5. The timeout for creating the config is likely too short. With the slow import of a fresh PyBaMM install, it is not hard to miss the window for a 10 second opt-in.

The fact that tests are failing locally on windows bothers me, but it looks like some of this was already present and not caught by all of our tests. It is also possible that the ~45 failing Jax tests are due to issues with my Windows setup since the tests pass in CI. I won't know for sure until I do additional testing for the release.

I will continue investigating the failures on windows, but I don't think removing the threading or adding extra logic to text output is needed at the moment. The changes to the installation guide, noxfile.py, and the pyproject.toml look good.

Next steps:

agriyakhetarpal commented 2 days ago

Just a quick comment for now; I'll get back to the rest in a while: @kratman, if you can't reproduce the Windows crashes in CI, could you also check how old the machine that you are using locally is? Some JAX operations might crash or simply not work on very old machines because of the lack of AVX512 instructions, so maybe this is indeed just something that is for your setup and should be fine for slightly newer Windows computers.

kratman commented 2 days ago

Some JAX operations might crash or simply not work on very old machines because of the lack of AVX512 instructions, so maybe this is indeed just something that is for your setup and should be fine for slightly newer Windows computers.

This is not a particularly old machine. It is an i5-1235U (2022), which is the Alder Lake group of chips that do not have AVX-512, it does have AVX and AVX2 though