luigibonati / mlcolvar

A unified framework for machine learning collective variables for enhanced sampling simulations
MIT License

CI is failing due to an out-of-memory issue #128

Closed · luigibonati closed this issue 4 months ago

luigibonati commented 4 months ago

CI on macOS started failing due to an out-of-memory error. GitHub started running the test environment on the new M1/M2 MacBooks (MPS is the backend for their integrated GPU), and Lightning tries to use it as a GPU, but it only has 8 GB.

```
E RuntimeError: MPS backend out of memory (MPS allocated: 0 bytes, other allocations: 0 bytes, max allowed: 7.93 GB). Tried to allocate 256 bytes on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).

../../../micromamba/envs/test/lib/python3.9/site-packages/torch/nn/modules/module.py:1158: RuntimeError
----------------------------- Captured stderr call -----------------------------
INFO: GPU available: True (mps), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO: IPU available: False, using: 0 IPUs
INFO: HPU available: False, using: 0 HPUs
------------------------------ Captured log call -------------------------------
INFO lightning.pytorch.utilities.rank_zero:setup.py:156 GPU available: True (mps), used: True
INFO lightning.pytorch.utilities.rank_zero:setup.py:159 TPU available: False, using: 0 TPU cores
INFO lightning.pytorch.utilities.rank_zero:setup.py:169 IPU available: False, using: 0 IPUs
INFO lightning.pytorch.utilities.rank_zero:setup.py:179 HPU available: False, using: 0 HPUs
```

Some possible solutions (one workaround is sketched below):

By the way, I suspect that 8 GB should be enough for running the notebooks, so probably not all of it is usable.

@andrrizzi what do you think?
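
As a rough illustration of the kind of workaround mentioned above (a sketch only, not mlcolvar's actual test setup): the tests could either lift the MPS allocation cap via the `PYTORCH_MPS_HIGH_WATERMARK_RATIO` variable suggested in the traceback, or simply keep Lightning on the CPU. Assuming a pytest-based suite with a `conftest.py`, and an illustrative fixture name:

```python
# conftest.py (sketch): keep CI tests off the small 8 GB MPS device that
# Lightning auto-detects on the new GitHub macOS runners.
import os

import pytest
import torch

# Option 1: lift the MPS allocation cap, as suggested by the PyTorch error message.
# This has to be set before the first MPS allocation happens.
os.environ.setdefault("PYTORCH_MPS_HIGH_WATERMARK_RATIO", "0.0")


@pytest.fixture
def accelerator():
    """Option 2: force the CPU accelerator whenever MPS is the only 'GPU' around."""
    return "cpu" if torch.backends.mps.is_available() else "auto"
```

A test would then pass this value to `lightning.pytorch.Trainer(accelerator=accelerator, ...)`. Whether either option is preferable to pinning the runner image is the trade-off discussed below.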

andrrizzi commented 4 months ago

Since the M1 runners on GitHub are very new, there is also the possibility that one or more dependencies are not quite ready for the new CPUs.

For the short term, we could think of testing on macos-13 or macos-12 instead of macos-latest.

If the M1 CPUs still cause problems in a few weeks, we might need to dig into the tests and see which ones are consuming so much memory, or whether there's a memory leak somewhere.
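
For that kind of digging, one option (just a sketch, assuming PyTorch 2.x where `torch.mps.current_allocated_memory()` is available; the fixture below is illustrative, not part of the mlcolvar test suite) would be an autouse pytest fixture that reports MPS memory after each test:

```python
# conftest.py (sketch): report per-test MPS memory to spot a hungry test or a leak.
import pytest
import torch


@pytest.fixture(autouse=True)
def report_mps_memory(request):
    """Print how much MPS memory is still allocated after each test."""
    yield
    if torch.backends.mps.is_available():
        allocated_mb = torch.mps.current_allocated_memory() / 1024**2
        print(f"[mps] {request.node.nodeid}: {allocated_mb:.1f} MB allocated after test")
```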

EnricoTrizio commented 4 months ago

I tried to use the latest version of PyTorch, as they apparently addressed the problem, but it didn't change much. For now, I would test on macos-13 and leave an open issue as a reminder to switch back in the future, once the macOS 14 runners become stable again.