lab-cosmo / metatrain

Training and evaluating machine learning models for atomistic systems.
https://lab-cosmo.github.io/metatrain/
BSD 3-Clause "New" or "Revised" License

Distributing tests across multiple CPUs #252

Closed PicoCentauri closed 3 weeks ago

PicoCentauri commented 3 weeks ago

This uses pytest-xdist and parallelizes the data generation to speed up the tests (a little). They are still very slow, mainly because several BPNNs are trained in the tests. We should tackle this soon because running the whole test suite is really annoying. Most of the time currently goes into these tests:

34.89s call     tests/cli/test_train_model.py::test_command_line_override[architecture.training.num_epochs=2 architecture.training.batch_size=3]
31.17s call     tests/cli/test_train_model.py::test_empty_test_set
29.64s call     tests/cli/test_train_model.py::test_train[None]
26.82s call     tests/cli/test_train_model.py::test_continue
24.43s call     tests/cli/test_train_model.py::test_train_explicit_validation_test[True-True-2]
24.09s call     tests/cli/test_train_model.py::test_model_consistency_with_seed[experimental.soap_bpnn-1234]
22.73s call     tests/cli/test_train_model.py::test_train_explicit_validation_test[False-False-2]
21.52s call     tests/cli/test_train_model.py::test_continue_different_dataset
20.98s call     tests/cli/test_train_model.py::test_train_explicit_validation_test[True-True-1]
20.32s call     tests/cli/test_train_model.py::test_train_multiple_datasets
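A small sketch of how such a durations listing could be summed per test file, to see where the time goes. The regular expression and the `seconds_per_file` helper are illustrative assumptions, not part of metatrain or pytest; only the first three entries from the listing above are included to keep the example short.

```python
import re
from collections import defaultdict

# Excerpt of the listing above (first three entries only).
DURATIONS = """\
34.89s call     tests/cli/test_train_model.py::test_command_line_override[architecture.training.num_epochs=2 architecture.training.batch_size=3]
31.17s call     tests/cli/test_train_model.py::test_empty_test_set
29.64s call     tests/cli/test_train_model.py::test_train[None]
"""

# "<seconds>s <phase> <file>::<test id>" (assumed line layout, matching the excerpt).
LINE = re.compile(r"^(?P<seconds>\d+(?:\.\d+)?)s\s+(?P<phase>\w+)\s+(?P<path>[^:\s]+)::")


def seconds_per_file(report: str) -> dict[str, float]:
    """Sum the reported durations for each test file (hypothetical helper)."""
    totals = defaultdict(float)
    for line in report.splitlines():
        match = LINE.match(line.strip())
        if match:  # skip lines that are not duration entries
            totals[match["path"]] += float(match["seconds"])
    return dict(totals)


for path, total in seconds_per_file(DURATIONS).items():
    print(f"{path}: {total:.2f}s")
```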

I found the following timings on my machine with an Apple M2:

single thread

first run: 6m32.412s second run: 4m11.565s

threaded, 8 threads

first run: 3m38.036s second run: 2m15.751s

which is a speedup by a factor of roughly 1.8.
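The speedup can be checked directly from the wall-clock times reported above. The following is only an arithmetic sketch; `to_seconds` and `speedups` are hypothetical helper names introduced for illustration.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a stamp such as '6m32.412s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# (single thread, 8 threads), copied from the timings above.
TIMINGS = {
    "first run": ("6m32.412s", "3m38.036s"),
    "second run": ("4m11.565s", "2m15.751s"),
}


def speedups(timings):
    """Ratio of single-threaded to threaded wall-clock time per run."""
    return {
        label: to_seconds(single) / to_seconds(threaded)
        for label, (single, threaded) in timings.items()
    }


for label, factor in speedups(TIMINGS).items():
    print(f"{label}: {factor:.2f}x")
# first run: 1.80x
# second run: 1.85x
```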


📚 Documentation preview 📚: https://metatrain--252.org.readthedocs.build/en/252/

frostedoyster commented 3 weeks ago

Uhhhh of course it breaks the regression tests

In this case, I don't see how we can make this work: I don't know if we can control the order in which the tests are executed, or which test files are handled by the same worker (this will also vary depending on how many threads there are).

I would suggest executing the architecture tests serially (they're very light anyway).
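One way the serial/parallel split suggested above might look is two separate pytest invocations: one without pytest-xdist for the architecture tests, and one with `-n` for everything else. This is a sketch under assumptions: the directory names are placeholders and not metatrain's actual layout, and `build_commands` is a hypothetical helper. The example only prints the commands rather than running them.

```python
import shlex
import sys

# Placeholder paths (assumptions, not the repository's real layout).
ARCHITECTURE_TESTS = "tests/architectures"
OTHER_TESTS = "tests"


def build_commands(workers="auto"):
    """Return [serial command, parallel command] as argument lists."""
    # Architecture (regression) tests: plain pytest, one process, fixed order.
    serial = [sys.executable, "-m", "pytest", ARCHITECTURE_TESTS]
    # Everything else: distributed over workers with pytest-xdist's -n option.
    parallel = [
        sys.executable, "-m", "pytest",
        "-n", str(workers),
        OTHER_TESTS,
        f"--ignore={ARCHITECTURE_TESTS}",
    ]
    return [serial, parallel]


for command in build_commands():
    # Pass each list to subprocess.run(command, check=True) to actually run it.
    print(shlex.join(command))
```

pytest-xdist also provides `--dist loadfile`, which keeps all tests from one file on the same worker, and `--dist loadgroup` together with the `xdist_group` marker. Whether either would be enough to keep the regression tests passing is not established in this thread.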