lab-cosmo / metatrain

Training and evaluating machine learning models for atomistic systems.
https://lab-cosmo.github.io/metatrain/

Dramatic overhead of metatrain for MD in LAMMPS #274

Open spozdn opened 1 week ago

spozdn commented 1 week ago

As investigated with @DavideTisi, for his system the intrinsic time of PET (energies and forces) is 7.8e-5 seconds/atom on a V100 GPU on IZAR. On the same node, the time in LAMMPS is about 3.1e-3 seconds/atom, so MD in LAMMPS is about 40 (!) times slower.
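For reference, a minimal sketch of how such an "intrinsic" per-atom timing can be obtained, assuming a generic PyTorch model and a pre-built input batch; `model` and `batch` below are placeholders, not metatrain API. Explicit CUDA synchronisation is needed so that asynchronous kernel launches do not distort the measurement.

```python
import time
import torch

def seconds_per_atom(model, batch, n_atoms, n_repeats=100):
    # warm-up: let CUDA kernels compile and caches fill before timing
    for _ in range(10):
        energy = model(batch)
        torch.autograd.grad(energy, batch["positions"])
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(n_repeats):
        energy = model(batch)
        # forces come from the backward pass w.r.t. the positions
        torch.autograd.grad(energy, batch["positions"])
    torch.cuda.synchronize()

    return (time.perf_counter() - start) / (n_repeats * n_atoms)
```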

@abmazitov last time we touched this, I got the impression that the current overhead of LAMMPS is about 1.5x, 2x at most, not 40x, when using https://github.com/spozdn/pet/blob/neighbors_convert_cpp/src/neighbors_convert.cpp. @DavideTisi, though, told me that he was using the right versions of both PET and metatrain. So, @abmazitov, could you take a look at this?

Update: the number of atoms in the supercell is 960.

DavideTisi commented 1 week ago

Just to add to this: I tried to "profile" the LAMMPS interface with print statements. Most of the time (~2.2 seconds per iteration) is spent in line 502 of pair_metatensor.cpp, which is the Torch backward call. According to @spozdn and @Luthaf, the culprit is the complexity of the computational graph.
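One caveat with print-based timing around CUDA code, independent of the actual bottleneck: kernel launches are asynchronous, so wall-clock time tends to pile up at the first call that forces a synchronisation, which is often the backward. A small sketch of separating forward and backward cost with explicit synchronisation; `model` and `batch` are placeholders, not the actual metatrain/LAMMPS objects.

```python
import time
import torch

def split_forward_backward(model, batch):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    energy = model(batch)
    torch.cuda.synchronize()
    t1 = time.perf_counter()
    torch.autograd.grad(energy, batch["positions"])
    torch.cuda.synchronize()
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1  # forward time, backward time
```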

abmazitov commented 1 week ago

@spozdn @DavideTisi There is another issue, discovered some time ago by @Luthaf, @frostedoyster, and me, related to a non-vectorised CPU-GPU data transfer in the models' backward. I'm talking about this PR: https://github.com/lab-cosmo/metatensor/pull/636. As far as I can see, it should already be available in the latest metatensor-torch release. @Luthaf, do we have the commit with the fix in the LAMMPS metatensor-torch dependency, or is https://github.com/lab-cosmo/metatensor/pull/636 not available in LAMMPS?
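For context, a toy illustration (not the actual metatensor code) of the pattern that PR addresses: many small CPU-GPU copies pay the transfer latency once per tensor, while a single batched copy pays it once.

```python
import torch

values = [torch.randn(100) for _ in range(1000)]

# slow: one small host-to-device copy per tensor, each paying its own latency
slow = [v.to("cuda") for v in values]

# fast: stack on the CPU first, then do a single contiguous transfer
fast = torch.stack(values).to("cuda")
```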

Luthaf commented 1 week ago

So this overhead can come from multiple places, and only one of them is inside metatrain:

  1. the wrapper of the PET model into a metatensor-compatible interface, here: https://github.com/lab-cosmo/metatrain/blob/8ae0a9426563ff5de65a6fbebc3c8245fb609c5d/src/metatrain/experimental/pet/model.py#L71-L112
  2. the checks performed by MetatensorAtomisticModel, which can be disabled with check_consistency=False (see the sketch after this list)
  3. the code inside LAMMPS itself.
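For point 2, a hedged sketch of how the checks could be disabled when evaluating an exported model directly from Python. The function and field names below (load_atomistic_model, ModelEvaluationOptions, ModelOutput) are from memory of the metatensor-torch atomistic API and may differ between releases, so treat this as an illustration rather than a reference.

```python
from typing import List

from metatensor.torch.atomistic import (
    ModelEvaluationOptions,
    ModelOutput,
    System,
    load_atomistic_model,
)

def evaluate_without_checks(model_path: str, systems: List[System]):
    # load an exported model (e.g. the PET model exported by metatrain)
    model = load_atomistic_model(model_path)

    options = ModelEvaluationOptions(
        length_unit="angstrom",
        outputs={"energy": ModelOutput(per_atom=False)},
    )

    # check_consistency=False skips the metadata/shape checks on every call
    return model(systems, options, check_consistency=False)
```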

The timings @abmazitov gave me a while ago were around:

I am slowly looking into the neighbor-list conversion step; if other people want to look into it, I'm happy to explain the code!


For Davide's results, something else could be happening here. I could try to add some code to print the number of nodes in the computational graph inside LAMMPS, but if this is the bottleneck, the fix would have to come from changes in point 1 or in PET itself.
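On the graph-size question: a quick way to get that number from the Python side (before touching the LAMMPS plugin) is to walk the autograd graph hanging off the energy tensor. The helper below is just a sketch, with a toy expression standing in for the model output.

```python
import torch

def count_graph_nodes(output: torch.Tensor) -> int:
    # breadth-unordered traversal of the autograd graph, deduplicating nodes
    seen = set()
    stack = [output.grad_fn]
    while stack:
        node = stack.pop()
        if node is None or node in seen:
            continue
        seen.add(node)
        for next_fn, _ in node.next_functions:
            stack.append(next_fn)
    return len(seen)

# example on a toy graph
x = torch.randn(10, requires_grad=True)
y = (x.sin() * x.exp()).sum()
print(count_graph_nodes(y))
```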

Luthaf commented 1 week ago

@Luthaf, do we have the commit with the fix in the LAMMPS metatensor-torch dependency?

Yes, I recently updated this to pull the latest release of metatensor-torch in LAMMPS. If you believe this is the issue, you can try to build this commit: https://github.com/lab-cosmo/lammps/commit/60ff741ee7d60644c8bd9642952e71137c8b6b72