potus28 opened 11 months ago
Hi @potus28 ,
Thanks for your interest in our code!
If training subsequently continues at a reasonable speed, this is expected behavior due to TorchScript JIT compilation after 3 warmup calls. On the other hand, if you observe sustained performance degradation, please see discussion https://github.com/mir-group/nequip/discussions/311 and report the relevant details there.
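As a rough illustration of this warmup effect, here is a minimal sketch that times the first few calls of a toy `torch.jit.script`-compiled model. The model and tensor sizes are hypothetical stand-ins, not the Allegro model from this issue: the point is only that the first few calls are slower while the profiling executor records shapes and compiles specialized code, after which steady-state speed is reached.

```python
import time
import torch

# Toy stand-in for a real interatomic-potential model (hypothetical,
# for illustration only -- not the Allegro architecture).
mlp = torch.nn.Sequential(
    torch.nn.Linear(64, 256),
    torch.nn.SiLU(),
    torch.nn.Linear(256, 1),
)
scripted = torch.jit.script(mlp)  # TorchScript compilation

x = torch.randn(1024, 64)
for i in range(6):
    start = time.perf_counter()
    scripted(x)
    elapsed = time.perf_counter() - start
    # Expect the first few calls to be noticeably slower: the profiling
    # executor observes tensor shapes and JIT-compiles the graph, which
    # corresponds to the one-time pause seen after the early batches.
    print(f"call {i}: {elapsed * 1000:.2f} ms")
```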
Hi @Linux-cpp-lisp, thanks! Yes, training continues at a reasonable speed after this pause. Thanks again, and have a great day!
Hello MIR group,
I'm using Allegro as well as NequIP and FLARE to build MLIPs for modeling condensed-phase systems and systems for heterogeneous catalysis, and I'm having a little bit of difficulty with Allegro. On my laptop, I can build smaller Allegro models and training goes as expected. However, for larger models that I am training on Perlmutter, after training on the second batch of the first epoch, it takes a while for the third batch to process, and I get the message copied below. After this message is displayed, training continues as expected. Have you seen this issue before, and if so, is there a way to fix it so that training does not take so long at the beginning? I've copied the message, my Allegro config file, and my SLURM script from Perlmutter below. The SLURM script and config file are for a hyperparameter scan, and every hyperparameter combination I have tried so far shows this issue. Any help would be much appreciated. Thanks!
Sincerely, Woody
Message that appears in training:
SLURM script on Perlmutter:
Allegro config file: