minnervva / torchdetscan

This is a tool for finding non-deterministic functions in your PyTorch code.
https://github.com/minnervva/torchdetscan
MIT License

RTFS DeepMD #4

Open markcoletti opened 10 months ago

markcoletti commented 10 months ago

Take a deep dive into the DeePMD codebase. We need to understand how it works at a fundamental level.

asedova commented 10 months ago

Found the data loader in https://github.com/deepmodeling/deepmd-kit/blob/master/deepmd/train/trainer.py. It uses https://github.com/deepmodeling/deepmd-kit/tree/master/deepmd/utils/random.py and data_system.py in the same utils directory. random.py is just a wrapper around NumPy's older RandomState generator, which is technically legacy (NumPy recommends the newer Generator API for new code), but the seed passed in from the input JSON file is set on it, so that should work fine. Beyond that, the frames are simply chosen using this RNG. That is a little strange, since you would expect to train on ALL the frames rather than a random subset that could potentially contain repetitions. In any case, at the DeePMD level the data loading does appear to be deterministic. There may still be some kind of streaming happening at the TF or Horovod level, though.
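To make the seeding argument concrete, here is a minimal sketch of why a seeded RandomState gives a reproducible frame order, and why random selection can repeat frames. This is not DeePMD-kit's actual code: `pick_frames` and its parameters are hypothetical, and it assumes frames are drawn independently (with replacement), which I have not confirmed against data_system.py.

```python
import numpy as np


def pick_frames(seed, n_frames, batch_size, n_steps):
    """Hypothetical stand-in for a seeded frame sampler (not DeePMD-kit code)."""
    # RandomState is NumPy's legacy generator; the same seed yields the same stream.
    rng = np.random.RandomState(seed)
    # Assumption: each batch draws frame indices independently, with replacement.
    return [rng.randint(0, n_frames, size=batch_size).tolist() for _ in range(n_steps)]


# Two "runs" with the same seed select identical frames at every step.
run_a = pick_frames(seed=42, n_frames=10, batch_size=4, n_steps=5)
run_b = pick_frames(seed=42, n_frames=10, batch_size=4, n_steps=5)
same_seed_matches = run_a == run_b

# 20 draws from only 10 frames must repeat at least one frame.
flat = [i for batch in run_a for i in batch]
has_repeats = len(set(flat)) < len(flat)
```

If the real sampler behaves like this, any run-to-run difference in the data order would have to come from a layer above or below it, not from the RNG itself.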

asedova commented 10 months ago

Next, we need to check the TF/Horovod levels of distributed training to see whether there is any task stealing, asynchronous data streaming, or similar behavior.
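One possible way to run that check, sketched under assumptions: log a short fingerprint of every batch each worker actually receives, then compare the logs from two runs with the same seed. The helper names below are hypothetical, the batches are plain lists of frame indices for illustration, and nothing here uses a TF or Horovod API.

```python
import hashlib


def batch_fingerprint(batch):
    """Hash a batch (here, a list of frame indices) into a short hex digest."""
    data = ",".join(str(i) for i in batch).encode("utf-8")
    return hashlib.sha256(data).hexdigest()[:12]


def first_divergence(run_a, run_b):
    """Return the first step where two fingerprint logs differ, or None.

    Only the overlapping steps are compared if the logs differ in length.
    """
    for step, (a, b) in enumerate(zip(run_a, run_b)):
        if a != b:
            return step
    return None


# Hypothetical logged batches from two runs; step 1 arrives in a different order.
run_a = [batch_fingerprint(b) for b in [[0, 1], [2, 3], [4, 5]]]
run_b = [batch_fingerprint(b) for b in [[0, 1], [3, 2], [4, 5]]]
diverged_at = first_divergence(run_a, run_b)
```

The fingerprint is order-sensitive on purpose, so a reordering inside a batch counts as a divergence, since it can change floating-point reduction results.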