torchmd / torchmd-net

Training neural network potentials

Allow a changing box in models #221

Closed: RaulPPelaez closed this issue 11 months ago

RaulPPelaez commented 1 year ago

@franklinHu1 You mentioned this functionality would be useful for your use case. Could you please elaborate on your requirements? Would it be enough to be able to change the box between training and inference, or would you rather be able to call the models with model(z, pos, batch, box)?

FranklinHu1 commented 1 year ago

I think making it possible to call the model with model(z, pos, batch, box) would be best, especially because, when integrating with openmm-torch to run dynamics, I could then design the ForceModule so that it takes the box as an argument:

import torch

class ForceModule(torch.nn.Module):
    def forward(self, positions, box_vectors):
        ...
        # Forward the box vectors passed in by openmm-torch to the trained model
        energy, force = self.model.forward(self.z, positions, box_vectors)
        return energy, force
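
For reference, a usage sketch on the OpenMM side (assuming the standard openmm-torch TorchForce API; the file name model.pt and the system object are placeholders):

from openmmtorch import TorchForce

force = TorchForce("model.pt")  # TorchScript file saved from a module like ForceModule above
force.setOutputsForces(True)  # the module above returns (energy, forces)
force.setUsesPeriodicBoundaryConditions(True)  # OpenMM then passes the current box vectors to forward()
system.addForce(force)  # 'system' is an existing openmm.System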

In terms of training, it would be great if the box vectors could be an argument that is interpreted from the yaml configuration file. This is a simple change to train.py and the configuration yaml file, something like:

...
pbc_box : [12.4, 12.4, 12.4]
...

and in train.py:

...
parser.add_argument('--pbc-box', ...)
...

Obviously I then convert the box lengths into a diagonal (3, 3) matrix for the vectors.
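
As a minimal sketch of that conversion (the pbc_box name follows the example above; everything else is just illustration):

import torch

# Three box edge lengths from the yaml / CLI argument, e.g. [12.4, 12.4, 12.4]
box_lengths = torch.tensor(args.pbc_box, dtype=torch.float32)
# Diagonal (3, 3) matrix of box vectors for an orthorhombic cell
box_vectors = torch.diag(box_lengths)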

In summary, I think it would be very helpful if:

- the models could be called with the box as an argument to forward, i.e. model(z, pos, batch, box); and
- the box vectors could be specified in the yaml configuration file for training.

Thank you very much, and please let me know if anything needs to be clarified!

FranklinHu1 commented 1 year ago

Hello,

Just checking in: has there been any movement on this issue?

Thank you!

RaulPPelaez commented 1 year ago

Sorry, I have not gotten around to it yet. I will get to it ASAP. I will probably implement it just as you designed. If two boxes are provided (yaml + argument to forward), the one passed to forward takes precedence.

There is the issue of where the box should live. Sometimes one really needs the box information on the CPU side (e.g., when building a cell list), and it is small enough that it can be communicated to the GPU without a memory copy (as an argument to a CUDA kernel). I believe the box should be a 3x3 tensor that resides in CPU memory. Up until now the box has been something that cannot change, so the model can store it however it likes.
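
A minimal sketch of that layout (the class and argument names here are hypothetical, only to illustrate where the box would live):

import torch

class NeighborSearch(torch.nn.Module):  # hypothetical stand-in for the distance module
    def __init__(self, cutoff: float, box=None):
        super().__init__()
        self.cutoff = cutoff
        # Keep the box as a (3, 3) tensor in CPU memory: it is consumed host-side
        # (e.g. when building a cell list) and is small enough to pass to CUDA kernels by value.
        self.box = None if box is None else torch.as_tensor(box, dtype=torch.float32).cpu()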

FranklinHu1 commented 1 year ago

That sounds great, thank you!

RaulPPelaez commented 1 year ago

I am working on this functionality in #201. It would be great if you could help me with testing.

FranklinHu1 commented 1 year ago

Thank you, I will take a look ASAP!

FranklinHu1 commented 1 year ago

Sorry about the delay in testing this, some of my systems were down, but I gave the new method of specifying the PBCs a shot. I am getting the following error in the log file when specifying the PBCs in the yaml configuration file while training a new model:

Traceback (most recent call last):
  File "/global/homes/f/frankhu/torchmd-net/scripts/train.py", line 208, in <module>
    main()
  File "/global/homes/f/frankhu/torchmd-net/scripts/train.py", line 141, in main
    model = LNNP(args, prior_model=prior_models, mean=data.mean, std=data.std)
  File "/global/u2/f/frankhu/torchmd-net/torchmdnet/module.py", line 30, in __init__
    self.model = create_model(self.hparams, prior_model, mean, std)
  File "/global/u2/f/frankhu/torchmd-net/torchmdnet/models/model.py", line 70, in create_model
    representation_model = TorchMD_ET(
  File "/global/u2/f/frankhu/torchmd-net/torchmdnet/models/torchmd_et.py", line 127, in __init__
    self.distance = OptimizedDistance(
  File "/global/u2/f/frankhu/torchmd-net/torchmdnet/models/utils.py", line 203, in __init__
    self.box = self.box.cpu()  # All strategies expect the box to be in CPU memory
AttributeError: 'list' object has no attribute 'cpu'

I have specified the periodic box as follows in my yaml file:

box_vecs: [[7.831,0,0],[0,7.831,0],[0,0,7.831]]

I am a little confused, though, because train.py has a handler for converting the yaml input to a torch tensor, so I am not sure whether the problem is with my specification of the PBCs or with the code that interprets them.

RaulPPelaez commented 1 year ago

Not sure why that arose, but I found more problems down the line with parsing the box as a tensor directly. I updated the PR so that box_vecs is taken as a list and then transformed into a tensor in create_model.
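
Roughly, the conversion behaves like this sketch (an illustration of the behaviour just described, not the exact code in the PR):

import torch

def box_from_config(box_vecs):  # hypothetical helper, for illustration
    # box_vecs arrives from the yaml as a nested list, e.g. [[7.831, 0, 0], [0, 7.831, 0], [0, 0, 7.831]]
    if box_vecs is None:
        return None
    return torch.tensor(box_vecs, dtype=torch.float32)  # (3, 3) tensor handed to the model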

FranklinHu1 commented 12 months ago

Hello!

Sorry for the very long delay in getting back to this, but the changes you made to the PR worked! I have been able to train both the equivariant transformer and tensornet for a few epochs using specific periodic boundary conditions. I will run a longer training to converge the losses and then try running dynamics with these models to see if the entire workflow is successful.

Thanks!

RaulPPelaez commented 11 months ago

Thanks! I will wait for your feedback on dynamics and then we can merge and release.

FranklinHu1 commented 11 months ago

@RaulPPelaez so training of the equivariant transformer model and compilation to TorchScript for TorchForce work as expected. However, there is a constraint that the box vectors must exist in CPU memory. I get the following error if I attempt to run dynamics on the CUDA platform via openmm on an A100 GPU:

########## READING CLI ARGUMENTS ##########
Setting up save directory for dynamics outputs
########## SETTING UP SIMULATION SYSTEM ##########
Initializing system based on atoms listed in pdb file
Checking system has no forces or constraints
0
Model outputs forces: True
Torchscript module copied
Model uses passed box vectors: True
Traceback (most recent call last):
  File "/pscratch/sd/f/frankhu/openmm_INXS/run_INXS_dynamics.py", line 116, in <module>
    simulation.context.setVelocitiesToTemperature(300 * kelvin)
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.10/site-packages/openmm/openmm.py", line 3512, in setVelocitiesToTemperature
    return _openmm.Context_setVelocitiesToTemperature(self, *args)
openmm.OpenMMException: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/torchmdnet/models/utils.py", line 33, in forward
      pass
    else:
      ops.prim.RaiseException(_0)
      ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    max_num_pairs = self.max_num_pairs
    max_num_pairs0 = self.max_num_pairs

Traceback of TorchScript, original code (most recent call last):
  File "/global/u2/f/frankhu/torchmd-net/torchmdnet/models/utils.py", line 234, in forward
        assert box is not None, "Box must be provided"
        box = box.to(pos.dtype)
        assert box.device == torch.device("cpu"), "Box must be in CPU memory"
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        max_pairs : int = self.max_num_pairs
        if self.max_num_pairs < 0:
RuntimeError: AssertionError: Box must be in CPU memory

Is it possible to allow the box to exist in GPU memory so that it can run on the CUDA platform? Running dynamics on the CPU platform works as expected btw, so the overall workflow seems to work except for this issue.

Thanks!

RaulPPelaez commented 11 months ago

Thanks for the feedback! The box must be in CPU memory, but the work happens on the GPU. Being just 9 numbers, it can simply be passed by value as an argument to the appropriate kernels: https://github.com/torchmd/torchmd-net/blob/a67ac80299b153390737d8bf8e16670aaf234139/torchmdnet/extensions/neighbors/neighbors_cuda_shared.cuh#L13-L17 However, if OpenMM provides the box in GPU memory, it would be a shame to copy it to the CPU and then back. Let me see if I can make it work in either case.
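
In the meantime, one possible workaround (sketched here as an assumption about how the wrapper module could cope, not as the final fix) is to move the box OpenMM hands in, which is GPU-resident on the CUDA platform, to the CPU before calling the model:

import torch

class ForceModule(torch.nn.Module):
    def forward(self, positions, box_vectors):
        # OpenMM provides box_vectors on the simulation device; the neighbor search
        # currently asserts a CPU-resident box, so move it explicitly.
        box_cpu = box_vectors.to("cpu")
        energy, force = self.model.forward(self.z, positions, box_cpu)
        return energy, force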

FranklinHu1 commented 11 months ago

Thanks, that sounds good to me! If this turns out to be more trouble than it's worth, it would be fine if you have a recommended workaround.

FranklinHu1 commented 11 months ago

@RaulPPelaez the dynamics on the CUDA platform using the A100s work for both the equivariant transformer and tensornet models, and they give good agreement with some past AIMD water trajectories in calculations of RDFs and VDOSs. At this point, it appears that the PBC implementation is working and you can move forward with merging it into the main branch.

I've attached the plots below, where ET = equivariant transformer and TNET = tensornet.

Thanks for all the work, and please let me know if there's anything else you'd like me to try out/test!

[Attached plots: water_RDF_ET, water_RDF_TNET, water_VDOS_ET, water_VDOS_TNET]

RaulPPelaez commented 11 months ago

This is great, thank you so much for your effort! I will close this and merge #201.