RaulPPelaez closed this issue 11 months ago
I think making it so that you can call the model with `model(z, pos, batch, box)` would be best, especially when integrating with OpenMM-Torch to run dynamics: I could then design the `ForceModule` so that it takes the box as an argument:

```python
class ForceModule(torch.nn.Module):
    def forward(self, positions, box_vectors):
        ...
        energy, force = self.model.forward(self.z, positions, box_vectors)
```
In terms of training, it would be great if the box vectors could be an argument interpreted from the YAML configuration file. This is a simple change to train.py and the configuration YAML file, something like:

```yaml
...
pbc_box: [12.4, 12.4, 12.4]
...
```

and in train.py:

```python
...
parser.add_argument('--pbc-box', ...)
...
```
Obviously, I then convert the box lengths into a diagonal (3, 3) matrix for the box vectors.
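A minimal sketch of that conversion, assuming PyTorch (the helper name is mine, not part of torchmd-net):

```python
import torch

def lengths_to_box(lengths):
    # Turn per-axis box lengths, e.g. [12.4, 12.4, 12.4], into the
    # diagonal (3, 3) matrix of box vectors for an orthorhombic cell.
    return torch.diag(torch.tensor(lengths, dtype=torch.float32))
```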
In summary, I think it would be very helpful if the model could be called as:

```python
model(z, pos, batch, box)
```
Thank you very much, and please let me know if anything needs to be clarified!
Hello,
Just checking in, has there been any movement on this issue?
Thank you!
Sorry, I have not gotten around to it yet. I will get to it ASAP. I will probably implement it just as you designed: if two boxes are provided (YAML + argument to forward), the one passed to forward takes precedence.
There is also the issue of where the box should live. Sometimes one really needs the box information CPU-side (e.g. when building a cell list), and it is small enough that it can be communicated to the GPU without a memory copy (as an argument to a CUDA kernel). I believe the box should be a 3x3 tensor that resides in CPU memory. Up until now the box has been something that cannot change, so the model can store it however it likes.
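As a sketch of that invariant (illustrative only, loosely mirroring what the neighbor-list code does; the class name is made up):

```python
import torch

class NeighborListSketch:
    # Illustrative: store the box as a (3, 3) tensor pinned to CPU
    # memory at construction time, regardless of where it came from.
    # Keeping it on the CPU is cheap since it is only 9 numbers, and
    # it can still be forwarded to CUDA kernels by value.
    def __init__(self, box=None):
        if box is None:
            box = torch.zeros(3, 3)
        self.box = box.cpu()
```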
That sounds great, thank you!
I am working on this functionality in #201. It would be great if you would help me with testing.
Thank you, I will take a look ASAP!
Sorry about the delay in testing this; some of my systems were down, but I did give the new method of specifying the PBCs a shot. I am getting the following error in the log file when specifying the PBCs in the YAML configuration file for training a new model:
```
Traceback (most recent call last):
  File "/global/homes/f/frankhu/torchmd-net/scripts/train.py", line 208, in <module>
    main()
  File "/global/homes/f/frankhu/torchmd-net/scripts/train.py", line 141, in main
    model = LNNP(args, prior_model=prior_models, mean=data.mean, std=data.std)
  File "/global/u2/f/frankhu/torchmd-net/torchmdnet/module.py", line 30, in __init__
    self.model = create_model(self.hparams, prior_model, mean, std)
  File "/global/u2/f/frankhu/torchmd-net/torchmdnet/models/model.py", line 70, in create_model
    representation_model = TorchMD_ET(
  File "/global/u2/f/frankhu/torchmd-net/torchmdnet/models/torchmd_et.py", line 127, in __init__
    self.distance = OptimizedDistance(
  File "/global/u2/f/frankhu/torchmd-net/torchmdnet/models/utils.py", line 203, in __init__
    self.box = self.box.cpu()  # All strategies expect the box to be in CPU memory
AttributeError: 'list' object has no attribute 'cpu'
```
I have specified the periodic box as follows in my YAML file:

```yaml
box_vecs: [[7.831, 0, 0], [0, 7.831, 0], [0, 0, 7.831]]
```
I am a little confused, though, because train.py has a handler for converting the YAML input to a torch tensor, so I am not sure whether the problem is with my specification of the PBCs or with the code that interprets them.
Not sure why that arose, but I found more problems down the line with parsing the box as a tensor directly. I updated the PR so that box_vecs is taken as a list and then transformed into a tensor in create_model.
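Roughly, the idea is the following (an illustrative sketch of the approach, not the exact code in the PR):

```python
import torch

def box_from_config(box_vecs):
    # box_vecs arrives straight from the YAML file as a nested list,
    # e.g. [[7.831, 0, 0], [0, 7.831, 0], [0, 0, 7.831]].
    # Build the (3, 3) tensor only here, at model creation time, so the
    # argparse/YAML layer never has to deal with torch tensors.
    if box_vecs is None:
        return None
    box = torch.tensor(box_vecs, dtype=torch.float32)
    assert box.shape == (3, 3), "box_vecs must describe a (3, 3) matrix"
    return box
```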
Hello!
Sorry for the very long delay in getting back to this, but the changes you made to the PR worked! I have been able to train both the equivariant transformer and tensornet for a few epochs using specific periodic boundary conditions. I will run a longer training to converge the losses and then try running dynamics with these models to see if the entire workflow is successful.
Thanks!
Thanks! I will wait for your feedback on dynamics and then we can merge and release.
@RaulPPelaez so the training of the equivariant transformer model and compilation to TorchScript for TorchForce work as expected. However, there is a constraint that the box vectors must exist in CPU memory. I get the following error if I attempt to run dynamics on the CUDA platform via OpenMM on an A100 GPU:
```
########## READING CLI ARGUMENTS ##########
Setting up save directory for dynamics outputs
########## SETTING UP SIMULATION SYSTEM ##########
Initializing system based on atoms listed in pdb file
Checking system has no forces or constraints
0
Model outputs forces: True
Torchscript module copied
Model uses passed box vectors: True
```
```
Traceback (most recent call last):
  File "/pscratch/sd/f/frankhu/openmm_INXS/run_INXS_dynamics.py", line 116, in <module>
    simulation.context.setVelocitiesToTemperature(300 * kelvin)
  File "/global/homes/f/frankhu/.conda/envs/torchmd-net/lib/python3.10/site-packages/openmm/openmm.py", line 3512, in setVelocitiesToTemperature
    return _openmm.Context_setVelocitiesToTemperature(self, *args)
openmm.OpenMMException: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/torchmdnet/models/utils.py", line 33, in forward
        pass
      else:
        ops.prim.RaiseException(_0)
        ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    max_num_pairs = self.max_num_pairs
    max_num_pairs0 = self.max_num_pairs
Traceback of TorchScript, original code (most recent call last):
  File "/global/u2/f/frankhu/torchmd-net/torchmdnet/models/utils.py", line 234, in forward
        assert box is not None, "Box must be provided"
        box = box.to(pos.dtype)
        assert box.device == torch.device("cpu"), "Box must be in CPU memory"
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        max_pairs : int = self.max_num_pairs
        if self.max_num_pairs < 0:
RuntimeError: AssertionError: Box must be in CPU memory
```
Is it possible to allow the box to exist in GPU memory so that the model can run on the CUDA platform? Running dynamics on the CPU platform works as expected, by the way, so the overall workflow is fine except for this issue.
Thanks!
Thanks for the feedback! The box must be in CPU memory, but the work happens on the GPU. Since the box is just 9 numbers, it can be passed by value as an argument to the appropriate kernels: https://github.com/torchmd/torchmd-net/blob/a67ac80299b153390737d8bf8e16670aaf234139/torchmdnet/extensions/neighbors/neighbors_cuda_shared.cuh#L13-L17 However, if OpenMM provides the box in GPU memory, it would be a shame to copy it to the CPU and then back. Let me see if I can make it work in either case.
Thanks, that sounds good to me! If this turns out to be more trouble than it's worth, it would be fine if you have a recommended workaround.
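One workaround I could imagine (a sketch only, following my earlier ForceModule example; the class name is made up and this is not tested against OpenMM-Torch) would be to copy the box to CPU inside the wrapper before calling the model:

```python
import torch

class BoxToCPUForceModule(torch.nn.Module):
    # Hypothetical wrapper: OpenMM may hand over the box vectors in GPU
    # memory, but the neighbor list expects them on the CPU, so copy
    # them over (only 9 numbers) on every forward call.
    def __init__(self, model, z):
        super().__init__()
        self.model = model
        self.z = z

    def forward(self, positions, box_vectors):
        box = box_vectors.detach().to("cpu")
        energy, force = self.model(self.z, positions, box)
        return energy, force
```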
@RaulPPelaez the dynamics on the CUDA platform using the A100s works for both the equivariant transformer and tensornet models, and they give good agreement with some past AIMD water trajectories on calculations of RDFs and VDOSs. At this point, it appears that the PBC implementation is working and you can move forward with merging it to the main branch.
I've attached the plots below, where ET = equivariant transformer and TNET = tensornet.
Thanks for all the work, and please let me know if there's anything else you'd like me to try out/test!
This is great, thank you so much for your effort! I will close this and merge #201.
@franklinHu1 You mentioned this functionality would be useful for your use case. Could you please elaborate on your requirements? Would it be enough to be able to change the box between training and inference, or would you rather have the possibility of calling the models with `model(z, pos, batch, box)`?