Closed FranklinHu1 closed 5 months ago

Hello,

I am trying to do some tensornet training using the latest version of torchmd-net and a single H100 GPU. However, I encounter the following error:

My generated hparams.yaml for this experiment is as follows:

My dataset is in the HDF5 format and consists of boxes of 64 water molecules. I have attached it as a zip file to this issue, along with an npz file of the splits I use for training. I am a little confused because I have successfully run trainings with tensornet in the past using this data and these settings.

Any help would be greatly appreciated. Thank you very much! water_data.zip
We think this is a problem with the dataset. The error occurs while computing the losses, i.e. after the model has been executed, when the trainer compares the model outputs with the labels in your dataset.
Try importing the dataset manually and checking the shapes of a sample:
from torchmdnet.datasets import HDF5

# load the HDF5 dataset and inspect the shapes of the first sample
ds = HDF5("water_data.hdf")
sample = ds.get(0)
print(sample)
Yes, it does seem to be a problem with the dataset. When I run the suggested code snippet, I get the following output:
>>> from torchmdnet.datasets import HDF5
>>> ds = HDF5("W64_revPBE.h5")
Loading 1 HDF5 files (28.20 MB)
Preloading 1 HDF5 files (28.20 MB)
>>> sample = ds.get(0)
>>> print(sample)
Data(pos=[192, 3], z=[192], y=[6154], neg_dy=[192, 3])
The shape of y is wrong: there should be only one energy per frame, but y instead holds the energies of all 6154 frames in my dataset. Each frame has 64 water molecules, so 192 atoms in total, which matches the shapes of the positions, type embeddings, and forces.
Right now, my dataset is an h5 file with the following shapes for each of the keys:
energy: (6154,), i.e. (N_frames,)
forces: (6154, 192, 3), i.e. (N_frames, N_atoms, direction)
pos: (6154, 192, 3), i.e. (N_frames, N_atoms, direction)
types: (6154, 192), i.e. (N_frames, N_atoms)
Is the fix here simply to change the shape of the energies to add an extra dimension, i.e. (6154,) --> (6154, 1)?
Thank you!
I think so. Can you try it and let us know? Btw, I realized that you are using a very small model (0L, 64 hidden channels). I am just curious: does it perform satisfactorily? Also, I saw that you are using mean as the aggregation scheme. Is this for any particular reason? I never tried that.
Hi @RaulPPelaez @guillemsimeon,
So sorry for the late response! Yes, reshaping the energies to be (N_frames, 1) resolved the issue. Intuitively this makes sense, but it might be helpful to add some documentation around the hdf5 dataset specifying what shapes everything should be.
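For anyone who hits the same thing, the reshape itself can be done with h5py along these lines (a minimal sketch, not code from the repo; it assumes the datasets sit at the top level of the file under the key names listed above):

import h5py

# Rewrite the energies with an extra trailing dimension:
# (N_frames,) -> (N_frames, 1); the other keys are left untouched.
with h5py.File("W64_revPBE.h5", "r+") as f:
    energy = f["energy"][:]
    del f["energy"]  # drop the old 1-D dataset
    f.create_dataset("energy", data=energy.reshape(-1, 1))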
Regarding the model size @guillemsimeon, I mostly use this model size because it is quick to train and it works quite well, at least for the bulk water dynamics I am working on right now. The mean aggregation scheme, along with the previous standardization feature, turned out to be the root cause of many of the issues I was having in the past, so using the addition aggregation for everything is definitely the way to go. I will let you know if this model size continues to work for some of the more challenging systems I intend to tackle next.
Thanks again for all your help!
Glad you got it working. I was reading the code to add a comment about this, and I believe your use case should be supported. You found a bug!
This line: https://github.com/torchmd/torchmd-net/blob/8b472462f212aa58a36c03b26d75900acc09647c/torchmdnet/datasets/hdf.py#L92 should be:
tmp = tmp.unsqueeze(-1)
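(For context, this is plain torch semantics, not code from the repo: unsqueeze(-1) appends a trailing dimension, which is exactly the (N_frames,) -> (N_frames, 1) reshape discussed above.)

import torch

tmp = torch.tensor([1.0, 2.0, 3.0])  # per-frame energies, shape (3,)
tmp = tmp.unsqueeze(-1)              # shape (3, 1): one energy per frame
print(tmp.shape)                     # torch.Size([3, 1])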
You can confirm this is a bug in the cache preloading code by setting
dataset_preload_limit: 0
in your yaml, which skips this code.
Should be fixed by https://github.com/torchmd/torchmd-net/pull/313
Awesome, thanks for looking into this @RaulPPelaez! As far as model performance goes, this doesn't affect the training or any other operations, right?
It should not affect things greatly. In principle preloading should be faster, but YMMV. Let me know your experience!
Please feel free to reopen if the issue resurfaces.