The line you shared is there precisely to avoid NaNs when the distance is zero: it replaces zeros with ones so that 1/0 never happens. It is natural for edge_vec to contain zeros; these can come from self-interactions (the (i, i) pair, if you like) and also from "unused" pairs. The code is supposed to ignore those.
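The pattern is essentially the following (a minimal sketch, not the exact repository code; variable names are illustrative):

import torch

# Illustrative sketch of the zero-distance guard.
edge_vec = torch.tensor([[1.0, 0.0, 0.0],    # a regular pair
                         [0.0, 0.0, 0.0]])   # a self/"unused" pair
edge_weight = edge_vec.norm(dim=1)           # distances: [1.0, 0.0]

# Replace zero distances with 1 before dividing, so 0/0 (-> NaN) never happens;
# the affected rows stay all-zero and are ignored downstream.
safe_weight = edge_weight.masked_fill(edge_weight == 0, 1.0)
edge_dir = edge_vec / safe_weight.unsqueeze(1)
print(edge_dir)   # no NaNs; the zero row remains [0., 0., 0.]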
I cannot reproduce your issue on a 4090 (sadly I do not have access to an A100), so I am inclined to believe this is either environment or system dependent.
Could you check if this behavior also happens in a fresh environment with just "conda install torchmd-net" installed?
BTW, with "max_gradient" are you referring to the "gradient_clipping" option?
BTW I noticed the SPICE 1.1.4 version was missing in the dataset class. I fixed it here: https://github.com/torchmd/torchmd-net/pull/303
The line you shared is there precisely to avoid NaNs when the distance is zero: it replaces zeros with ones so that 1/0 never happens. It is natural for edge_vec to contain zeros; these can come from self-interactions (the (i, i) pair, if you like) and also from "unused" pairs. The code is supposed to ignore those.
Previously, I first located NaNs in the x and vec outputs from the representation module. If edge_vec is not the problem, should I check the neighbor embedding module or other places?
I cannot reproduce your issue on a 4090 (sadly I do not have access to an A100), so I am inclined to believe this is either environment or system dependent.
Unfortunately I do not have a 4090 :(. To check whether it is an environment issue, could you provide a yaml for me to test on?
Could you check if this behavior also happens in a fresh environment with just "conda install torchmd-net" installed?
Sure. I used "mamba install torchmd-net" in a fresh environment and then "mamba install wandb". This gives me the following environment, which is nearly the same as my previous one:
env-torchmd.txt
Then I used the default ET-SPICE.yaml with version 1.1.3 and the command
CUDA_VISIBLE_DEVICES=0 torchmd-train --conf examples/ET-SPICE-test.yaml --log-dir outputs/spice-test --wandb-use True --wandb-name spice-test-ET --wandb-project nnp
to run ET on SPICE. I copied my terminal output here:
output.txt
Similar to my previous run, wandb does not record any metrics even though training has been going on for a while. I then shut it down and used pdb to trace the intermediate outputs; the NaNs are still there.
BTW, with "max_gradient" are you referring to the "gradient_clipping" option?
By max_gradient I mean filtering out samples with extra-large forces, as specified in TensorNet-SPICE.yaml, not gradient_clipping.
BTW, I also tested it with TensorNet, and the results are the same, so this is not a model-dependent behaviour.
I think I may have some more clues. After applying the mask, edge_vec[mask] still contains zero vectors, which causes the following computation to produce NaN values. Is it related to the warning in my terminal output saying "You are using a CUDA device ('NVIDIA A100 80GB PCIe') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance"?
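To illustrate what I mean, here is a small self-contained sketch with toy tensors standing in for the real edge_vec and mask:

import torch

# Toy stand-ins for the real edge_vec and mask inspected under pdb.
edge_vec = torch.tensor([[0.0, 0.0, 0.0],    # a zero vector that survives the mask
                         [1.0, 2.0, 2.0]])
mask = torch.tensor([True, True])

masked = edge_vec[mask]
zero_rows = (masked == 0).all(dim=1)
print(zero_rows.sum().item(), "zero vectors left after masking")

# Dividing those rows by their (zero) norm is what produces the NaNs.
normed = masked / torch.norm(masked, dim=1).unsqueeze(1)
print(torch.isnan(normed).any())   # True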
Besides, it is weird that wandb does not record anything; usually there is at least a "training loss" shown in the terminal.
I am still unable to reproduce.
The Tensor Core message you see is just a hint that you have the option of trading a little accuracy for performance:
torch.set_float32_matmul_precision('medium' | 'high')
https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html
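If you want to act on that hint, a single call near the top of the training script is enough (a sketch; it only affects float32 matmul precision/speed on Tensor Cores and is unrelated to the NaNs):

import torch

# Optional: trade a bit of float32 matmul precision for speed on Tensor Cores.
torch.set_float32_matmul_precision("high")   # or "medium"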
Would you share a specific input to the model that produces NaNs for you? I am talking about a set of positions + atomic numbers.
Also, does it happen if you run the forward pass using the CPU model? (i.e., calling model = model.to(torch.float32) and similarly converting the model inputs)
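Something along these lines would be enough (a sketch: the checkpoint path is a placeholder, and I am assuming the wrapper is called as model(z, pos, batch) and returns an (energy, negative-gradient) pair; adapt it to your setup):

import torch
from torchmdnet.models.model import load_model   # torchmd-net's checkpoint-loading helper

# Placeholder checkpoint path; use the checkpoint from your failing run.
model = load_model("path/to/checkpoint.ckpt").to("cpu")

z = torch.tensor([8, 8, 6, 1, 1], dtype=torch.long)   # atomic numbers of the failing sample
pos = torch.zeros(len(z), 3, dtype=torch.float32)      # its positions
batch = torch.zeros_like(z)                            # a single molecule

y, neg_dy = model(z, pos, batch)
print(torch.isnan(y).any())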
Additionally, could you confirm the tests pass in your machine?
cd tests
pytest -x -s -v test*py
Additionally, could you confirm the tests pass in your machine?
cd tests
pytest -x -s -v test*py
I cannot pass the first test. Here is the output: errors.txt
I am still unable to reproduce. The Tensor Core message you see is just a hint that you have the option of trading a little accuracy for performance:
torch.set_float32_matmul_precision('medium' | 'high')
https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html
Would you share a specific input to the model that produces NaNs for you? I am talking about a set of positions + atomic numbers.
I set the batch size to 1, and the NaN occurs at the third sample. The input atomic numbers are
tensor([ 8, 8, 8, 8, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 6,
6, 6, 6, 6, 6, 6, 6, 16, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], device='cuda:0')
; the input pos is all zero:
tensor([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]], device='cuda:0', requires_grad=True).
Emmmm, I am not sure what causes the coordinates to be all zeros in this sample; the SPICE processing was all done with the program's defaults.
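For reference, with all-zero positions every pairwise edge vector is zero, so normalizing edge_vec gives NaN for every edge. A minimal sketch (the all-pairs indices just stand in for the neighbor list):

import torch

pos = torch.zeros(49, 3)                         # the all-zero sample above
i, j = torch.triu_indices(49, 49, offset=1)      # all i < j pairs, standing in for the neighbor list
edge_vec = pos[j] - pos[i]                       # every edge vector is [0., 0., 0.]
edge_dir = edge_vec / torch.norm(edge_vec, dim=1).unsqueeze(1)
print(torch.isnan(edge_dir).all())               # True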
I see. Our tests do not cover the case of two atoms being at the exact same location. I am unsure whether this is supposed to be a valid input. @guillemsimeon do you think this would make sense in some situation?
I am now convinced that the issue somehow lies in SPICE providing a bogus sample. Could it simply be that your local download of the dataset got corrupted? Try removing the dataset_root folder to force the code to download the dataset again.
The SPICE 1.1.3 dataset was freshly downloaded by the code before training yesterday. I can try a fresh download of SPICE 1.1.4 to see if the same issue happens.
I just checked the raw SPICE-1.1.3.hdf5; there seem to be no conformations whose coordinates are all zeros.
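This is roughly how I scanned it (a sketch with h5py; I am assuming each molecule group stores its coordinates in a 'conformations' dataset, as in SPICE 1.1.x):

import h5py
import numpy as np

# Look for conformations whose coordinates are exactly all zeros in the raw file.
with h5py.File("SPICE-1.1.3.hdf5", "r") as f:
    for name, group in f.items():
        conf = np.asarray(group["conformations"])       # assumed shape: (n_conf, n_atoms, 3)
        all_zero = ~conf.reshape(conf.shape[0], -1).any(axis=1)
        if all_zero.any():
            print(name, np.flatnonzero(all_zero))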
It seems that the previously processed dataset somehow got corrupted. Re-downloading the dataset fixed the issue. Sorry for the trouble!
Dear developers,
I am trying to train ET on SPICE with the newest torchmd-net 2.0, but the training loss becomes NaN at the very beginning. I tracked down where the NaN happens and found that the edge_vec output from the OptimizedDistance module here contains zero vectors, which causes NaNs in
edge_vec[mask] / torch.norm(edge_vec[mask], dim=1).unsqueeze(1)
I also found that the input positions contain many zeros, but I have not had time to inspect this further. I believe I did not encounter this problem with the previous Distance module when training ET on SPICE. My immediate thought was to change the OptimizedDistance module back to the Distance module, but my attempt failed because I cannot find a suitable torch_cluster for the new torchmd-net environment. Have you ever encountered this problem, or could you reproduce it? I followed the instructions on this page to install torchmd-net 2.0 from source, and below is my current conda environment:
I used mamba 1.5.5 to install all the packages. The test was run on an A100-80G with the command
CUDA_VISIBLE_DEVICES=0 python scripts/train.py --conf examples/ET-SPICE.yaml
Note that the only changes I made to the default ET-SPICE.yaml are setting version: 1.1.3 and max_gradient: 50.94. The SPICE dataset is not pre-downloaded.
Any help on this would be greatly appreciated. Thank you!
Ming-an