Failed to train ET on SPICE with torchmd-net-2.0 #302

Closed AndChenCM closed 7 months ago

AndChenCM commented 7 months ago

Dear developers,

I am trying to train ET on SPICE with the newest torchmd-net-2.0, but the training loss becomes NaN at the very beginning. I tracked down where the NaN appears and found that the edge_vec output from the OptimizedDistance module here contains zero vectors, which causes NaN in edge_vec[mask] / torch.norm(edge_vec[mask], dim=1).unsqueeze(1). I also found that the input positions contain many zeros, but I have not had time to look into this further. I believe I did not encounter this problem with the previous Distance module when training ET on SPICE. My immediate thought was to switch from the OptimizedDistance module back to the Distance module, but my attempt failed because I cannot find a suitable torch_cluster for the new torchmd-net environment.

Have you ever encountered this problem or could you reproduce this? I followed the instructions on this page to install torchmd-net 2.0 from source, and below is my current conda environment:

I use mamba 1.5.5 to install all the packages. The test was run on an A100-80G with the command CUDA_VISIBLE_DEVICES=0 python scripts/ --conf examples/ET-SPICE.yaml. Note that the only changes I made to the default ET-SPICE.yaml are setting version: 1.1.3 and max_gradient: 50.94. The SPICE dataset was not pre-downloaded.

Any help on this would be greatly appreciated. Thank you!


RaulPPelaez commented 7 months ago

The line that you shared is there precisely to avoid the NaN when the distance is zero. It replaces zeros with ones so that 1/0 never happens. It is natural that edge_vec contains zeros, these can come from self interactions (the i,i pair if you'd like) and also "unused" pairs. The code is supposed to ignore those.

I cannot reproduce your issue on a 4090 (sadly I do not have access to an A100), so I am inclined to believe this is either environment or system dependent.

Could you check if this behavior also happens in a fresh environment with just "conda install torchmd-net" installed?

BTW, with "max_gradient" are you referring to the "gradient_clipping" option?

RaulPPelaez commented 7 months ago

BTW I noticed the SPICE 1.1.4 version was missing in the dataset class. I fixed it here:

AndChenCM commented 7 months ago

> The line that you shared is there precisely to avoid the NaN when the distance is zero. It replaces zeros with ones so that 1/0 never happens. It is natural that edge_vec contains zeros, these can come from self interactions (the i,i pair if you'd like) and also "unused" pairs. The code is supposed to ignore those.

Previously, I found that the x and vec outputs of the representation module contain NaN. If edge_vec is not the problem, should I check the neighbor embedding module or somewhere else? [image]

> I cannot reproduce your issue on a 4090 (sadly I do not have access to an A100), so I am inclined to believe this is either environment or system dependent.

Unfortunately I do not have a 4090 : (. To check if it is an environment issue, could you provide a yaml for me to test on?

> Could you check if this behavior also happens in a fresh environment with just "conda install torchmd-net" installed?

Sure. I ran mamba install torchmd-net in a fresh environment, followed by mamba install wandb. This gives me the following environment, which is nearly the same as my previous one: env-torchmd.txt

Then I use the default ET-SPICE.yaml with version 1.1.3, and command CUDA_VISIBLE_DEVICES=0 torchmd-train --conf examples/ET-SPICE-test.yaml --log-dir outputs/spice-test --wandb-use True --wandb-name spice-test-ET --wandb-project nnp to run ET on SPICE. I copied the output in my terminal here: output.txt

As in my previous run, wandb does not record any metrics even though training has been running for a while. I then stopped the run and used pdb to trace the intermediate outputs, and the NaN is still there. [image]
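
As an alternative to stepping through pdb, the first submodule that emits a non-finite value can be located with forward hooks. The following is a minimal sketch, not part of torchmd-net: `attach_nan_hooks`, `DivideByNorm`, and the toy `nn.Sequential` model are hypothetical names used only for illustration, and the real model would take their place.

```python
import torch
from torch import nn


def attach_nan_hooks(model: nn.Module) -> list:
    """Register forward hooks that record every submodule whose output
    contains NaN or Inf. Returns the list that the hooks append to."""
    offenders = []

    def make_hook(name):
        def hook(module, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for out in outs:
                if torch.is_tensor(out) and not torch.isfinite(out).all():
                    offenders.append(name)
                    break
        return hook

    for name, module in model.named_modules():
        if name:  # skip the root module, whose name is ""
            module.register_forward_hook(make_hook(name))
    return offenders


class DivideByNorm(nn.Module):
    """Toy layer (hypothetical) that normalizes rows without guarding zeros."""

    def forward(self, x):
        return x / torch.norm(x, dim=1, keepdim=True)


# Toy stand-in for the real model: the middle layer computes 0/0 on zero input.
toy = nn.Sequential(nn.Identity(), DivideByNorm(), nn.Identity())
offenders = attach_nan_hooks(toy)
toy(torch.zeros(4, 3))
print(offenders)  # submodule names in the order they produced NaN
```

The first entry in `offenders` is the earliest layer in the forward pass whose output was non-finite; later entries are usually just propagation.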

> BTW, with "max_gradient" are you referring to the "gradient_clipping" option?

By max_gradient I mean filtering out samples with extra-large forces, as specified in TensorNet-SPICE.yaml, not gradient_clipping.
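
To make the distinction concrete, a dataset-level force filter of this kind can be sketched as below. This is an illustration only: `keep_sample` is a hypothetical helper, and the criterion (largest per-atom force norm compared against the threshold) is an assumption, not necessarily the exact rule torchmd-net applies.

```python
import numpy as np


def keep_sample(forces: np.ndarray, max_gradient: float) -> bool:
    """Return True if the largest per-atom force norm is below the threshold.

    forces has shape (n_atoms, 3). The per-atom-norm criterion is an
    assumption made for this sketch.
    """
    return bool(np.linalg.norm(forces, axis=1).max() < max_gradient)


forces_ok = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
forces_bad = np.array([[60.0, 0.0, 0.0], [0.0, 2.0, 0.0]])

kept = [keep_sample(f, max_gradient=50.94) for f in (forces_ok, forces_bad)]
print(kept)
```

Unlike gradient clipping, which rescales parameter gradients during optimization, a filter like this removes whole samples from the dataset before training starts.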

AndChenCM commented 7 months ago

BTW, I also tested it with TensorNet, and the results are the same, so this behaviour is not model dependent.

AndChenCM commented 7 months ago

I think I might have some more clues. After applying the mask, edge_vec[mask] still contains zero vectors, which causes the following computation to produce NaN values. Is this related to the warning in my terminal output saying "You are using a CUDA device ('NVIDIA A100 80GB PCIe') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance"? [image] [image]

Besides, it is weird that wandb does not record anything. Usually I would expect at least a "training loss" to show up in the terminal.

RaulPPelaez commented 7 months ago

I am still unable to reproduce.
The tensorcore message you see is just a hint about the possibility you have to trade off a little bit of accuracy in exchange for performance: torch.set_float32_matmul_precision('medium' | 'high')

Would you share a specific input to the model that produces NaNs for you? I am talking about a set of positions + atomic numbers.

RaulPPelaez commented 7 months ago

Also, does it happen if you run the forward pass on the CPU? (i.e. calling model = and similarly with the model inputs)
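
For reference, one common way to run a forward pass on the CPU in PyTorch is to move both the module and its inputs with `.to("cpu")`. The sketch below uses a hypothetical `nn.Linear` stand-in and random positions, since the actual model and inputs are not shown in this thread.

```python
import torch
from torch import nn

# Hypothetical stand-in for the trained model; the real model takes its place.
model = nn.Linear(3, 1)
pos = torch.randn(5, 3)  # hypothetical input positions

device = torch.device("cpu")
model = model.to(device)  # move parameters and buffers to the CPU
pos = pos.to(device)      # inputs must live on the same device as the model

with torch.no_grad():
    out = model(pos)

print(out.device, torch.isfinite(out).all().item())
```

If the output is finite on the CPU but not on the GPU for the same input, the problem is more likely in a device-specific kernel than in the data.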

RaulPPelaez commented 7 months ago

Additionally, could you confirm the tests pass on your machine?

cd tests
pytest -x -s -v test*py

AndChenCM commented 7 months ago

> Additionally, could you confirm the tests pass on your machine?
>
> cd tests
> pytest -x -s -v test*py

I cannot pass the first test. Here is the output: errors.txt

AndChenCM commented 7 months ago

> I am still unable to reproduce. The tensorcore message you see is just a hint about the possibility you have to trade off a little bit of accuracy in exchange for performance: torch.set_float32_matmul_precision('medium' | 'high')
>
> Would you share a specific input to the model that produces NaNs for you? I am talking about a set of positions + atomic numbers.

I set the batch size to 1, and the NaN occurs at the third sample. The input atomic numbers are

tensor([ 8,  8,  8,  8,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  7,  6,
         6,  6,  6,  6,  6,  6,  6, 16,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
         1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1], device='cuda:0')

and the input pos is all zeros:

tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]], device='cuda:0', requires_grad=True)

I am not sure what causes the coordinates to be all zeros in this sample; SPICE is processed entirely with the program's default settings.

RaulPPelaez commented 7 months ago

I see. Our tests do not cover the case of two atoms being at the exact same location. I am unsure whether this is supposed to be a valid input. @guillemsimeon do you think this would make sense in some situation?

I am convinced now that the issue somehow lies in SPICE providing a bogus sample. Could it simply be that your local download of the dataset got corrupted? Try removing the dataset_root folder to force the code to download the DS again.
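
The failure mode can be illustrated in isolation. The snippet below is a minimal sketch, not the library code: it builds all ordered pairs for three atoms at the origin, masks out self-pairs (i == j) as in the expression quoted earlier in this thread, and normalizes the rest. The `masked_fill` variant at the end is just one possible guard, shown for comparison.

```python
import torch

# Three atoms at the exact same location, as in the corrupted sample.
pos = torch.zeros(3, 3)
n = pos.shape[0]

# All ordered pairs (i, j), including self-pairs.
i, j = torch.meshgrid(torch.arange(n), torch.arange(n), indexing="ij")
edge_index = torch.stack([i.reshape(-1), j.reshape(-1)])
edge_vec = pos[edge_index[0]] - pos[edge_index[1]]

# Masking only removes self-pairs; distinct atoms at the same position
# still pass the mask with a zero-length vector, so 0/0 gives NaN.
mask = edge_index[0] != edge_index[1]
unit = edge_vec.clone()
unit[mask] = edge_vec[mask] / torch.norm(edge_vec[mask], dim=1).unsqueeze(1)
print(torch.isnan(unit).any().item())

# One possible guard (an illustration, not the library's behavior):
# replace zero norms with one so coincident pairs get a zero direction.
norm = torch.norm(edge_vec, dim=1, keepdim=True)
safe = edge_vec / norm.masked_fill(norm == 0, 1.0)
print(torch.isnan(safe).any().item())
```

Whether coincident atoms should be tolerated or rejected as invalid input is a separate design question; the sketch only shows why the i != j mask alone does not protect against them.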

AndChenCM commented 7 months ago

SPICE 1.1.3 was freshly downloaded by the code before training yesterday. I can try a fresh download of SPICE 1.1.4 to see if the same issue happens.

AndChenCM commented 7 months ago

I just checked the raw SPICE-1.1.3.hdf5; there seem to be no conformations whose coordinates are all zeros.
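
A check of this kind can be sketched as follows. It assumes each molecule is stored as a group with a "conformations" array of shape (n_conformations, n_atoms, 3); `find_degenerate_conformations` is a hypothetical helper, and the toy dictionary stands in for the real file.

```python
import numpy as np


def find_degenerate_conformations(groups):
    """Return (group_name, conformation_index, reason) for every conformation
    that is all zeros or has two atoms at the same position.

    groups is any mapping of name -> mapping with a "conformations" entry
    of shape (n_conformations, n_atoms, 3); this layout is an assumption.
    """
    bad = []
    for name, group in groups.items():
        confs = np.asarray(group["conformations"], dtype=float)
        for k, pos in enumerate(confs):
            if not np.any(pos):
                bad.append((name, k, "all zeros"))
            elif len(np.unique(pos, axis=0)) < len(pos):
                bad.append((name, k, "coincident atoms"))
    return bad


# Toy data standing in for the real file.
toy = {
    "mol_a": {"conformations": [[[0, 0, 0], [1, 0, 0]],
                                [[0, 0, 0], [0, 0, 0]]]},
    "mol_b": {"conformations": [[[1, 1, 1], [1, 1, 1]]]},
}
bad = find_degenerate_conformations(toy)
print(bad)
```

An open h5py file exposes the same mapping interface, so it could be passed in directly if the file follows the assumed layout. Note that this inspects the raw download only; the processed copy that the training code builds from it would need the same check applied to its own arrays.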

AndChenCM commented 7 months ago

It seems that the previously processed dataset somehow got corrupted. Re-downloading the dataset fixed the issue. Sorry for the trouble!