usnistgov / alignn

Atomistic Line Graph Neural Network https://scholar.google.com/citations?user=9Q-tNnwAAAAJ&hl=en
https://jarvis.nist.gov/jalignn/

train_folder_ff does not utilize GPU #115

Open rashigeek opened 1 year ago

rashigeek commented 1 year ago

I am trying to train a force-field model using a variation of the following command from the README, adjusted to match my directories:

```
train_folder_ff.py --root_dir "alignn/examples/sample_data_ff" --config "alignn/examples/sample_data_ff/config_example_atomwise.json" --output_dir=temp
```

However, training is extremely slow and does not seem to utilize the GPU at all. This can be further confirmed by running nvidia-smi during training and viewing the output:

```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.43.02              Driver Version: 535.43.02    CUDA Version: 12.2      |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        Off | 00000000:07:00.0 Off |                  N/A |
|  0%   42C    P8             13W / 170W  |     71MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                              Usage      |
|=======================================================================================|
|    0   N/A  N/A      1405      G   /usr/lib/xorg/Xorg                            56MiB |
|    0   N/A  N/A      1571      G   /usr/bin/gnome-shell                           5MiB |
+---------------------------------------------------------------------------------------+
```
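For reference, a minimal sanity check (not part of the original report) that PyTorch itself can see the GPU in this environment, assuming a standard CUDA-enabled PyTorch install, would be something like:

```python
import torch

# Confirm that PyTorch was built with CUDA support and can see the card.
print("torch version :", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device count  :", torch.cuda.device_count())
    print("device name   :", torch.cuda.get_device_name(0))
```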

If I train a model that does not use force fields, the GPU is used. For example, running

```
train_folder.py --root_dir "alignn/examples/sample_data" --config "alignn/examples/sample_data/config_example.json" --output_dir=temp
```

and simultaneously running nvidia-smi gives the following output:

```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.43.02              Driver Version: 535.43.02    CUDA Version: 12.2      |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        Off | 00000000:07:00.0 Off |                  N/A |
|  0%   46C    P2             62W / 170W  |    921MiB / 12288MiB |     39%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                              Usage      |
|=======================================================================================|
|    0   N/A  N/A      1405      G   /usr/lib/xorg/Xorg                            56MiB |
|    0   N/A  N/A      1571      G   /usr/bin/gnome-shell                           5MiB |
|    0   N/A  N/A     29095      C   .../miniconda3/envs/version/bin/python       848MiB |
+---------------------------------------------------------------------------------------+
```
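Note that in the working case a `python` process shows up in the compute-process list, while in the force-field case it never does. A hypothetical helper along these lines (a sketch only; the query flags are standard nvidia-smi options) can be dropped into a training script to check whether the current Python process ever acquires a CUDA context:

```python
import os
import subprocess

def holds_gpu_context() -> bool:
    """Return True if this Python process appears in nvidia-smi's
    compute-process list, i.e. it has actually created a CUDA context."""
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    pids = {int(p) for p in out.split() if p.strip().isdigit()}
    return os.getpid() in pids

if __name__ == "__main__":
    print("this process holds a GPU context:", holds_gpu_context())
```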

I have done my best to check that all the dependencies are compatible and I can confirm that the device is switched to cuda in the train_folder_ff.py script.
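A small debugging helper of this sort (purely illustrative; `model` and `batch` stand in for the network and a dataloader item inside train_folder_ff.py's training loop) could be used to print where the parameters and the data actually end up:

```python
import torch

def report_devices(model: torch.nn.Module, batch) -> None:
    """Print the device of the model parameters and of a dataloader batch.

    `batch` may be a single graph or a tuple/list of graphs and targets,
    as in the atomwise setup; the names here are illustrative only.
    """
    print("model parameters on:", next(model.parameters()).device)
    items = batch if isinstance(batch, (list, tuple)) else [batch]
    for i, item in enumerate(items):
        print(f"batch item {i} on:", getattr(item, "device", None))
```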

knc6 commented 1 year ago

Hi @rashigeek

What is the batch_size that you are using?

A lower batch_size tends to under-utilize the GPU.
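For anyone checking this, the batch size lives in the config JSON passed via --config. A quick way to inspect and raise it (paths and the value 64 are only illustrative, and this assumes the field is named batch_size as in the shipped example configs):

```python
import json

# Illustrative path; adjust to your setup.
cfg_path = "alignn/examples/sample_data_ff/config_example_atomwise.json"

with open(cfg_path) as f:
    cfg = json.load(f)

print("current batch_size:", cfg.get("batch_size"))
cfg["batch_size"] = 64  # illustrative value; raise until GPU memory is well used
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```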

rashigeek commented 1 year ago

I have tried a wide range of batch sizes, even very large ones such as 1028, but the performance was unaffected. I also tried passing batch_size as a command-line argument, and the problem still persisted.

ChemZhihaoWang commented 5 months ago

I'm having the same problem: the GPU is not used when running run_alignn_ff.py.

knc6 commented 5 months ago

I am not able to reproduce this issue. Running this example on Colab:

[screenshot: the example running on Colab]