Open rashigeek opened 1 year ago
Hi @rashigeek
What batch_size are you using? A lower batch_size tends to under-utilize GPUs.
I have tried a wide range of batch sizes, including very large ones such as 1028, but performance was unaffected. I also tried passing batch_size as an argument, and the problem persisted.
I'm having the same problem: I can't use the GPU when running run_alignn_ff.py.
I am trying to train a force-field model using the following command from the README, with the paths adjusted to match my directories:
train_folder_ff.py --root_dir "alignn/examples/sample_data_ff" --config "alignn/examples/sample_data_ff/config_example_atomwise.json" --output_dir=temp
However, training is very slow and does not seem to utilize the GPU at all. This can be confirmed by running nvidia-smi during training and viewing the output:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.43.02              Driver Version: 535.43.02    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        Off | 00000000:07:00.0 Off |                  N/A |
|  0%   42C    P8              13W / 170W |     71MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1405      G   /usr/lib/xorg/Xorg                           56MiB |
|    0   N/A  N/A      1571      G   /usr/bin/gnome-shell                          5MiB |
+---------------------------------------------------------------------------------------+
When I train a model that does not use force fields, the GPU is used. For example, running
train_folder.py --root_dir "alignn/examples/sample_data" --config "alignn/examples/sample_data/config_example.json" --output_dir=temp
and simultaneously running nvidia-smi gives the following output:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.43.02              Driver Version: 535.43.02    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        Off | 00000000:07:00.0 Off |                  N/A |
|  0%   46C    P2              62W / 170W |    921MiB / 12288MiB |     39%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1405      G   /usr/lib/xorg/Xorg                           56MiB |
|    0   N/A  N/A      1571      G   /usr/bin/gnome-shell                          5MiB |
|    0   N/A  N/A     29095      C   .../miniconda3/envs/version/bin/python      848MiB |
+---------------------------------------------------------------------------------------+
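For anyone who wants to reproduce the comparison between the two runs without reading the full table by eye, here is a minimal sketch that polls GPU utilization and memory through nvidia-smi's CSV query mode. The helper names `parse_gpu_query` and `sample_gpu` are made up for illustration and are not part of ALIGNN. The sketch assumes nvidia-smi is on the PATH and that the queried fields are numeric (some GPUs report `[N/A]` for them, which this sketch does not handle).

```python
import subprocess

# nvidia-smi's query mode prints one CSV row per GPU, e.g. "39, 921"
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_query(text):
    """Parse 'util, mem' rows into a list of (util_percent, mem_mib) tuples."""
    rows = []
    for line in text.strip().splitlines():
        util, mem = (field.strip() for field in line.split(","))
        rows.append((int(util), int(mem)))
    return rows


def sample_gpu():
    """Run nvidia-smi once and return one (util_percent, mem_mib) per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_query(result.stdout)
```

Calling `sample_gpu()` every second or so in a separate terminal while training runs should give roughly `(0, 71)` for the force-field run and `(39, 921)` for the regular run on my machine, matching the tables above.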
I have done my best to check that all the dependencies are compatible and I can confirm that the device is switched to cuda in the train_folder_ff.py script.
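Since setting the device in the script does not by itself guarantee that the model and every batch end up on the GPU, a quick check like the one below could help narrow this down. It is only a sketch: `device_summary` is a hypothetical helper, not an ALIGNN function, and it assumes each object exposes a `.device` attribute with a `.type` field, as PyTorch tensors do.

```python
from collections import Counter


def device_summary(objects):
    """Count objects per device type ('cpu', 'cuda', ...).

    Works on anything exposing `.device.type`, e.g. PyTorch tensors
    or parameters. Anything counted under 'cpu' was never moved to the GPU.
    """
    return Counter(obj.device.type for obj in objects)


# Hypothetical usage inside the training loop (names are assumptions):
#   print(device_summary(model.parameters()))
#   print(device_summary([batch_tensor_1, batch_tensor_2]))
```

If the parameters show up under `cuda` but the batch tensors show up under `cpu` (or the other way round), that would point at where the data is being left behind.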