JonathanSchmidt1 opened this issue 2 years ago
Hi @JonathanSchmidt1 ,
Thanks for your interest in our code/method for your project! Sounds like an interesting application; please feel free to get in touch by email and let us know how it's going (we're always interested to hear about what people are working on using our methods).
Re multi-GPU training: I have a draft horovod branch using the Horovod distributed training framework. This is an in-progress draft and has only been successfully tested so far for a few epochs on multiple CPUs. The branch is also a little out-of-sync with the latest version, but I will try to merge that back in in the coming days. If you are interested, you are more than welcome to use this branch, just with the understanding that you would be acting as a sort of "alpha tester." If you do use the branch, please carefully check any results you get for sanity and against those with Horovod disabled, and please report any issues/suspicions here or by email. (One disclaimer is that the horovod branch is not a development priority for us this summer, and I will likely be slow to respond.) PRs are also welcome, though I appreciate people reaching out to discuss first if the PR involves major development or restructuring.
PyTorch Lightning is a lot more difficult to integrate with. Getting a simple training loop going would be easy, but it would use a different configuration file, and integrating it with the full set of important nequip features (correctly calculated and averaged metrics, careful data normalization, EMA, correct global numerical precision and JIT settings, etc.) would be difficult and involve a lot of subtle stumbling blocks that we have already dealt with in the nequip code. For this reason I would really recommend against this path unless you want to deal carefully with all of this. (If you do, of course, it would be great if you could share that work!)
Thanks!
OK, I've merged the latest develop -> horovod, see https://github.com/mir-group/nequip/pull/211.
If you try this, please run the Horovod unit tests in tests/integration/test_train_horovod.py and confirm that they (1) are not skipped (i.e. Horovod is installed) and (2) pass.
Thank you very much. I will see how it goes.
As usual, other things got in the way, but I could finally test it. Running tests/integration/test_train_horovod.py worked. I also confirmed that normal training on GPU worked (nequip-train configs/minimal.yaml).
Now if I run with --horovod, the training of the first epoch seems fine, but there is a problem with the metrics. I checked the torch_runstats lib and could not find any get_state; are you maybe using a modified version?
Epoch batch loss loss_f f_mae f_rmse
0 1 1.06 1.06 24.3 32.5
Traceback (most recent call last):
File "/home/test_user/.conda/envs/nequip2/bin/nequip-train", line 33, in
Hi @JonathanSchmidt1 ,
Surprised that the tests run if the training won't... that sounds like a sign that the tests are broken 😄
Whoops yes, I forgot to mention: I haven't merged the code I was writing to enable multi-GPU training in torch_runstats yet; you can find it on the branch https://github.com/mir-group/pytorch_runstats/tree/state-reduce.
Thank you, that fixed it for one GPU.
horovodrun -np 1 nequip-train configs/example.yaml --horovod
works now.
If I use two GPUs, we get an error message because some tensors are on the wrong devices during the metric evaluation.
File "/raid/scratch/testuser/nequip/nequip/train/trainer.py", line 993, in epoch_step
[1,0]
I checked, and "n" and "state" are on cuda:1 while "self._state" and "self._n" are on cuda:0. I'm not sure how it's supposed to be: are they all expected to be on cuda:0 for this step, or each on their own GPU?
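For what it's worth, a naive fix might be to just move the incoming tensors onto the local accumulator's device before combining them; something like this sketch (hypothetical, not the actual torch_runstats code, and assuming the state is just a running sum plus a sample count):

import torch

# Hypothetical device-safe reduction step (not the real torch_runstats code):
# assume the accumulator keeps a running sum (self._state) and a count (self._n).
def accumulate_gathered(self_state: torch.Tensor, self_n: torch.Tensor,
                        state: torch.Tensor, n: torch.Tensor):
    # Move the incoming tensors to the device of the local accumulator, so that
    # ranks which deserialized the data onto a different CUDA device still combine.
    state = state.to(self_state.device)
    n = n.to(self_n.device)
    return self_state + state, self_n + n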
Aha... here's that "this is very untested" 😁 I think PyTorch / Horovod may be too smart for its own good and reloading transmitted tensors onto different CUDA devices when they are all available to the same host... I will look into this when I get a chance.
That would be great. I will also try to find the time to look into it, but I think I will need some time to understand the whole codebase.
I thought reviving the issue might be more convenient than continuing by email. So here are some quick notes about issues I noticed when testing the ddp branch.
Every process seems to get its own wandb log. It's not possible to restart because wandb finds an existing run in each process and then crashes.
Sometimes there is a random crash after a few hundred epochs; I have no idea yet why, and it was not reproducible.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 215968 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 215970 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 215971 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -15) local_rank: 1 (pid: 215969) of binary: /home/test_user/.conda/envs/nequip2/bin/python
Traceback (most recent call last):
File "/home/test_user/.conda/envs/nequip2/bin/torchrun", line 8, in
At the moment each process seems to load the network on each GPU; e.g., running with 8 GPUs I get this output from nvidia-smi:
| 0 N/A N/A 804401 C ...a/envs/nequip2/bin/python 18145MiB |
| 0 N/A N/A 804402 C ...a/envs/nequip2/bin/python  1499MiB |
| 0 N/A N/A 804403 C ...a/envs/nequip2/bin/python  1499MiB |
| 0 N/A N/A 804404 C ...a/envs/nequip2/bin/python  1499MiB |
| 0 N/A N/A 804405 C ...a/envs/nequip2/bin/python  1499MiB |
| 0 N/A N/A 804406 C ...a/envs/nequip2/bin/python  1499MiB |
| 0 N/A N/A 804407 C ...a/envs/nequip2/bin/python  1499MiB |
| 0 N/A N/A 804408 C ...a/envs/nequip2/bin/python  1499MiB |
| 1 N/A N/A 804401 C ...a/envs/nequip2/bin/python  1499MiB |
| 1 N/A N/A 804402 C ...a/envs/nequip2/bin/python 19101MiB |
| 1 N/A N/A 804403 C ...a/envs/nequip2/bin/python  1499MiB |
| 1 N/A N/A 804404 C ...a/envs/nequip2/bin/python  1499MiB |
| 1 N/A N/A 804405 C ...a/envs/nequip2/bin/python  1499MiB |
| 1 N/A N/A 804406 C ...a/envs/nequip2/bin/python  1499MiB |
| 1 N/A N/A 804407 C ...a/envs/nequip2/bin/python  1499MiB |
| 1 N/A N/A 804408 C ...a/envs/nequip2/bin/python  1499MiB |
| 2 N/A N/A 804401 C ...a/envs/nequip2/bin/python  1499MiB |
| 2 N/A N/A 804402 C ...a/envs/nequip2/bin/python  1499MiB |
| 2 N/A N/A 804403 C ...a/envs/nequip2/bin/python 17937MiB |
| 2 N/A N/A 804404 C ...a/envs/nequip2/bin/python  1499MiB |
| 2 N/A N/A 804405 C ...a/envs/nequip2/bin/python  1499MiB |
| 2 N/A N/A 804406 C ...a/envs/nequip2/bin/python  1499MiB |
| 2 N/A N/A 804407 C ...a/envs/nequip2/bin/python  1499MiB |
| 2 N/A N/A 804408 C ...a/envs/nequip2/bin/python  1499MiB |
| 3 N/A N/A 804401 C ...a/envs/nequip2/bin/python  1499MiB |
| 3 N/A N/A 804402 C ...a/envs/nequip2/bin/python  1499MiB |
| 3 N/A N/A 804403 C ...a/envs/nequip2/bin/python  1499MiB |
......
Hi @JonathanSchmidt1 ,
Thanks!
Every process seems to get its own wandb log. It's not possible to restart because wandb finds an existing run in each process and then crashes.
Hm yes... this one will be a little nontrivial, since we need to not only prevent wandb init on other ranks but probably also sync the wandb-updated config to the nonzero ranks.
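Roughly, I imagine something like this sketch (not actual nequip code; the project name is a placeholder, and it assumes torch.distributed is already initialized):

import torch.distributed as dist
import wandb

def init_wandb_rank0(config: dict) -> dict:
    # Only rank 0 creates and talks to the wandb run.
    rank = dist.get_rank() if dist.is_initialized() else 0
    if rank == 0:
        run = wandb.init(project="nequip", config=config, resume="allow")  # placeholder project name
        config = run.config.as_dict()  # wandb may have updated/merged the config
    # Broadcast the (possibly updated) config from rank 0 to the nonzero ranks.
    payload = [config]
    if dist.is_initialized():
        dist.broadcast_object_list(payload, src=0)
    return payload[0]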
Sometimes there is a random crash after a few hundred epochs; I have no idea yet why, and it was not reproducible.
Weird... usually when we see something like this it means out-of-memory, or that the cluster's scheduler went crazy.
At the moment each process seems to load the network on each GPU; e.g., running with 8 GPUs I get this output from nvidia-smi:
Not sure exactly what I'm looking at here, but yes, every GPU will get its own copy of the model, as hinted by the name "Distributed Data Parallel".
Out-of-memory errors could make sense and might be connected to the last issue, since with the same batch size per GPU I did not get OOM errors when running on a single GPU.
The output basically says that each worker process uses memory (most likely a copy of the model) on every GPU; however, with DDP each worker is supposed to have a copy only on its own GPU, and gradient updates are then communicated all-to-all. From previous experience with DDP, I would expect the output to look like this:
| 0 N/A N/A 804401 C ...a/envs/nequip2/bin/python 18145MiB |
| 1 N/A N/A 804402 C ...a/envs/nequip2/bin/python 19101MiB |
| 2 N/A N/A 804403 C ...a/envs/nequip2/bin/python 17937MiB |
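For reference, the usual DDP device-pinning pattern looks roughly like this (a generic sketch assuming a torchrun launch that sets LOCAL_RANK, not the nequip code itself); pinning each process to its own GPU is what prevents every rank from also allocating memory on cuda:0:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_model_for_ddp(model: torch.nn.Module) -> DDP:
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)            # default "cuda" device for this process
    if not dist.is_initialized():
        dist.init_process_group(backend="nccl")
    model = model.to(f"cuda:{local_rank}")
    # Each rank holds exactly one copy of the model on its own GPU;
    # gradients are all-reduced across ranks during backward.
    return DDP(model, device_ids=[local_rank], output_device=local_rank)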
I'd also be very interested in this feature. I have access to a system with four A100s on each node. Being able to use all four would make training go a lot faster.
I spent some time debugging the issue, and it seems that the metrics.gather and loss.gather calls cause the extra processes to spawn. If I remove these calls there is only one process per GPU, and I can scale to 16 GPUs (before, it would run OOM because of the extra processes). However, continuing the training after stopping still somehow causes extra processes to spawn, but only on the zeroth GPU.
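One guess for avoiding that would be to gather the metric state as CPU objects, so unpickling never touches another rank's GPU; a sketch (not the actual gather implementation in the branch):

import torch
import torch.distributed as dist

def gather_state_via_cpu(state: torch.Tensor):
    # Gather on CPU so deserialization doesn't open a CUDA context on GPU 0.
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, state.detach().cpu())
    # Move the gathered copies back onto this rank's device for reduction.
    return [t.to(state.device) for t in gathered]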
Hi all,
Any updates on this feature? I also have some rather large datasets.
Just a small update. As I had access to a different cluster with Horovod, I tested the horovod branch again, and with the fixed runstats version and a few small changes it ran without the issues of the ddp version. I also got decent speedups, despite using single-GPU nodes:
N_nodes (1 P100 per node): [1, 2, 4, 8, 16, 32]
Speedup: [1.0, 1.6286277105250644, 3.3867286549788127, 6.642094103901569, 9.572247883815873, 17.38443770824977]
PS: I have not yet confirmed whether the loss is the same for different node numbers with Horovod.
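For context, the standard Horovod wiring around a PyTorch model and optimizer looks like this (a generic sketch of the usual Horovod API usage, not the nequip horovod branch itself):

import horovod.torch as hvd
import torch

def setup_horovod(model: torch.nn.Module, optimizer: torch.optim.Optimizer):
    hvd.init()
    torch.cuda.set_device(hvd.local_rank())      # one GPU per process
    model.cuda()
    # Average gradients across workers during optimizer.step().
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters()
    )
    # Start all workers from the same weights and optimizer state.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)
    return model, optimizer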
Hi @JonathanSchmidt1,
Did you also receive a message like this when using the horovod branch on 2 GPUs:
[1,0]<stderr>:Processing dataset...
[1,1]<stderr>:Processing dataset...
The dataset processing only seems to happen in one process for me, so I only get the message once. Anyway, if that is causing problems for you, it might work to process the dataset beforehand and then start the training. PS: I have tested some of the models now and the loss reported during training seems correct.
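If it ever does become a problem, the usual trick is to let rank 0 process the dataset while the other ranks wait at a barrier, roughly like this sketch (not the nequip code; build_fn is a placeholder for whatever constructs and caches the dataset):

import torch.distributed as dist

def build_dataset_rank0_first(build_fn):
    # Non-zero ranks wait until rank 0 has processed and cached the dataset.
    if dist.is_initialized() and dist.get_rank() != 0:
        dist.barrier()
    dataset = build_fn()  # rank 0 processes; other ranks load the existing cache
    if dist.is_initialized() and dist.get_rank() == 0:
        dist.barrier()    # release the waiting ranks
    return dataset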
Hi,
I am also quite interested in the multi-GPU training capability. I did some tests with the ddp branch using PyTorch 2.1.1 on up to 16 GPUs (4 V100 per node) with a dataset of ~5k configurations. In all my tests I achieved the same results as a single-GPU reference. I was wondering whether this feature is still under active development and if there is any plan to merge it into the develop branch?
Hi @sklenard,
I am trying to use the multi-GPU feature, but I am having some trouble with it.
I installed the ddp branch with PyTorch 2.1.1 by changing
"torch>=1.8,<=1.12,!=1.9.0", # torch.fx added in 1.8
to
"torch>=1.8,<=2.1.1,!=1.9.0", # torch.fx added in 1.8
in setup.py in the nequip folder.
This way, the ddp branch installs without any error. However, when I try to run nequip-train, I get this error:
[W init.cpp:842] Warning: Use _jit_set_fusion_strategy, bailout depth is deprecated. Setting to (STATIC, 2) (function operator())
Traceback (most recent call last):
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/bin/nequip-train", line 8, in <module>
sys.exit(main())
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/nequip/scripts/train.py", line 76, in main
trainer = fresh_start(config)
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/nequip/scripts/train.py", line 189, in fresh_start
config = init_n_update(config)
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/nequip/utils/wandb.py", line 17, in init_n_update
wandb.init(
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1200, in init
raise e
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1177, in init
wi.setup(kwargs)
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 190, in setup
self._wl = wandb_setup.setup(settings=setup_settings)
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 327, in setup
ret = _setup(settings=settings)
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 320, in _setup
wl = _WandbSetup(settings=settings)
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 303, in __init__
_WandbSetup._instance = _WandbSetup__WandbSetup(settings=settings, pid=pid)
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 114, in __init__
self._setup()
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 250, in _setup
self._setup_manager()
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 277, in _setup_manager
self._manager = wandb_manager._Manager(settings=self._settings)
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 139, in __init__
self._service.start()
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/service/service.py", line 250, in start
self._launch_server()
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/service/service.py", line 244, in _launch_server
_sentry.reraise(e)
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/analytics/sentry.py", line 154, in reraise
raise exc.with_traceback(sys.exc_info()[2])
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/service/service.py", line 242, in _launch_server
self._wait_for_ports(fname, proc=internal_proc)
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/service/service.py", line 132, in _wait_for_ports
raise ServiceStartTimeoutError(
wandb.sdk.service.service.ServiceStartTimeoutError: Timed out waiting for wandb service to start after 30.0 seconds. Try increasing the timeout with the `_service_wait` setting.
It seems that there is something wrong with wandb. I wonder how you installed this branch; maybe there is some difference between the version you installed and the one I installed, since more than two months have passed. It would be great if you could recall how you installed it or share the version you used. Thank you very much!
@beidouamg this looks like a network error unrelated to the ddp branch, but maybe there is a race condition. Have you tried running without wandb enabled?
@JonathanSchmidt1 I'm trying to run multi-GPU testing now using the ddp branch (based on the horovod branch), as this is now under active development. For this:
I spent some time debugging the issue and it seems that the metrics.gather and loss.gather calls cause the extra processes to spawn. If I remove these calls it's only one process per GPU and I can scale to 16 GPUs (before it would run OOM because of the extra processes). However continuing the training after stopping still somehow causes extra processes to spawn but just on the zeroth GPU.

So if you comment out these calls, is it still working as expected? Or were there other changes you made?
You mentioned that you got it working with this, the updated pytorch_runstats, and some other small changes. I'm currently trying to do this and seem to have multi-GPU training up and running with the ddp branch, but the training seems to be going quite slowly (i.e. with 2 GPUs and batch_size: 4 it's 50% slower than 1 GPU with batch_size: 5; I had to change the batch size to make it divisible by the number of GPUs). If I print nvidia-smi on the compute node I get:
(base) Perlmutter: sean > nvidia-smi
Wed Jul 10 14:15:52 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:03:00.0 Off | 0 |
| N/A 34C P0 82W / 400W | 4063MiB / 40960MiB | 23% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... On | 00000000:41:00.0 Off | 0 |
| N/A 34C P0 95W / 400W | 2735MiB / 40960MiB | 76% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... On | 00000000:82:00.0 Off | 0 |
| N/A 35C P0 112W / 400W | 2771MiB / 40960MiB | 70% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... On | 00000000:C1:00.0 Off | 0 |
| N/A 35C P0 93W / 400W | 2673MiB / 40960MiB | 78% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 736896 C ...lti_gpu_nequip/bin/python 2740MiB |
| 0 N/A N/A 736897 C ...lti_gpu_nequip/bin/python 440MiB |
| 0 N/A N/A 736898 C ...lti_gpu_nequip/bin/python 440MiB |
| 0 N/A N/A 736899 C ...lti_gpu_nequip/bin/python 440MiB |
| 1 N/A N/A 736897 C ...lti_gpu_nequip/bin/python 2732MiB |
| 2 N/A N/A 736898 C ...lti_gpu_nequip/bin/python 2768MiB |
| 3 N/A N/A 736899 C ...lti_gpu_nequip/bin/python 2670MiB |
+-----------------------------------------------------------------------------+
which seems to be intermediate between what you posted before with horovod (X processes each on each of X GPUs) and what you said should happen (1 process each): here I get X processes on GPU 0 and 1 process on each other GPU.
I tried commenting out the metrics.gather() and loss.gather() methods as you suggested above, but this doesn't seem to have made any difference to the run times or the nvidia-smi output 🤔
@kavanase, I'm also involved in this issue; is there any way you could share your run (or nequip-train) command to get the ddp branch to actually work on multiple GPUs?
Hi @rschireman, sorry for the delay in replying! This is the current job script I'm using:
#!/bin/bash
#SBATCH -J Nequip_training_
#SBATCH -C gpu
#SBATCH -q shared
#SBATCH -N 1 # nodes
#SBATCH --ntasks-per-node=2 # one per GPU
#SBATCH -c 32
#SBATCH --gres=gpu:2 # GPUs per node
#SBATCH -t 0-02:40 # runtime in D-HH:MM, minimum of 10 minutes
#SBATCH --output=stdout_%j.txt
#SBATCH --error=stderr_%j.txt
master_port=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
export MASTER_PORT=$master_port
# - Master node address
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr
world_size=$(($SLURM_NTASKS_PER_NODE * $SLURM_NNODES))
export nproc_per_node=$SLURM_NTASKS_PER_NODE
echo "MASTER_ADDR="$master_addr
echo "MASTER_PORT="$master_port
echo "WORLD_SIZE="$world_size
echo "NNODES="$SLURM_NNODES
echo "NODE LIST="$SLURM_JOB_NODELIST
echo "NPROC_PER_NODE="$nproc_per_node
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
echo "PYTHON VERSION=$(python --version)"
ngpu=2 # can later set this to an environment variable
source ~/.bashrc
export LANG=en_US.utf8
export LC_ALL=en_US.utf8
source activate multi_gpu_nequip
source export_DDP_vars.sh
export PYTORCH_VERSION_WARNING=0
torchrun --nnodes 1 --nproc_per_node $ngpu `which nequip-train` nequip*.yaml --distributed
This is running on NERSC Perlmutter, which uses Slurm as the scheduler. I'm not sure which settings here are actually necessary for the job to run, as I'm still in the trial-and-error stage and plan to prune down to figure out which ones are actually needed once I get some consistency in the jobs running. Some of these choices were motivated by what I read here:
My export_DDP_vars.sh is (slightly modified from the nersc-dl-wandb one):
export RANK=$SLURM_PROCID
export WORLD_RANK=$SLURM_PROCID
export LOCAL_RANK=$SLURM_LOCALID
export WORLD_SIZE=$SLURM_NTASKS
#export MASTER_PORT=29500 # default from torch launcher
export WANDB_START_METHOD="thread"
This now seems to be mostly up and running, but as mentioned above it currently seems slower than expected, and I'm not sure if the rank distribution shown in the nvidia-smi output is as it should be... Still testing this out.
As noted in https://github.com/mir-group/nequip/pull/450, the state-reduce branch of pytorch_runstats also currently needs to be used with the DDP branch. In the above links it is also recommended to use srun rather than torchrun; this was causing issues for me at first, but I will try switching back to srun to see if I can get it working properly.
Currently I'm seeing some runs fail apparently randomly; these are some of the error outputs I'm getting:
[2024-07-11 03:48:21,160] torch.distributed.run: [WARNING]
[2024-07-11 03:48:21,160] torch.distributed.run: [WARNING] *****************************************
[2024-07-11 03:48:21,160] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, pl
ease further tune the variable for optimal performance in your application as needed.
[2024-07-11 03:48:21,160] torch.distributed.run: [WARNING] *****************************************
Using `torch.distributed`; this is rank 0/2 (local rank: 0)
Using `torch.distributed`; this is rank 1/2 (local rank: 1)
Torch device: cuda
Number of weights: 809016
Number of trainable weights: 809016
Traceback (most recent call last):
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/bin/nequip-train", line 33, in <module>
    sys.exit(load_entry_point('nequip', 'console_scripts', 'nequip-train')())
  File "/global/u2/k/kavanase/Packages/multi_gpu_nequip/nequip/nequip/scripts/train.py", line 113, in main
    trainer = restart(config)
  File "/global/u2/k/kavanase/Packages/multi_gpu_nequip/nequip/nequip/scripts/train.py", line 372, in restart
    trainer = Trainer.from_dict(dictionary)
  File "/global/u2/k/kavanase/Packages/multi_gpu_nequip/nequip/nequip/train/trainer.py", line 697, in from_dict
    trainer = cls(model=model, **dictionary)
  File "/global/u2/k/kavanase/Packages/multi_gpu_nequip/nequip/nequip/train/trainer.py", line 412, in __init__
    self.init()
  File "/global/u2/k/kavanase/Packages/multi_gpu_nequip/nequip/nequip/train/trainer.py", line 785, in init
    self.init_objects()
  File "/global/u2/k/kavanase/Packages/multi_gpu_nequip/nequip/nequip/train/trainer.py", line 431, in init_objects
    self.model = torch.nn.parallel.DistributedDataParallel(self.model)
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 798, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/utils.py", line 263, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1708025845868/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, internal error - please report this issue to the NCCL developers, NCCL version 2.19.3
ncclInternalError: Internal check failed.
Last error:
Attribute busid of node nic not found
[2024-07-11 03:48:46,175] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 947123) of binary: /global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/bin/python
Traceback (most recent call last):
File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.2.1', 'console_scripts', 'torchrun')())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
or
[2024-07-10 14:49:44,231] torch.distributed.run: [WARNING] *****************************************
[2024-07-10 14:49:44,231] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-10 14:49:44,231] torch.distributed.run: [WARNING] *****************************************
[W socket.cpp:464] [c10d] The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use).
[W socket.cpp:464] [c10d] The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).
[E socket.cpp:500] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.2.1', 'console_scripts', 'torchrun')())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
result = agent.run()
^^^^^^^^^^^
File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 727, in run
result = self._invoke_run(role)
^^^^^^^^^^^^^^^^^^^^^^
File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 862, in _invoke_run
self._initialize_workers(self._worker_group)
File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 699, in _initialize_workers
self._rendezvous(worker_group)
File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 542, in _rendezvous
store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 55, in next_rendezvous
self._store = TCPStore( # type: ignore[call-arg]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).
/global/u2/k/kavanase/Packages/multi_gpu_nequip/nequip/nequip/__init__.py:22: UserWarning: !! PyTorch version 2.2.1 found. Upstream issues in PyTorch versions 1.13.* and 2.* have been seen to cause unusual performance degredations on some CUDA systems that become worse over time; see https://github.com/mir-group/nequip/discussions/311. The best tested PyTorch version to use with CUDA devices is 1.11; while using other versions if you observe this problem, an unexpected lack of this problem, or other strange behavior, please post in the linked GitHub issue.
warnings.warn(pytorch_version_warning)
[W socket.cpp:464] [c10d] The server socket has failed to bind to [::]:18164 (errno: 98 - Address already in use).
[W socket.cpp:464] [c10d] The server socket has failed to bind to 0.0.0.0:18164 (errno: 98 - Address already in use).
[E socket.cpp:500] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/bin/nequip-train", line 33, in <module>
    sys.exit(load_entry_point('nequip', 'console_scripts', 'nequip-train')())
  File "/global/u2/k/kavanase/Packages/multi_gpu_nequip/nequip/nequip/scripts/train.py", line 79, in main
    _init_distributed(config.distributed)
  File "/global/u2/k/kavanase/Packages/multi_gpu_nequip/nequip/nequip/utils/_global_options.py", line 128, in _init_distributed
    dist.init_process_group(backend=distributed, timeout=timedelta(hours=2))  # TODO: Should dynamically set this, just for processing part?
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 86, in wrapper
    func_return = func(*args, **kwargs)
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1177, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
  File "/global/homes/k/kavanase/miniconda3/envs/multi_gpu_nequip/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 174, in _create_c10d_store
    return TCPStore(
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:18164 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:18164 (errno: 98 - Address already in use).
srun: error: nid001089: tasks 0,3: Exited with exit code 1
srun: Terminating StepId=27928164.0
slurmstepd: error: *** STEP 27928164.0 ON nid001089 CANCELLED AT 2024-07-11T00:33:51 ***
srun: error: nid001089: task 1: Exited with exit code 1
srun: error: nid001089: task 2: Terminated
srun: Force Terminated StepId=27928164.0
Final notes for posterity:
- There can be problems when processed_data_dir_...s are present (from previous crashed runs); will try to fix this in the code in future.
- I commented out the gather() method calls in nequip (ddp) as suggested by @JonathanSchmidt1, though I'm not sure if this breaks something else? If @Linux-cpp-lisp has a chance at some point he might be able to comment on this.

Hi, I honestly forgot most of the issues with the ddp branch and would probably need a few hours of free time to figure out what was going on again, but as mentioned, with the horovod branch most of the issues went away. I got great scaling even on really outdated nodes (Piz Daint, 1 P100 per node). Is it an option for you to use the horovod branch?
This would be my Slurm submission script for Horovod:
#!/bin/bash -l
#SBATCH --job-name=test_pt_hvd
#SBATCH --time=02:00:00
##SBATCH --nodes=$1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=12
#SBATCH --constraint=gpu
#SBATCH --account=s1128
#SBATCH --partition=normal
#SBATCH --output=test_pt_hvd_%j.out
module load daint-gpu PyTorch
cd $SLURM_SUBMIT_DIR
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export NCCL_DEBUG=INFO
export NCCL_IB_HCA=ipogif0
export NCCL_IB_CUDA_SUPPORT=1
srun nequip-train ETO_$SLURM_NNODES.yaml
We are interested in training nequip potentials on large datasets of several million structures. Consequently, we wanted to know whether multi-GPU support exists, or if someone knows whether the networks can be integrated into PyTorch Lightning.
Best regards and thank you very much,
Jonathan
PS: this might be related to #126