vlimant / NNLO

ncclCommInitRank failed #29

Open vlimant opened 5 years ago

vlimant commented 5 years ago

While running

mpirun -x TERM=linux --map-by node --hostfile hostf --prefix /opt/openmpi-3.1.0 -np 3 --tag-output singularity exec --nv -B /imdata/ -B /storage/ /storage/group/gpu/soware/singularity/ibanks/edge.simg python3 TrainingDriver.py --model cifar10_arch.json --train train_cifar10.list --val test_cifar10.list --loss categorical_crossentropy --epochs 1 --n-process 2 --cache /imdata/ --timeline --batch 1000 --trial-name cifar_32

I get

[1,2]:Traceback (most recent call last):
[1,2]:  File "TrainingDriver.py", line 308, in <module>
[1,2]:    checkpoint=args.checkpoint, checkpoint_interval=args.checkpoint_interval)
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/mpi/manager.py", line 165, in __init__
[1,2]:    self.make_comms(comm)
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/mpi/manager.py", line 266, in make_comms
[1,2]:    checkpoint=self.checkpoint, checkpoint_interval=self.checkpoint_interval
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/mpi/process.py", line 493, in __init__
[1,2]:    checkpoint=checkpoint, checkpoint_interval=checkpoint_interval )
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/mpi/process.py", line 121, in __init__
[1,2]:    self.train()
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/mpi/process.py", line 557, in train
[1,2]:    train_metrics = self.model.train_on_batch( x=batch[0], y=batch[1] )
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/train/model.py", line 19, in wrapper
[1,2]:    return f(args, kwargs)
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/train/model.py", line 143, in train_on_batch
[1,2]:    return np.asarray(self.model.train_on_batch( args ))
[1,2]:  File "/usr/local/lib/python3.6/site-packages/keras/engine/training.py", line 1217, in train_on_batch
[1,2]:    outputs = self.train_function(ins)
[1,2]:  File "/usr/local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
[1,2]:    return self._call(inputs)
[1,2]:  File "/usr/local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
[1,2]:    fetched = self._callable_fn(array_vals)
[1,2]:  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1458, in __call__
[1,2]:    run_metadata_ptr)
[1,2]:tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled system error
[1,2]: [[{{node training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_dense_3_BiasAdd_grad_BiasAddGrad_0}}]]
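
A note on reading this output: the `[1,2]:` prefix on every line comes from the `--tag-output` option in the mpirun command and carries the job id and the MPI rank, so the traceback shown here appears to come from rank 2. When several ranks fail at once their lines interleave, and it can help to regroup them per rank. The following is only a minimal sketch of that idea; the helper name `split_by_rank` and the sample lines are illustrative and are not part of NNLO, and the pattern assumes the prefix may optionally carry a stream name such as `<stderr>`.

```python
import re
from collections import defaultdict

# Assumed prefix shape: "[jobid,rank]:" with an optional "<stream>" before the colon.
TAG = re.compile(r"^\[(\d+),(\d+)\](?:<(\w+)>)?:(.*)$")

def split_by_rank(lines):
    """Group tagged output lines by MPI rank, dropping the tag prefix.

    Hypothetical helper for illustration; untagged lines are ignored.
    """
    by_rank = defaultdict(list)
    for line in lines:
        match = TAG.match(line.rstrip("\n"))
        if match:
            rank = int(match.group(2))
            by_rank[rank].append(match.group(4))
    return dict(by_rank)

# Two lines copied from the report above, used as sample input.
sample = [
    "[1,2]:Traceback (most recent call last):",
    "[1,2]:tensorflow.python.framework.errors_impl.UnknownError: "
    "ncclCommInitRank failed: unhandled system error",
]

ranks = split_by_rank(sample)
for rank, text in sorted(ranks.items()):
    print(f"rank {rank}: {len(text)} line(s)")
```

Feeding the full captured log through such a helper would show whether only one rank hits the `ncclCommInitRank` error or whether every rank does.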