vlimant / NNLO


Horovod MPI is not enabled #22

Closed by abidmalikwaterloo 5 years ago

abidmalikwaterloo commented 5 years ago

I am getting the following error:

 Initializing Horovod
Traceback (most recent call last):
  File "TrainingDriver.py", line 294, in <module>
    checkpoint=args.checkpoint, checkpoint_interval=args.checkpoint_interval)
  File "/gpfs/alpine/csc343/world-shared/amalik/NNLO/nnlo/mpi/manager.py", line 164, in __init__
    self.make_comms(comm)
  File "/gpfs/alpine/csc343/world-shared/amalik/NNLO/nnlo/mpi/manager.py", line 265, in make_comms
    checkpoint=self.checkpoint, checkpoint_interval=self.checkpoint_interval
  File "/gpfs/alpine/csc343/world-shared/amalik/NNLO/nnlo/mpi/process.py", line 495, in __init__
    checkpoint=checkpoint, checkpoint_interval=checkpoint_interval )
  File "/gpfs/alpine/csc343/world-shared/amalik/NNLO/nnlo/mpi/process.py", line 91, in __init__
    hvd.init(comm=self.process_comm)
  File "/ccs/home/amalik/.conda/envs/hvdpy27tf/lib/python2.7/site-packages/horovod/common/basics.py", line 46, in init
    'Horovod MPI is not enabled; Please make sure it\'s installed and enabled.')
ValueError: Horovod MPI is not enabled; Please make sure it's installed and enabled.
0000:00:05.679 M 0:0:- [WARNING] From /ccs/home/amalik/.conda/envs/hvdpy27tf/lib/python2.7/site-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

The same error is reported in https://github.com/horovod/horovod/issues/1383.

Is Horovod working?

I do not have a problem with the following:

import mpi4py
import horovod.keras as k

This suggests that both mpi4py and Horovod are installed and importable.
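
A successful import does not by itself show that Horovod was compiled with MPI support, which is what the ValueError above complains about. The sketch below is one possible way to probe this. `probe_horovod_mpi` is a hypothetical helper name, and it assumes the installed Horovod exposes `mpi_built()` and `mpi_enabled()` on the framework module (recent releases do; older ones may not, so the lookups are guarded).

```python
import importlib


def probe_horovod_mpi(module_name="horovod.keras"):
    """Report whether mpi4py and Horovod import, and whether Horovod sees MPI.

    Values of None mean "could not be determined" (module missing, or the
    installed Horovod does not expose that check).
    """
    report = {"mpi4py": False, "horovod": False,
              "mpi_built": None, "mpi_enabled": None}

    try:
        importlib.import_module("mpi4py")
        report["mpi4py"] = True
    except Exception:
        pass

    try:
        hvd = importlib.import_module(module_name)
        report["horovod"] = True
    except Exception:
        return report

    # Assumption: these helpers exist in the installed Horovod version.
    for name in ("mpi_built", "mpi_enabled"):
        check = getattr(hvd, name, None)
        if callable(check):
            try:
                report[name] = bool(check())
            except Exception:
                report[name] = None
    return report


if __name__ == "__main__":
    print(probe_horovod_mpi())
```

If `horovod` is True but `mpi_built` is False, the package imports but was built without MPI, which would be consistent with the traceback above.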

vlimant commented 5 years ago

Please try:

from mpi4py import MPI
import horovod.keras as hvd

# Duplicate the world communicator and pass it to Horovod explicitly
comm = MPI.COMM_WORLD.Dup()
hvd.init(comm=comm)

The issue comes from the Horovod distribution/installation. Either downgrading Horovod or installing it from the master branch works until a new version is released.
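
As a rough illustration of how a caller could surface this condition more clearly, the sketch below wraps the initialization and converts the "MPI is not enabled" ValueError into a message that points at the installation. `init_with_comm` is a hypothetical helper, not part of NNLO or Horovod; it only relies on `hvd.init(comm=...)`, `hvd.rank()` and `hvd.size()`.

```python
def init_with_comm(hvd, comm):
    """Initialise Horovod on `comm` and return (rank, size).

    `hvd` is the imported Horovod framework module (e.g. horovod.keras),
    `comm` an mpi4py communicator. Hypothetical helper for illustration.
    """
    try:
        hvd.init(comm=comm)
    except ValueError as err:
        if "MPI is not enabled" in str(err):
            # Point at the installation rather than at the calling code.
            raise RuntimeError(
                "Horovod was given an MPI communicator but reports no MPI "
                "support; try a different Horovod release or a build from "
                "the master branch."
            )
        raise
    return hvd.rank(), hvd.size()
```

With mpi4py available, a call might look like `init_with_comm(hvd, MPI.COMM_WORLD.Dup())`.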

abidmalikwaterloo commented 5 years ago

Yes, it works with the instructions given in the link.