tomgoldstein / loss-landscape

Code for visualizing the loss landscape of neural nets
MIT License

Is there any difference about using 'OpenMPI'? #15

Closed seongkyun closed 5 years ago

seongkyun commented 5 years ago

Hello. I just tried to run this code on Ubuntu 16.04 LTS with a GeForce TITAN X GPU and PyTorch 0.4.1. When I run the code with mpirun -n 4 python plot_surface.py --mpi --cuda --model vgg9 --x=-0.5:1.5:401 --dir_type states --model_file cifar10/trained_nets/vgg9_sgd_lr=0.1_bs=128_wd=0.0_save_epoch=1/model_300.t7 --model_file2 cifar10/trained_nets/vgg9_sgd_lr=0.1_bs=8192_wd=0.0_save_epoch=1/model_300.t7 --plot --show, nothing happens.

But if I just delete mpirun -n 4, the code starts running. I think it is in the training process, and once that finishes I can see the plotted results.

Can I run this code without OpenMPI? As far as I know, OpenMPI is only for parallel computation. So can I use the code with python plot_surface.py --mpi --cuda --model vgg9 --x=-0.5:1.5:401 --dir_type states --model_file cifar10/trained_nets/vgg9_sgd_lr=0.1_bs=128_wd=0.0_save_epoch=1/model_300.t7 --model_file2 cifar10/trained_nets/vgg9_sgd_lr=0.1_bs=8192_wd=0.0_save_epoch=1/model_300.t7 --plot --show?

ljk628 commented 5 years ago

@seongkyun, yes, you can run the code without MPI. Running the code with MPI splits the independent calculation jobs across different GPUs, and the results calculated by each worker are then collected and written to disk.
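To illustrate the split-and-collect idea described above, here is a minimal pure-Python sketch. It is not the repository's actual scheduling code: the helper names split_jobs and gather_results are hypothetical, the round-robin assignment is an assumption, and the squared index is only a stand-in for a real loss evaluation. The 401 jobs mirror the 401 points requested by --x=-0.5:1.5:401.

```python
# Hypothetical sketch of "split independent jobs, then collect the results".
# This is NOT the repository's scheduler; names and the round-robin policy
# are assumptions made for illustration only.

def split_jobs(num_jobs, nproc):
    """Assign job indices 0..num_jobs-1 to nproc workers round-robin."""
    return [list(range(rank, num_jobs, nproc)) for rank in range(nproc)]


def gather_results(per_worker):
    """Merge {index: value} dicts from all workers into one index-ordered list."""
    merged = {}
    for partial in per_worker:
        merged.update(partial)
    return [merged[i] for i in sorted(merged)]


if __name__ == "__main__":
    # 401 points on the x axis, 4 workers (as with mpirun -n 4).
    chunks = split_jobs(num_jobs=401, nproc=4)
    # Stand-in for the loss evaluation: each worker handles only its own indices.
    partials = [{i: i * i for i in chunk} for chunk in chunks]
    results = gather_results(partials)
    print(len(results), [len(chunk) for chunk in chunks])
```

With a single process the same loop simply runs over all indices, which is why dropping mpirun still produces the full plot, only more slowly.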

I have seen the same hang on a newly configured machine, while the same code works well on my own machine. Can you test the following simple command: mpirun -n 4 python test.py? The file test.py contains the following code:

import mpi4py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nproc = comm.Get_rank(), comm.Get_size()
print('rank: %d' % rank)

seongkyun commented 5 years ago

@ljk628, thank you for your reply. I've just run your test.py code and the result is below:

rank: 0

Is that okay?

The installed requirements are below:

pytorch 0.4.1
torchvision 0.2.1
openmpi 3.1.2
mpi4py 2.0.0
numpy 1.12.1
h5py 2.8.0
matplotlib 2.2.3
scipy 1.1.0

ljk628 commented 5 years ago

It should print out four lines (in any order) if you use mpirun -n 4, e.g.,

rank: 0
rank: 1
rank: 3
rank: 2
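The four-line expectation above can be checked mechanically. The following is a small sketch, assuming the output of test.py has been captured as a string. The helper missing_ranks is hypothetical and not part of the repository. It reports which ranks never printed, regardless of the order in which the lines arrive.

```python
import re


def missing_ranks(output, nproc):
    """Return the ranks in 0..nproc-1 that never appear as 'rank: N' in output."""
    found = re.findall(r"^rank: (\d+)[ \t]*$", output, flags=re.MULTILINE)
    seen = {int(r) for r in found}
    return sorted(set(range(nproc)) - seen)


if __name__ == "__main__":
    healthy = "rank: 0\nrank: 1\nrank: 3\nrank: 2\n"  # all four workers reported
    broken = "rank: 0\n"                              # only one line was printed
    print(missing_ranks(healthy, 4))
    print(missing_ranks(broken, 4))
```

An empty list means every worker started. Any missing rank suggests that mpirun and mpi4py are not cooperating on that machine.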

h5py 2.8.0 does not work with this repo. Please check https://github.com/tomgoldstein/loss-landscape/issues/4 and https://github.com/tomgoldstein/loss-landscape/issues/12 for related issues.

seongkyun commented 5 years ago

Thanks. I'll try that.