tomgoldstein / loss-landscape

Code for visualizing the loss landscape of neural nets
MIT License

Bug due to multiple writes? #4

Open suvojit-0x55aa opened 5 years ago

suvojit-0x55aa commented 5 years ago

I'm encountering this bug when trying to run on a 4-GPU system:

Traceback (most recent call last):
  File "plot_surface.py", line 291, in <module>
    crunch(surf_file, net, w, s, d, trainloader, 'train_loss', 'train_acc', comm, rank, args)
  File "plot_surface.py", line 82, in crunch
    f = h5py.File(surf_file, 'r+' if rank == 0 else 'r')
  File "/home/ubuntu/.local/lib/python2.7/site-packages/h5py/_hl/files.py", line 312, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/h5py/_hl/files.py", line 144, in make_fid
    fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 78, in h5py.h5f.open
IOError: Unable to open file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')

The command I used is:

mpirun -n 4 python plot_surface.py --mpi --cuda --model resnet56 --x=-1:1:51 --y=-1:1:51 \
--model_file cifar10/trained_nets/resnet56_sgd_lr=0.1_bs=128_wd=0.0005/model_300.t7 \
--dir_type weights --xnorm filter --xignore biasbn --ynorm filter --yignore biasbn  --plot

What could the issue be, given that the code checks for rank 0 before writing?

ljk628 commented 5 years ago

Hi Suvojit,

Thanks for reporting. I tested it on my side and it is working correctly. It is possible that the MPI environment or mpi4py was not correctly installed.

I assume you downloaded the resnet56.tar.gz file and unpacked it to cifar10/trained_nets/. To simplify the problem, could you create an empty file surf_file.h5 and run mpirun -n 4 python test_h5py.py? Here test_h5py.py contains:

from mpi4py import MPI
import h5py

# Report the HDF5 library version that h5py was built against.
print("hdf5_version=" + h5py.version.hdf5_version)

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Only rank 0 opens the file for writing; all other ranks open it read-only.
if rank == 0:
    f = h5py.File('surf_file.h5', 'r+')
    print('rank %d read and write' % rank)
else:
    f = h5py.File('surf_file.h5', 'r')
    print('rank %d read' % rank)

Ideally it should print something like the following, with no errors reported (the order of the ranks may vary):

rank 1 read
rank 2 read
rank 0 read and write
rank 3 read

suvojit-0x55aa commented 5 years ago

Hey, thanks for the reply. I tried running this and got this output:

Traceback (most recent call last):
  File "test_h5.py", line 11, in <module>
    f = h5py.File('surf_file.h5', 'r')
  File "/home/ubuntu/.local/lib/python2.7/site-packages/h5py/_hl/files.py", line 312, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/h5py/_hl/files.py", line 142, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 78, in h5py.h5f.open
IOError: Unable to open file (file signature not found)
Traceback (most recent call last):
  File "test_h5.py", line 11, in <module>
    f = h5py.File('surf_file.h5', 'r')
  File "/home/ubuntu/.local/lib/python2.7/site-packages/h5py/_hl/files.py", line 312, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/h5py/_hl/files.py", line 142, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 78, in h5py.h5f.open
IOError: Unable to open file (file signature not found)
Traceback (most recent call last):
  File "test_h5.py", line 11, in <module>
    f = h5py.File('surf_file.h5', 'r')
  File "/home/ubuntu/.local/lib/python2.7/site-packages/h5py/_hl/files.py", line 312, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/h5py/_hl/files.py", line 142, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 78, in h5py.h5f.open
IOError: Unable to open file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')
rank 0 read and write

So rank 0 is able to access the file, but the other processes can't.
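The "file signature not found" errors in this log are a different failure from the lock error. One possible reading, not confirmed in this thread, is that a zero-byte surf_file.h5 is not a valid HDF5 file, since HDF5 files begin with a fixed 8-byte signature. A minimal sketch for checking this with the standard library only follows; the helper name has_hdf5_signature is hypothetical, and it only inspects offset 0.

```python
import os
import tempfile

# The 8-byte format signature that begins an HDF5 superblock.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def has_hdf5_signature(path):
    """Return True if the file at `path` starts with the HDF5 signature.

    Simplification (assumption): only offset 0 is checked. A file written
    with a user block keeps its signature at a later offset and would be
    reported as False here.
    """
    with open(path, "rb") as fh:
        return fh.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE


def make_empty_file(directory, name="empty.h5"):
    """Create a zero-byte file (as `touch` would) and return its path."""
    path = os.path.join(directory, name)
    open(path, "wb").close()
    return path
```

Calling has_hdf5_signature on a file made by make_empty_file returns False. If that turns out to be the cause, creating the test file with h5py.File('surf_file.h5', 'w').close() instead of an empty file should give the test script a valid container to open.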

suvojit-0x55aa commented 5 years ago

OK, so I ran it a few times. It sometimes runs, but fails most of the time. Here is the log:

rank 0 read and write
Traceback (most recent call last):
  File "test_h5.py", line 11, in <module>
    f = h5py.File('surf_file.h5', 'r')
  File "/home/ubuntu/.local/lib/python2.7/site-packages/h5py/_hl/files.py", line 312, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/h5py/_hl/files.py", line 142, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 78, in h5py.h5f.open
IOError: Unable to open file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')
Traceback (most recent call last):
  File "test_h5.py", line 11, in <module>
    f = h5py.File('surf_file.h5', 'r')
  File "/home/ubuntu/.local/lib/python2.7/site-packages/h5py/_hl/files.py", line 312, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/h5py/_hl/files.py", line 142, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 78, in h5py.h5f.open
IOError: Unable to open file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')
rank 3 read
------------------

Another run:

rank 1 read
rank 3 read
rank 2 read
rank 0 read and write

ljk628 commented 5 years ago

It turns out to be an issue with the hdf5 upgrade. You can check the hdf5 version with dpkg -s libhdf5-dev or by adding print(h5py.version.hdf5_version) to the script.

As discussed in https://github.com/h5py/h5py/issues/712, the default hdf5 with Ubuntu 16.04 is 1.8.16 (https://packages.ubuntu.com/search?keywords=libhdf5-dev), but the same code fails with hdf5 1.10.x. This is a regression that breaks code that worked in 1.8. While downgrading the hdf5 library might work, we are working on a solution to deal with this update.
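To illustrate why a read-only open can fail even though only rank 0 writes, here is a small sketch of the mechanism. It assumes, as the errno 11 message suggests, that hdf5 1.10.x takes a non-blocking lock when opening a file: an exclusive one for a writer and a shared one for readers. The sketch uses plain fcntl.flock from the Python standard library rather than hdf5 itself, the function names are hypothetical, and it runs on Unix-like systems only.

```python
import errno
import fcntl
import os
import tempfile


def try_lock(fd, exclusive):
    """Attempt a non-blocking flock(); return None on success, else the errno."""
    mode = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    try:
        fcntl.flock(fd, mode | fcntl.LOCK_NB)
        return None
    except OSError as exc:
        return exc.errno


def demo_lock_conflict():
    """Hold a writer-style lock, then request a reader-style lock on the same file.

    Two separate os.open() calls stand in for two MPI ranks: each call creates
    its own open file description, so their flock() locks conflict.
    """
    with tempfile.NamedTemporaryFile() as tmp:
        writer_fd = os.open(tmp.name, os.O_RDWR)
        reader_fd = os.open(tmp.name, os.O_RDONLY)
        try:
            writer_result = try_lock(writer_fd, exclusive=True)   # like rank 0, 'r+'
            reader_result = try_lock(reader_fd, exclusive=False)  # like rank 1, 'r'
            return writer_result, reader_result
        finally:
            os.close(reader_fd)
            os.close(writer_fd)
```

demo_lock_conflict() returns None for the writer and errno.EAGAIN for the reader, which is the "Resource temporarily unavailable" code seen in the tracebacks. This would also explain why the failure is intermittent: whether a reader fails depends on whether rank 0 already holds its lock when the reader opens the file.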

suvojit-0x55aa commented 5 years ago

So I'm using hdf5 v1.8.16 and h5py v2.8.0, but the problem persists. Which version do you suggest?

ljk628 commented 5 years ago

It seems that the version info printed by dpkg -s libhdf5-dev and by print(h5py.version.hdf5_version) is not necessarily consistent, presumably because the h5py package installed by pip bundles its own copy of hdf5. h5py 2.8.0 prints 1.10.2 and h5py 2.7.0 prints 1.8.18. So please let me know the version reported by print(h5py.version.hdf5_version).

Installing h5py 2.7.0 should solve this problem, i.e., pip install h5py==2.7.0. You can check the installed h5py version with pip list. Note: use pip2 if Python 3 is also installed.
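As a rough way to tell which side of the regression an installation is on, the version string can be compared against 1.10. This is only a sketch using the standard library; the helper names parse_version and may_use_file_locking are hypothetical, and the 1.10 threshold is taken from the versions reported in this thread.

```python
import re


def parse_version(text):
    """Turn a dotted version string such as '1.10.2' into a tuple of ints.

    Non-numeric suffixes (for example '0-patch1') are cut at the first
    character that is not a digit.
    """
    parts = []
    for part in text.split(".")[:3]:
        match = re.match(r"\d+", part)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def may_use_file_locking(hdf5_version):
    """Return True for hdf5 1.10 or newer, the series reported as failing here."""
    return parse_version(hdf5_version) >= (1, 10)
```

In a script this would be called as may_use_file_locking(h5py.version.hdf5_version), so that it checks the library h5py actually loads rather than the system package. A workaround often mentioned for hdf5 1.10.1 and later is setting the environment variable HDF5_USE_FILE_LOCKING=FALSE before h5py is imported; that was not tested in this thread.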

Please let me know whether it works out.

suvojit-0x55aa commented 5 years ago

It worked. Thanks for the solution @ljk628

ljk628 commented 5 years ago

Sorry for the confusion! I changed the h5py requirement to 2.7.0 in https://github.com/tomgoldstein/loss-landscape/commit/75caf64979cc1d6238672d708decc5ecbf9695f9.

ascenoputing commented 5 years ago

> It seems that the version info printed from dpkg -s libhdf5-dev and print(h5py.version.hdf5_version) are not necessarily consistent. h5py 2.8.0 will print out 1.10.2 and h5py 2.7.0 will print out 1.8.18. So please let me know your version info by print(h5py.version.hdf5_version).
>
> Installing h5py 2.7.0 should solve this problem, i.e., pip install h5py==2.7.0. You can check the installed h5py version by pip list. Note that use pip2 if python 3 is also installed.
>
> Please let me know whether it works out.

I did as you suggested, but the hdf5_version is still 1.10.1.

KaleabTessera commented 4 years ago

@ascenoputing This PR https://github.com/tomgoldstein/loss-landscape/pull/28 should fix your issue.