tomgoldstein / loss-landscape

Code for visualizing the loss landscape of neural nets
MIT License

where is the file .h5? #10

Open ouyangzhuzhu opened 5 years ago

ouyangzhuzhu commented 5 years ago

Hi friends! I have installed all the tools mentioned in README.md, downloaded the ResNet-56 (10 MB) model, and ran the command below:

mpirun -n 4 python plot_surface.py --mpi --cuda --model resnet56 --x=-1:1:51 --y=-1:1:51 \
--model_file cifar10/trained_nets/resnet56_sgd_lr=0.1_bs=128_wd=0.0005/model_300.t7 \
--dir_type weights --xnorm filter --xignore biasbn --ynorm filter --yignore biasbn --plot

But 24 hours later nothing has changed, and I can't find any newly created .h5 file. Where can I find the .h5 file, or did I miss something? Hope you can help. Thanks!

ljk628 commented 5 years ago

Hi @ouyangzhuzhu,

Thanks for your question. The h5 file should be generated in the same folder as your model file. With that command, there should be two .h5 files in the folder cifar10/trained_nets/resnet56_sgd_lr=0.1_bs=128_wd=0.0005/:

model_300.t7_weights_xignore=biasbn_xnorm=filter_yignore=biasbn_ynorm=filter.h5 is the direction file, which stores the directions.

model_300.t7_weights_xignore=biasbn_xnorm=filter_yignore=biasbn_ynorm=filter.h5_[-1.0,1.0,51]x[-1.0,1.0,51].h5 is the surface file, which contains the surface values with respect to those directions at that resolution.
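The naming pattern described above can be sketched as follows. This is only an illustration that reproduces the two file names quoted in this reply; the helper names `direction_file_name` and `surface_file_name` are hypothetical and are not taken from the repository's code, which may build the names differently.

```python
def direction_file_name(model_file, dir_type="weights",
                        xignore="biasbn", xnorm="filter",
                        yignore="biasbn", ynorm="filter"):
    """Build the direction-file name from the model file and direction flags.

    Hypothetical helper: it only mirrors the naming pattern quoted in the thread.
    """
    return (f"{model_file}_{dir_type}"
            f"_xignore={xignore}_xnorm={xnorm}"
            f"_yignore={yignore}_ynorm={ynorm}.h5")


def surface_file_name(dir_file, x=(-1.0, 1.0, 51), y=(-1.0, 1.0, 51)):
    """Append the grid range and resolution to the direction-file name."""
    def fmt(axis):
        lo, hi, num = axis
        return f"[{float(lo)},{float(hi)},{int(num)}]"
    return f"{dir_file}_{fmt(x)}x{fmt(y)}.h5"


dir_file = direction_file_name("model_300.t7")
surf_file = surface_file_name(dir_file)
print(dir_file)
print(surf_file)
```

Because the resolution is part of the surface file name, a run with a different `--x` or `--y` setting would be expected to produce a differently named surface file.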

We provide precomputed files, so if you want to generate your own result files, you can delete the provided ones or simply use a different resolution.

ouyangzhuzhu commented 5 years ago

Great, thanks @ljk628! Yes, after 4 hours I got the final h5 files just as you said. But I got an error at the end; can you take a look?

Evaluating rank 0 2600/2601 (100.0%) coord=[1. 1.] train_loss= 17.668 train_acc=8.31 time=5.66 sync=0.00
Rank 0 done! Total time: 14505.95 Sync: 2.20
Traceback (most recent call last):
  File "plot_surface.py", line 298, in <module>
    plot_2D.plot_2d_contour(surf_file, 'train_loss', args.vmin, args.vmax, args.vlevel, args.show)
  File "/home/l00221575/Downloads/loss-landscape/plot_2D.py", line 18, in plot_2d_contour
    f = h5py.File(surf_file, 'r')
  File "/home/l00221575/venv_openai-es/lib/python3.5/site-packages/h5py/_hl/files.py", line 394, in __init__
    swmr=swmr)
  File "/home/l00221575/venv_openai-es/lib/python3.5/site-packages/h5py/_hl/files.py", line 170, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 85, in h5py.h5f.open
OSError: Unable to open file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')

I also tried the command below to produce and customize a contour plot using the script plot_2D.py:

python plot_2D.py --surf_file path_to_surf_file --surf_name train_loss

That failed too:

(venv_openai-es) l00221575@F0817-S05:~/Downloads/loss-landscape$ python plot_2D.py --surf_file cifar10/trained_nets/resnet56_sgd_lr\=0.1_bs\=128_wd\=0.0005/ --surf_name train_loss
Traceback (most recent call last):
  File "plot_2D.py", line 205, in <module>
    plot_2d_contour(args.surf_file, args.surf_name, args.vmin, args.vmax, args.vlevel, args.show)
  File "plot_2D.py", line 18, in plot_2d_contour
    f = h5py.File(surf_file, 'r')
  File "/home/l00221575/venv_openai-es/lib/python3.5/site-packages/h5py/_hl/files.py", line 394, in __init__
    swmr=swmr)
  File "/home/l00221575/venv_openai-es/lib/python3.5/site-packages/h5py/_hl/files.py", line 170, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 85, in h5py.h5f.open
OSError: Unable to open file (file read failed: time = Thu Jan 10 02:44:57 2019 , filename = 'cifar10/trained_nets/resnet56_sgd_lr=0.1_bs=128_wd=0.0005/', file descriptor = 4, errno = 21, error message = 'Is a directory', buf = 0x7ffdce19ddb0, total read size = 8, bytes this sub-read = 8, bytes actually read = 18446744073709551615, offset = 0)
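The second traceback is a separate problem from the first: errno = 21, 'Is a directory', indicates that --surf_file was given the folder rather than a .h5 file inside it. A minimal sketch of a pre-check that catches this before the file is opened is shown here. The helper `resolve_surf_file` is hypothetical and not part of the repository; it uses only the Python standard library.

```python
import glob
import os


def resolve_surf_file(path):
    """Return path if it is a regular file; otherwise raise with a hint.

    Hypothetical helper for illustration; not part of loss-landscape.
    """
    if os.path.isfile(path):
        return path
    if os.path.isdir(path):
        # glob.escape guards against special characters such as '[' in the path.
        pattern = os.path.join(glob.escape(path), "*.h5")
        candidates = sorted(glob.glob(pattern))
        hint = "\n  ".join(candidates) if candidates else "(no .h5 files found)"
        raise IsADirectoryError(
            f"{path!r} is a directory; pass one of its .h5 files instead:\n  {hint}"
        )
    raise FileNotFoundError(f"No such file: {path!r}")
```

Calling it on the folder from the failing command would list the .h5 files in that folder, one of which (the surface file) is what --surf_file expects.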

ljk628 commented 5 years ago

This is the same as https://github.com/tomgoldstein/loss-landscape/issues/4, which can be temporarily worked around by downgrading h5py: pip install h5py==2.7.0.

ouyangzhuzhu commented 5 years ago

Great, thanks! It worked!

ouyangzhuzhu commented 5 years ago

Hi @ljk628, I got an error when I tried the ResNet-56-noshort (20 MB) model; the traceback is below. I removed "mpirun -n 4" because something may be wrong with my mpirun; without it, ResNet-56 (10 MB) still works, only more slowly. Please help, thanks!

(venv_openai-es) l00221575@F0817-S05:~/Downloads/loss-landscape$ python plot_surface.py --mpi --cuda --model resnet56 --x=-1:1:51 --y=-1:1:51 --model_file cifar10/trained_nets/resnet56_noshort_sgd_lr\=0.1_bs\=128_wd\=0.0005/model_300.t7
/home/l00221575/venv_openai-es/lib/python3.5/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters

[[57286,1],0]: A high-performance Open MPI point-to-point messaging module was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
Host: F0817-S05

Another transport will be used instead, although this may result in lower performance.

Rank 0 use GPU 0 of 8 GPUs on F0817-S05
Traceback (most recent call last):
  File "plot_surface.py", line 243, in <module>
    net = model_loader.load(args.dataset, args.model, args.model_file)
  File "/home/l00221575/Downloads/loss-landscape/model_loader.py", line 6, in load
    net = cifar10.model_loader.load(model_name, model_file, data_parallel)
  File "/home/l00221575/Downloads/loss-landscape/cifar10/model_loader.py", line 49, in load
    net.load_state_dict(stored['state_dict'])
  File "/home/l00221575/venv_openai-es/lib/python3.5/site-packages/torch/nn/modules/module.py", line 719, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for ResNet_cifar:
  Missing key(s) in state_dict: "layer2.0.shortcut.0.weight", "layer2.0.shortcut.1.running_var", "layer2.0.shortcut.1.bias", "layer2.0.shortcut.1.weight", "layer2.0.shortcut.1.running_mean", "layer3.0.shortcut.0.weight", "layer3.0.shortcut.1.running_var", "layer3.0.shortcut.1.bias", "layer3.0.shortcut.1.weight", "layer3.0.shortcut.1.running_mean".
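In PyTorch, "Missing key(s) in state_dict" lists parameters that the constructed model expects but the loaded checkpoint does not contain. All ten missing keys here belong to shortcut layers, which suggests (as an inference from the error message, not something confirmed in this thread) that the architecture selected by --model resnet56 has shortcut connections while the noshort checkpoint does not. A minimal sketch of how such a mismatch can be inspected follows; `diff_state_dict_keys` is a hypothetical helper, and the key lists are toy values rather than the real model's keys.

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Compare the keys a model expects with the keys a checkpoint provides.

    Returns (missing, unexpected):
      missing    - expected by the model, absent from the checkpoint
      unexpected - present in the checkpoint, not used by the model
    Hypothetical helper for illustration only.
    """
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


# Toy key sets. With real objects these would come from
# net.state_dict().keys() and stored['state_dict'].keys().
model_keys = ["conv1.weight", "layer2.0.shortcut.0.weight", "layer2.0.shortcut.1.bias"]
checkpoint_keys = ["conv1.weight"]

missing, unexpected = diff_state_dict_keys(model_keys, checkpoint_keys)
print("missing:", missing)
print("unexpected:", unexpected)
print("only shortcut keys missing:", all(".shortcut." in k for k in missing))
```

If only shortcut keys are missing, the likely remedy is to construct the model variant that matches the checkpoint; which value of --model that corresponds to is not stated in this thread.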