train_loss is not found

ilkarman commented 4 years ago

I run the plot_surface code like so:

    /usr/bin/python -u /local/mnt/workspace/ikarmano/Gitlab/sagd/loss-landscape/plot_surface.py --cuda \
    --x=-1:1:51 --y=-1:1:51 --model_file models/32_32_32_32_32_32_32_32_32_32_32_32_32_32_32cnn.t \
    --dir_type weights --xnorm filter --xignore biasbn --ynorm filter --yignore biasbn --plot

And it seems to calculate the loss fine:

Evaluating rank 2  90/2601  (3.5%)  coord=[ 0.56 -0.96]     train_loss= 21.470  train_acc=14.54     time=5.28   sync=0.00
Evaluating rank 2  91/2601  (3.5%)  coord=[ 0.6  -0.96]     train_loss= 22.225  train_acc=14.10     time=5.65   sync=0.00
Evaluating rank 2  92/2601  (3.5%)  coord=[ 0.64 -0.96]     train_loss= 23.044  train_acc=13.67     time=5.92   sync=0.00
Evaluating rank 2  93/2601  (3.6%)  coord=[ 0.68 -0.96]     train_loss= 23.935  train_acc=13.33     time=5.71   sync=0.00
Evaluating rank 2  94/2601  (3.6%)  coord=[ 0.72 -0.96]     train_loss= 24.905  train_acc=13.02     time=5.65   sync=0.00
Evaluating rank 2  95/2601  (3.7%)  coord=[ 0.76 -0.96]     train_loss= 25.958  train_acc=12.66     time=5.50   sync=0.00
Evaluating rank 2  96/2601  (3.7%)  coord=[ 0.8  -0.96]     train_loss= 27.100  train_acc=12.37     time=5.99   sync=0.00
Evaluating rank 2  97/2601  (3.7%)  coord=[ 0.84 -0.96]     train_loss= 28.334  train_acc=12.13     time=5.85   sync=0.00
Evaluating rank 2  98/2601  (3.8%)  coord=[ 0.88 -0.96]     train_loss= 29.666  train_acc=11.91     time=5.71   sync=0.00
Evaluating rank 2  99/2601  (3.8%)  coord=[ 0.92 -0.96]     train_loss= 31.101  train_acc=11.69     time=5.58   sync=0.00

However, the plot functions do not work because 'train_loss' is not found:

train_loss is not found in ../models/32_32_32_32_32_32_32_32_32_32_32_32_32_32_32cnn.t_weights_xignore=biasbn_xnorm=filter_yignore=biasbn_ynorm=filter.h5_[-1.0,1.0,51]x[-1.0,1.0,51].h5

And if I print the keys(), it's just:

<KeysViewHDF5 ['dir_file', 'xcoordinates', 'ycoordinates']>
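
For reference, the check that fails here can be sketched in a few lines. This is only an illustration, not the repository's plotting code: `missing_keys` is a hypothetical helper, and treating `train_acc` as a required dataset name is an assumption based on the log output above.

```python
def missing_keys(available, required=("train_loss", "train_acc")):
    """Return the required dataset names that are absent from `available`.

    `available` is any iterable of names, e.g. the keys of an open HDF5 file.
    The default `required` names are an assumption taken from the log output.
    """
    present = set(available)
    return [name for name in required if name not in present]


# Keys reported for the surface file in this issue.
keys_in_file = ["dir_file", "xcoordinates", "ycoordinates"]
print(missing_keys(keys_in_file))  # ['train_loss', 'train_acc']
```

With h5py installed, the same helper could be fed `f.keys()` from a file opened as `h5py.File(path, "r")`. A non-empty result means the loss values were never written, rather than the plotting code misreading them.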

Not sure what I'm doing wrong?

ljk628 commented 4 years ago

Hi @ilkarman, by default our code has the rank 0 process save the surface values, after collecting the values calculated by all processes, as you can see in https://github.com/tomgoldstein/loss-landscape/blob/master/plot_surface.py#L88 and https://github.com/tomgoldstein/loss-landscape/blob/master/plot_surface.py#L136.

It seems that you are not using MPI, yet your process rank is 2, which might be why the surface values are not saved to the h5 file. An easy fix would be to change the default rank value to 2, or to figure out why it is not zero.
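
The effect of the rank check can be illustrated with a minimal sketch. This is a simplified stand-in, not the actual `crunch()` implementation: `crunch_sketch` and `writer_rank` are hypothetical names, and a plain dict plays the role of the h5 file.

```python
def crunch_sketch(rank, writer_rank=0):
    """Hypothetical stand-in for an evaluate-then-save loop.

    Every process computes losses, but only the process whose rank equals
    `writer_rank` stores them. A dict stands in for the h5 file here.
    """
    losses = [21.470, 22.225, 23.044]  # first values from the log above
    saved = {}
    if rank == writer_rank:
        saved["train_loss"] = losses
    return saved


print(sorted(crunch_sketch(rank=0)))  # ['train_loss']
print(sorted(crunch_sketch(rank=2)))  # []
```

A single process running with rank 2 computes every value and prints progress as usual, but never reaches the save step, which matches the symptom of an h5 file that holds only the coordinates.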

ilkarman commented 4 years ago

Thank you! One of the params for crunch() was being overwritten instead of rank.

tomgoldstein / loss-landscape

train_loss is not found #26