igolan / bgd

Implementation of Bayesian Gradient Descent
MIT License

Unable to reproduce figure 10 of the paper as loss diverges to NaN #2

Closed: smonsays closed this issue 4 years ago

smonsays commented 4 years ago

I am trying to reproduce the result reported in figure 10 of the paper, i.e., run the discrete permuted MNIST experiment with 10 tasks and 20 epochs per task. From the paper I inferred the following parameterization:

python main.py --logname discrete_permuted_mnist_10tasks_20epochs --nn_arch mnist_simple_net_200width_domainlearning_784input_10cls_1ds --test_freq 10 --seed 2019 --permute_seed 2019 --dataset ds_permuted_mnist --num_epochs $(( 20 * 10 )) --optimizer bgd --std_init 0.06 --batch_size 256 --results_dir 20epochs --train_mc_iters 10 --inference_mc --test_mc_iters 10

Unfortunately, the loss diverges to NaN for this configuration. Setting the seed to 2020 yields the same result.

Can you help me reproduce the results shown in figure 10?

igolan commented 4 years ago

Thank you for raising this issue. The configuration of that experiment follows Zenke et al. (Continual Learning Through Synaptic Intelligence), which uses two hidden layers of width 2000, not 200. For some reason, this architecture was missing from this repository, so I added it in commit 275cab5dfd653d45863ba86fd547dec8d3e4f272.
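For reference, here is a minimal sketch of what a two-hidden-layer, width-2000 MLP for permuted MNIST looks like. The class and attribute names below are hypothetical and not necessarily those used by the architecture added in that commit:

```python
import torch.nn as nn

# Hypothetical sketch of a 784 -> 2000 -> 2000 -> 10 MLP for permuted MNIST,
# matching the "two hidden layers of width 2000" description in the paper.
# The actual class added in commit 275cab5 may differ in naming and details.
class PermutedMnistMLP(nn.Module):
    def __init__(self, width=2000, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(784, width),
            nn.ReLU(),
            nn.Linear(width, width),
            nn.ReLU(),
            nn.Linear(width, num_classes),
        )

    def forward(self, x):
        # Flatten 28x28 images into 784-dimensional vectors.
        return self.net(x.view(x.size(0), -1))
```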

You can run the experiment by pulling the latest commit and running:

python main.py --logname discrete_permuted_mnist_10tasks_20epochs --nn_arch mnist_simple_net_2000width_domainlearning_784input_10cls_1ds --test_freq 10 --seed 2019 --permute_seed 2019 --dataset ds_permuted_mnist --num_epochs $(( 20 * 10 )) --optimizer bgd --std_init 0.015 --batch_size 256 --results_dir 20epochs --train_mc_iters 10 --inference_map

The command above differs from yours as follows: I changed nn_arch to the width-2000 architecture, changed std_init to match that architecture (0.015), and changed the inference method to MAP (--inference_map instead of Monte Carlo samples), which is faster.
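As a rough illustration of why MAP inference is faster (this is not the repository's actual implementation, and `sample_weights` below is a hypothetical callable): MAP inference uses a single forward pass with the mean weights, while Monte Carlo inference averages predictions over several weight samples drawn from the learned posterior, costing roughly test_mc_iters times as many forward passes.

```python
import torch

def predict_map(model, x):
    # One forward pass per batch using the mean (MAP) weights.
    with torch.no_grad():
        return model(x).softmax(dim=-1)

def predict_mc(model, x, sample_weights, num_samples=10):
    # Average predictions over weights sampled from the learned posterior.
    # `sample_weights` is a hypothetical callable that draws a new set of
    # weights into `model`; BGD's actual sampling API may differ.
    probs = 0.0
    with torch.no_grad():
        for _ in range(num_samples):
            sample_weights(model)
            probs = probs + model(x).softmax(dim=-1)
    return probs / num_samples
```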

smonsays commented 4 years ago

> The configuration of that experiment follows Zenke et al. (Continual Learning Through Synaptic Intelligence), which uses two hidden layers of width 2000, not 200.

Oh, I somehow missed that part, but it is clearly stated in your paper. With the changed architecture and the given hyperparameters, I was able to reproduce your result (even slightly better for that particular seed).

Thank you for your quick response!