gkdziugaite / pacbayes-opt

Optimizing PAC-Bayes bounds for Stochastic Neural Networks with Gaussian weights
Apache License 2.0

PAC-Bayes bound sometimes returns `nan` value #5

Open KaroliShp opened 2 years ago

KaroliShp commented 2 years ago

Reproduction following the README (I can provide the pickle files if needed). I encountered this several times while varying the number of `pacb_epochs`, for example with 350. I did not alter the code on the master branch in any way.

```
python3.5 snn/experiments/run_sgd.py fc --layers 600 --sgd_epochs 20 --binary
python3.5 snn/experiments/run_pacb.py fc --layers 600 --sgd_epochs 20 --pacb_epochs 30 --lr 0.01 --drop_lr 15 --lr_factor 0.1 --binary
```

Output:

```
...
PAC bound error: nan Gen bound : nan KL value:  143854.7812
```

Reason: the `jdown` and `jup` values are negative, even though they are supposed to be natural numbers:

```
After discretization
-1.0712925464970227 -16.0
-1.0762925464970228 -15.0
```

Then clearly `np.log(jdisc_down) = nan` and `np.log(jdisc_up) = nan`, which propagates `nan` into the PAC-Bayes bound.
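For completeness, a minimal numpy snippet reproducing the failure mode with the values printed above:

```python
import numpy as np

# The "After discretization" values above: these should be natural numbers,
# but they come out negative.
jdisc_down, jdisc_up = -16.0, -15.0

# np.log of a negative float returns nan (with a RuntimeWarning),
# so every bound term containing log(j) becomes nan.
print(np.log(jdisc_down), np.log(jdisc_up))  # nan nan
```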

EDIT: I believe this is still the same issue as in https://github.com/gkdziugaite/pacbayes-opt/issues/3

EDIT 2: To be clear, I have the exact same environment as required by the README.

KaroliShp commented 2 years ago

I will add more information to this comment as I debug this.


Final output of `optimize_PACB`:

```
Epoch:0030 cost=0.13071768 mean accuracy 0.9954 KL div:  767.3750 A term: 0.0464 B term: 0.0843 Bquad: 781.7835 log_prior_std: -1.0725 B PAC: 0.0267 factor1: 9.2103 factor2: -9.2103
```

As you can see, factor1 + factor2 ~= 0, which, if I understand correctly, means that the 2*log(j) term is ~= 0, implying that j ~= 1. However, once we enter `evaluate_SNN_accuracy`, the output from my original post shows that j = -15.

Note that factor2 = -9.2103 because 2*log(1e-2) = -9.2103. This comes from `tf.maximum(.., 1e-2)`. Looking back at issue #3, this value was hardcoded to avoid nan values during training. It now makes sense why we observe negative j, which confirms that these bugs are connected and that issue #3 was not actually fixed.
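To illustrate what the clamp does (the argument values below are made up; I am not asserting what the exact expression inside the `tf.maximum` call is): once the true argument drops below 1e-2, the term saturates at exactly 2*log(1e-2) = -9.2103, i.e. the factor2 value above, no matter how small or even negative the true argument has become.

```python
import numpy as np

# Hypothetical arguments to the clamp; the real one involves lambda and c.
for true_arg in [0.5, 1e-2, 1e-3, -0.3]:
    clamped = np.maximum(true_arg, 1e-2)   # what tf.maximum(.., 1e-2) does
    print(true_arg, 2 * np.log(clamped))   # saturates at -9.2103 once clamped
```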


It looks to me like the root problem is that the constrained optimization has been turned into an unconstrained one. When optimizing over 1/2*log(lambda) directly, lambda can become greater than c (and it does), so log(c/lambda) goes negative, the discretized j goes negative with it, and the 2*log(j) term becomes undefined. This is not solved by simply applying tf.maximum, as is done in the latest commit.


An alternative fix, instead of tf.maximum, is to optimize over a logistic/logit reparameterization, which addresses the actual problem of incorrectly turning the constrained optimization into an unconstrained one. I can provide details later; a rough sketch follows.
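A minimal numpy sketch of what I have in mind (the function names and the value of c below are mine, chosen only for illustration): optimize an unconstrained variable rho and map it through the logistic function so that lambda = c * sigmoid(rho) is guaranteed to stay inside (0, c); the logit function gives the inverse map, e.g. for initializing rho from a chosen starting lambda.

```python
import numpy as np

def sigmoid(rho):
    return 1.0 / (1.0 + np.exp(-rho))

def logit(p):
    return np.log(p / (1.0 - p))

c = 0.1  # upper bound on lambda from the constrained problem (made-up value)

def lambda_of_rho(rho):
    # Forward map: unconstrained rho -> lambda inside (0, c).
    return c * sigmoid(rho)

def rho_of_lambda(lam):
    # Inverse map, e.g. to initialize rho from a chosen starting lambda.
    return logit(lam / c)

# No matter how far rho drifts during optimization, lambda never exceeds c,
# so log(c / lambda) stays >= 0 and the discretized j cannot go negative.
for rho in [-10.0, 0.0, 10.0, 100.0]:
    lam = lambda_of_rho(rho)
    print(rho, lam, np.log(c / lam))
```

In the TensorFlow code this would mean making rho the trainable variable instead of 1/2*log(lambda) and computing everything downstream from lambda = c * sigmoid(rho); the clamp then becomes unnecessary.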