keras-team / keras


BatchNormalization: validation loss >> train loss, same data #12851

Closed: OverLordGoldDragon closed this issue 5 years ago

OverLordGoldDragon commented 5 years ago

I implemented this paper's neural net for EEG classification, with some differences (img below); train_on_batch performance is excellent, with very low loss, but test_on_batch performance, even on the same data, is poor: the net seems to predict '1' most of the time:

| Class | Train (loss, acc) | Val (loss, acc) |
|-------|-------------------|-----------------|
| '0'   | (0.06269842, 1)   | (3.7652588, 0)  |
| '1'   | (0.04473557, 1)   | (0.3251827, 1)  |

Data is fed as 30-second segments (12000 timesteps each; 10 minutes per dataset) from 32 datasets at once, i.e. batch_size = 32 (img below).

Any remedy?
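
For concreteness, here is a minimal, self-contained sketch of the feeding scheme described above; the channel count, the stand-in model, and the placeholder data are assumptions for illustration, not taken from the actual code:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Conv1D, GlobalAveragePooling1D, Dense

# Assumed shapes (not from the original post): one EEG channel,
# 32 datasets fed in parallel, one 30-sec segment (12000 steps) per call.
batch_size, timesteps, n_channels = 32, 12000, 1
segments_per_dataset = 20  # 10 min per dataset / 30 sec per segment

# Stand-in model; the real architecture follows the referenced paper.
model = Sequential([
    Conv1D(16, 8, strides=4, activation='relu',
           input_shape=(timesteps, n_channels)),
    GlobalAveragePooling1D(),
    Dense(1, activation='sigmoid'),
])
model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])

for seg in range(segments_per_dataset):
    # Placeholder random data; in the issue this is one 30-sec segment
    # from each of the 32 EEG datasets.
    x_batch = np.random.randn(batch_size, timesteps, n_channels)
    y_batch = np.random.randint(0, 2, size=(batch_size, 1))
    train_loss, train_acc = model.train_on_batch(x_batch, y_batch)
    # The reported problem: test_on_batch on the very same batch
    # returns much worse loss/accuracy.
    val_loss, val_acc = model.test_on_batch(x_batch, y_batch)
```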


Troubleshooting attempted:

  1. Disabling dropout
  2. Disabling all regularizers (except batch-norm)
  3. Occasionally, val_acc for ('0', '1') flips to (~.90, ~.12), then reverts to (0, 1)

Additional details:


UPDATE: Progress was made; batch_normalization and dropout are the major culprits. Major changes:

Considerable improvements were observed, but far from complete. Train vs. validation loss behavior is truly bizarre: the net flips class predictions and fails badly on the exact same datasets it had just trained on:

Also, BatchNormalization outputs during train vs. test time differ considerably (img below)
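
For anyone wanting to reproduce this comparison, here is a sketch of one way to fetch a BatchNormalization layer's output in training vs. inference mode via the Keras backend; it assumes the TF1-style backend Keras used at the time of this issue, and the toy model is only a placeholder, not the issue's architecture:

```python
import numpy as np
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense, BatchNormalization

# Toy model just to illustrate the comparison (placeholder, not the real net).
model = Sequential([
    Dense(16, input_shape=(8,)),
    BatchNormalization(),
    Dense(1, activation='sigmoid'),
])
model.compile('adam', 'binary_crossentropy')

x_batch = np.random.randn(32, 8)
bn_output = model.layers[1].output  # the BatchNormalization layer

# K.learning_phase() toggles BN between training mode (1: batch statistics)
# and inference mode (0: moving mean/variance).
get_bn = K.function([model.input, K.learning_phase()], [bn_output])
out_train = get_bn([x_batch, 1])[0]
out_test = get_bn([x_batch, 0])[0]
print(np.abs(out_train - out_test).mean())  # large gap = train/test BN mismatch
```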


UPDATE 2: All other suspicions were ruled out: BatchNormalization is the culprit. Using Self-Normalizing Networks (SNNs) with SELU & AlphaDropout in place of BatchNormalization yields stable and consistent results.
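
As a rough sketch of that substitution (layer sizes, the dense-only stack, and the dropout rate are placeholders; the same swap applies to conv blocks): each BatchNormalization + Dropout pair is replaced by a 'selu' activation with 'lecun_normal' initialization plus AlphaDropout, which is the standard SNN recipe:

```python
from keras.models import Sequential
from keras.layers import Dense, AlphaDropout

n_features = 128  # placeholder input size

# SNN-style stack: SELU + lecun_normal init + AlphaDropout,
# with no BatchNormalization layers.
snn = Sequential([
    Dense(64, activation='selu', kernel_initializer='lecun_normal',
          input_shape=(n_features,)),
    AlphaDropout(0.1),
    Dense(64, activation='selu', kernel_initializer='lecun_normal'),
    AlphaDropout(0.1),
    Dense(1, activation='sigmoid'),
])
snn.compile('adam', 'binary_crossentropy', metrics=['accuracy'])
```

Note that SELU's self-normalizing property assumes roughly zero-mean, unit-variance inputs, which ties in with the standardization issue found further down the thread.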


OverLordGoldDragon commented 5 years ago

BatchNormalization outputs with a more dramatic val vs. train difference:

One layer zoomed:

OverLordGoldDragon commented 5 years ago

Probably resolved: I had inadvertently left one non-standardized sample (in a batch of 32) with sigma = 52, which severely disrupted the BN layers. After standardizing it, I no longer observe a strong discrepancy between train and inference modes; if anything, any differences are difficult to spot.
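
For completeness, a sketch of the kind of per-sample check and standardization that would catch an outlier like the sigma = 52 sample; the threshold, axes, and function name are assumptions, not from the actual preprocessing code:

```python
import numpy as np

def standardize_batch(x_batch, tol=5.0):
    """Z-score each sample and flag any whose raw std is far from 1."""
    # x_batch: (batch_size, timesteps, channels); statistics are per sample.
    mean = x_batch.mean(axis=(1, 2), keepdims=True)
    std = x_batch.std(axis=(1, 2), keepdims=True)
    outliers = np.where(std.squeeze() > tol)[0]
    if outliers.size:
        print(f"Non-standardized samples (std > {tol}): {outliers.tolist()}")
    return (x_batch - mean) / (std + 1e-8)
```

A single sample with std around 52 dominates the batch statistics BatchNormalization uses at train time and skews its moving averages differently, consistent with the train vs. inference discrepancy described above.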