lincc-frameworks / fibad

FIBAD - Framework for Image-Based Anomaly Detection
MIT License

Training isn't doing what we expect #101

Closed: drewoldag closed this issue 32 minutes ago

drewoldag commented 3 hours ago

Max reported at the KBMOD meeting yesterday that, while training the CNN for filtering KBMOD true/false results, training for <=3 epochs produced a model that always predicted "true", while training for >=4 epochs produced a model that always predicted "false".

I started looking into it using the built-in example CNN model and the CIFAR datasets. I trained a model on the standard CIFAR-10 dataset for 10 epochs, and the resulting model almost always predicts class 3 (of 10 possible classes). I used the CIFAR-10 test set, which has 1000 images for each class, and the prediction distribution looks like this:

Prediction classes: Counter({3: 9988, 5: 12})
Known classes: Counter({3: 1000, 8: 1000, 0: 1000, 6: 1000, 1: 1000, 9: 1000, 5: 1000, 7: 1000, 4: 1000, 2: 1000})
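For reference, the distributions above come from `collections.Counter`; the `predictions` list below is a stand-in for the real model outputs (in the actual run they would be the argmax of the model's logits over the test set):

```python
from collections import Counter

# Stand-in for the trained model's predicted labels on the CIFAR-10 test set.
predictions = [3] * 9988 + [5] * 12

# The known labels: 1000 images for each of the 10 classes.
known = [c for c in range(10) for _ in range(1000)]

print("Prediction classes:", Counter(predictions))
print("Known classes:", Counter(known))
```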

There is clearly something wrong with how training is working.

drewoldag commented 3 hours ago

Just repeated my experiment, running for 50 epochs, and got very similar results. So my hunch is that we're either:

drewoldag commented 1 hour ago

It seems like the solution is to explicitly call model.train(), or for our built-in models, self.train() in the training_step method.

My hunch is that the models might have been in training mode, but were only adjusting the weights over one epoch.
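A minimal sketch of the fix, assuming a PyTorch-style model (the `training_step` name follows the issue; the toy model and loop are illustrative, not fibad's actual code). `nn.Module` tracks a `training` flag that layers like dropout and batch norm consult, and anything that calls `.eval()` (e.g. an evaluation pass between epochs) flips it off for the whole module, so the step method should restore training mode explicitly:

```python
import torch
from torch import nn


class ExampleCNN(nn.Module):
    """Toy stand-in for fibad's built-in example CNN."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(32 * 32 * 3, 10)
        self.dropout = nn.Dropout(0.5)  # only active when self.training is True

    def forward(self, x):
        return self.fc(self.dropout(x.flatten(1)))

    def training_step(self, batch, optimizer, loss_fn):
        # The fix: make sure the module is in training mode before each step,
        # since an earlier .eval() call would otherwise stick around.
        self.train()
        x, y = batch
        optimizer.zero_grad()
        loss = loss_fn(self(x), y)
        loss.backward()
        optimizer.step()
        return loss.item()
```

The flip side is to call `model.eval()` (ideally inside `torch.no_grad()`) before prediction, so dropout and batch norm switch to their inference behavior.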

One noticeable difference is that the epochs take a little longer now. Before, using Mac Metal and a batch size of 32, an epoch ran in about 13 seconds; now it's closer to 20, presumably because the model is doing more under the hood in training mode.