ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0

CLI loss\accuracy output is displayed incorrectly after a training resume #328

Closed carlogrisetti closed 4 years ago

carlogrisetti commented 5 years ago

Describe the bug
While trying to build an image classification model with a two-category output, TensorFlow (through TensorBoard) reports a gradually decreasing loss (too good to be true, to be honest) and an accuracy approaching 1 over time; however, within the Ludwig CLI the loss values are all over the place. I'm talking about <0.01 loss in TensorFlow and >300 loss in the CLI. This does not happen with a stacked_cnn encoder of any size/shape. The input images are 300x300x1 (grayscale). I know that's not an optimal input, but that's what I'm working with.

To Reproduce
I don't know the specifics of how to repro the issue, since I'm not able to get a "working" example. It only happens with the resnet encoder; I tried the simplest one, of size 8, with all the default values. Here's the YAML:

    input_features:
        -
            name: ImageFile
            type: image
            encoder: resnet
            resnet_size: 8
            num_filters: 16
            kernel_size: 3
            conv_stride: 1

    output_features:
        -
            name: Type
            type: category

    training:
        batch_size: 64
        learning_rate: 0.0001
        epochs: 100
        early_stop: 30
        decay: true
        validation_measure: loss
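For reference, a definition like the one above can also be launched through Ludwig's Python API (a minimal sketch for the 0.1.x-era API used in this thread; the file names are placeholders, check the API docs for exact signatures):

```python
import yaml
from ludwig.api import LudwigModel

# Load the model definition shown above (file name is a placeholder)
with open('model_definition.yaml') as f:
    model_definition = yaml.safe_load(f)

# Train on a CSV that has the ImageFile (image path) and Type columns;
# the CSV path is a placeholder as well
model = LudwigModel(model_definition)
train_stats = model.train(data_csv='dataset.csv')
```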

Expected behavior
CLI values and TensorFlow log values should be at least similar (I understand there are some aggregations performed, but they move in two completely different directions).

Screenshots
[screenshots: orange is batch size 32, blue is batch size 64; loss in the CLI goes 8, 15, 20, 24, etc., while in the graph it is <1]
[screenshot: the only model save I get is the first one; after that the loss value goes away and the model never improves]

Environment (please complete the following information):

carlogrisetti commented 5 years ago

Oh, I forgot to mention I also tried with ResNet size 14, but had the same exact "exploding" result, with no correlation in the TensorFlow logs. The really weird thing is that the loss/accuracy values with ResNet 14 exactly match the ResNet 8 run.
[screenshot: orange is resnet-8, blue is resnet-14 (not visible because it is exactly superimposed); they're both non-real]

w4nderlust commented 5 years ago

Thanks @carlogrisetti, will look into it. Does the same happen if you train on a small subset of the data? If so, could you share that subset? Alternatively, are you able to get the same behavior with some publicly available dataset?

carlogrisetti commented 5 years ago

I'd say it happens even with a subset. I was testing this some days ago with 100 or so images, and have now bumped it up to 1000+ (just augmented data from the original 100). For now that's all I have available.

I will do some more tests and check options for sharing the dataset.

carlogrisetti commented 5 years ago

Replicated with a simpler 4-category classifier (dog, cat, horse, cow): 10 samples of each, 7 for training, 2 for validation, 1 for test, explicitly set in the .csv using the split column.

Attached you will find the sample dataset (all 300x300 grayscale images) and the YAML used, together with the TensorFlow logs up to 200 epochs.

CLI loss starts around 30-50 and jumps well over 200 after 100 or so epochs (as expected), but TensorBoard reports a steady decrease in loss (around 0.5 at 100 epochs) and an accuracy of 0.92 at 100 epochs, which reaches 1 at 112 epochs.

Attachments: Ludwig_CLI_100epochs, Ludwig_CLI_200epochs, Tensorboard_NoSmoothing_200epochs, model_definition.zip, dataset.zip, events.out.zip

carlogrisetti commented 5 years ago

PS: the same exact behavior happens in a tensorflow-gpu environment and in a non-GPU-enabled one (of course I couldn't let it go up to 200 epochs, but the first 5 epochs show the same disparity).


Ludwig is 0.1.2 from pip (not the latest repo commits, but there are none relevant to this); tensorboard, tensorflow, and tensorflow-gpu are 1.13.1.

If you need further tests, just ask; I will look into it if I have time. If you want, you can point me in the right direction to where the CLI output and the TensorFlow output are calculated.

w4nderlust commented 5 years ago

This is of great help, thank you so much! I confirm I can reproduce, will look into it and figure out the issue much more easily. Stay tuned.

w4nderlust commented 5 years ago

So I solved the issue, but it was kinda tricky, so let me give you a bit of additional explanation so that other people with similar issues will also know a bit better what to do.

The problem was twofold. On one hand, the batch norm layer contained in the ResNet had additional update operations to update the running statistics of the batches, and those were not called together with the optimization operations. The trickiness of using batch norm is well documented here. On the other hand, the default parameters of the running averages of the batch norm in TensorFlow's implementation are 0.997 for the momentum and 0.0001 for the epsilon. With those parameters, the running averages get updated really slowly, so there is a big discrepancy between the loss and predictions you get when you use the statistics from the batch (training time) and when you use the running averages (prediction time). This means that even after adding the operations to update those averages, they are updated so slowly that even after hundreds of batches there is still a huge discrepancy between the batch train loss (displayed in TensorBoard) and the evaluation loss (displayed in the tables that Ludwig prints), as you can see here:
[screenshots: Screenshot from 2019-05-10 11-54-15, Screenshot from 2019-05-10 11-53-40]
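For anyone hitting the same thing elsewhere, the first half of the problem is the well-known TF 1.x batch norm pitfall: the ops that update the running mean/variance live in the UPDATE_OPS collection and only run if the train op explicitly depends on them. A minimal standalone sketch of the pattern (toy model for illustration, not Ludwig's actual code):

```python
import tensorflow as tf  # TF 1.x, matching the 1.13.1 used in this thread

inputs = tf.placeholder(tf.float32, [None, 300, 300, 1])
labels = tf.placeholder(tf.int32, [None])
is_training = tf.placeholder(tf.bool)

# Toy conv -> batch norm -> classifier head
x = tf.layers.conv2d(inputs, filters=16, kernel_size=3)
x = tf.layers.batch_normalization(x, training=is_training)
logits = tf.layers.dense(tf.layers.flatten(tf.nn.relu(x)), 2)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

# Without this control dependency the moving statistics are never updated,
# so training-time and prediction-time behavior drift apart
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)
```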

The solution in this case was to decrease the momentum and increase the epsilon, to 0.9 and 0.001 respectively, as suggested here and here. This made it so that after around 100 batches the loss and accuracy you get from TensorBoard and from the tables are pretty close, as you can see here:
[screenshots: Screenshot from 2019-05-10 11-58-49, Screenshot from 2019-05-10 11-59-20]

The values will never be exactly the same because of the nature of batch norm, but I guess these default values are a satisfactory solution. Moreover, I added both batch_norm_momentum and batch_norm_epsilon as parameters to the ResNet encoder, so that, given the size of your dataset and your batch_size, you can change those parameters at will; they are documented here.
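As a quick reference, a sketch of how those two knobs would look in a feature definition, shown here in the Python-dict form of the config (parameter names as stated above; the values are the new defaults mentioned, to be adjusted for your dataset and batch_size):

```python
# Hypothetical feature definition mirroring the YAML earlier in this thread,
# with the two new batch norm parameters set explicitly
input_feature = {
    'name': 'ImageFile',
    'type': 'image',
    'encoder': 'resnet',
    'resnet_size': 8,
    'batch_norm_momentum': 0.9,   # lower momentum -> running stats catch up faster
    'batch_norm_epsilon': 0.001,  # larger epsilon -> more numerical slack
}
```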

Please try using the latest code and let me know if this solves your problem.

carlogrisetti commented 5 years ago

The TensorBoard logs and the Ludwig CLI output are now coherent with each other. However, I still do not get correct results when using the model to infer on the same validation data (on which both the CLI and TensorBoard now assert 1.0 accuracy). Having 1.0 accuracy on validation should mean that every image in the validation data gets an exact prediction. I'm not looking at the prediction confidence (which can be <1.0), but the best candidate within the results should be exactly the right one 100% of the time. The results I'm getting (both with a resnet-8 and a resnet-34) are, in some cases, wrong with a 1.0 confidence.

Thinking it through, it seems like there's a bug in how the training does the validation, thus converging to weight values that are not the ones needed for a correct prediction. I will try to repro with the same dataset I sent earlier in this thread and report back.

PS: I switched to having 224x224 images in my project

carlogrisetti commented 5 years ago

Ok, never mind. I am able to run the prediction with Ludwig's CLI, but for some strange reason I'm getting different results via the API. This is an entirely different issue, so the one in this thread can stay closed, as its behavior has been fixed.

I will open a different one if I can repro this issue.

carlogrisetti commented 5 years ago

Found the issue: I manipulated some images without re-grayscaling them, and cv2 saved them as RGB... never mind, it all works like a charm now.

w4nderlust commented 5 years ago

Oh I see, glad it worked!

carlogrisetti commented 4 years ago

This might have been reintroduced, but only when resuming a training. It all works well until I stop and resume a training; then the training-set loss jumps to 0.9998 or something like that, the accuracy shows the same near-1 value, and hits_at_k drops. This is once more misaligned with the TensorBoard data, which continues the trend from the previous training and does not reflect the CLI values. Validation measures seem to be correct, but the model does not get saved any further because of the higher combined loss.

I'll do some more tests on this, but it really looks like the same issue in different clothes.

w4nderlust commented 4 years ago

Will try to look into it, but the most reasonable explanation is that the buffers in the batch norm are not saved by TensorFlow. This means that if you resume training they will need some time to converge back to the values they had before.
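If that hypothesis is right, it would come down to the TF 1.x detail that the batch norm moving statistics are non-trainable variables, so a saver restricted to trainable variables drops them across a resume. A small standalone sketch of the distinction (not a claim about Ludwig's actual saver code):

```python
import tensorflow as tf  # TF 1.x

x = tf.placeholder(tf.float32, [None, 8])
tf.layers.batch_normalization(x, training=True)

# Only gamma and beta are trainable; moving_mean / moving_variance are not,
# so this saver would silently lose the running statistics on save/restore
partial_saver = tf.train.Saver(var_list=tf.trainable_variables())

# Saving all global variables keeps the running statistics across a resume
full_saver = tf.train.Saver(var_list=tf.global_variables())

print([v.name for v in tf.trainable_variables()])  # gamma, beta
print([v.name for v in tf.global_variables()])     # + moving_mean, moving_variance
```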

carlogrisetti commented 4 years ago

Well... That might be, but an accuracy of 1 with a loss of 1 seems a bit unrealistic... Don't waste time on this unless I'm able to consistently repro. This was just a heads-up in case you had the "a-ha" moment of knowing exactly what to look for :)

carlogrisetti commented 4 years ago

I can repro. Use the same exact dataset, with this model definition: model_definition.zip. I ran the training up to around 150 epochs, then stopped it and resumed the training.
[screenshot]
Training loss and accuracy jumped to 1 straight away in the first epoch after resuming, and even 50+ epochs later they didn't adjust further, although the training still continues as it should (blue is the pre-resume training, orange is a copied model folder on which I did the resume).
[screenshot]

It happened to me both with resnet-8 and with stacked_cnn; I have reproduced it only with stacked_cnn for now.

carlogrisetti commented 4 years ago

I can confirm this happens on larger datasets too. It just happened (only when resuming) on an image classification task with a 7.5 GB .hdf5 dataset of 50k+ images (the structure is basically the same as the one provided in the sample, with 10 output categories).

Can you reopen this issue? It should be easily reproducible with the provided sample.

carlogrisetti commented 4 years ago

And this seems to happen with non-image inputs too, always and only when resuming. If I start a new training, it all goes well, way beyond the training step at which the last training was stopped.

w4nderlust commented 4 years ago

Thanks for testing this thoroughly. Let's reopen the issue and rename it. We'll try to figure out the reason for this; my intuition tells me it's about the running averages of batch norm, but we will need to investigate further.

carlogrisetti commented 4 years ago

I don't know about that... and I can't pinpoint what the issue behind it is. The first epoch that is run after resuming ends with a loss of 1.000 and an accuracy of 1.000, as if some default values are applied and the "to-be-resumed" ones are forgotten within the resumed model (JSON files maybe?). The validation is not immune either and gets values that look really "deterministic", like 0.5000 and 0.7500 (and you know... that many round numbers are too improbable to occur every time).

Most probably I will have time to try and debug this, since it's my main issue right now. If you want me to test something, just ask.

carlogrisetti commented 4 years ago

Today I tried resuming a training, just to see if something changes under different conditions (it's a different dataset than the ones on which I've been having this issue). This time a really weird thing happened:
[screenshot]
Although within the CLI the reported validation loss increased, the training loop found it to be lower than in the previous iterations and saved the model. This might suggest that the issue in the CLI is just related to the "display" of those values and not to the real calculation behind them, which decides whether the model gets saved or not.

Looking at the validation loss through TensorBoard (using my PR that plots it alongside the training loss) I could actually see that there indeed was an improvement in validation loss, and that the "behind-the-scenes" part of the training loop code is working correctly even if the displayed values are wrong. That might be why it went sort of unnoticed.
[screenshot]

It seems the issue is then confined to the CLI output itself, nothing else.

w4nderlust commented 4 years ago

Interesting. I'd like to dig deeper, but it's difficult without a reproducible example. I checked the code that prints the table during training and can't find anything wrong with it. If you can find a way to reproduce the issue I'll look into it.

carlogrisetti commented 4 years ago

I can consistently reproduce this issue. I'm now resuming a training on an image classifier, and since resuming, all the graphs output in the statistics are correct.

This is an entirely new dataset of 224x224 RGB images, and here's the YAML I'm using: model_definition.zip

Ludwig is fresh from master, training on a GPU-enabled machine. The strangest thing of all is that, now that I have all the various epoch graphs output, I can confirm that the "under-the-hood" value is being calculated correctly (all the output graphs are correct and the model save happens correctly when an improvement is found), and it's only the graphical CLI output in the tracker that is messed up.

I'll try to at least find where this is happening. It's purely graphical at this point and it does not impede any training functionality.

carlogrisetti commented 4 years ago

At the end of the resumed training (manually stopped with CTRL+C) the best validation loss on combined is weirdly 0.
[screenshot]

Looking more in depth at the graphs, something funky can be seen...
[screenshot]
There are some gaps (might be a 0 represented as a gap by TensorBoard) in the output data... This happens only after resuming (and not before). This is the combined/epoch_validation_loss, which should be computed with the same exact data as the CLI tracker.

carlogrisetti commented 4 years ago

It seems that the actual loss value for my category output feature is being output in the "hits_at_k" field in the table. But I can't really understand why...

Edit: confirming that, since the first epoch after the resume, the loss values go straight into the hits_at_k table cell.
[screenshots]

carlogrisetti commented 4 years ago

Ok, found the issue... in a new training, progress_tracker.vali_stats (for example) is an OrderedDict with the loss, accuracy, and hits_at_k values in this specific order:

    OrderedDict([('TileType', OrderedDict([('loss', [193.31485133495147]), ('accuracy', [0.5048543689320388]), ('hits_at_k', [1.0])])), ('TileStatus', OrderedDict([('loss', [201.5639761026623]), ('accuracy', [0.3786407766990291]), ('hits_at_k', [1.0])])), ('combined', {'loss': [394.87881854899877], 'accuracy': [0.34951456310679613]})])

When resuming a training, however, progress_tracker.vali_stats is not an OrderedDict anymore, and the values come back in alphabetical order: accuracy, hits_at_k, loss.

    {'TileStatus': {'accuracy': [0.3786407766990291, 0.3786407766990291], 'hits_at_k': [1.0, 0.5339805825242718], 'loss': [201.5639761026623, 51.81677342387079]}, 'TileType': {'accuracy': [0.5048543689320388, 0.5048543689320388], 'hits_at_k': [1.0, 1.0], 'loss': [193.31485133495147, 42.41566467285156]}, 'combined': {'accuracy': [0.34951456310679613, 0.34951456310679613], 'loss': [394.87881854899877, 94.23243520792248]}}

This screws up the tracker table display, since it simply iterates with for stat in stats[field_name]: and appends the scores to the table with tables[field_name].append(scores), so they end up in the wrong order. In the previous comment you can see that the loss column displays the accuracy value, the accuracy column displays the hits_at_k value, and the hits_at_k column displays the loss value.

Finally :)

The issue should be in the way a training resume loads data in an order-agnostic way, not honoring the original order. Will keep you posted on this...
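A minimal standalone sketch of that ordering pitfall, and of an order-proof way to fill a table row (illustrative names, not Ludwig's actual code):

```python
import json
from collections import OrderedDict

# Stats as saved during the first run, in insertion order: loss, accuracy, hits_at_k
saved = json.dumps(OrderedDict(
    [('loss', [193.31]), ('accuracy', [0.50]), ('hits_at_k', [1.0])]))

# On resume, loading into a plain dict (on older Pythons, or if keys get
# re-sorted along the way) no longer guarantees that order, so purely
# positional iteration fills the wrong table columns
resumed = json.loads(saved)

# One fix: preserve the original key order on load
resumed_ordered = json.loads(saved, object_pairs_hook=OrderedDict)

# More robust fix: index the metrics by name instead of by position
headers = ['loss', 'accuracy', 'hits_at_k']
row = [resumed[metric][-1] for metric in headers]
print(row)  # [193.31, 0.5, 1.0] regardless of the dict's key order
```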

carlogrisetti commented 4 years ago

This is more evident if you have only one output feature, since then you don't have a combined tracker (which is generated correctly, because it's manually built with loss + accuracy, always in this order). The issue only presents itself in the actual output features, and it's still visible when you have multiple output features: you just need to look at the output features themselves and not at the combined tracker.

Finally found and fixed this ugly one... pushing a PR now

w4nderlust commented 4 years ago

great catch!!!