Open: Rabia-Metis opened this issue 6 years ago
Have you tried a simpler model and seen if it is able to learn on your dataset? Are the train and validation images from the same dataset?
Yes, I have tried ResNet-152 and SENet and achieved average accuracy above 80%. Yes, both are from the same dataset.
Interesting. What type of preprocessing are you using? Mean/std or -1 to 1? Also, are you using the default regularization value? Try setting it to 0.
I tried both with preprocessing (subtracting the mean and dividing by the standard deviation) and without. I also tried setting weight_decay=0 as you suggested, but it's not making any difference:
```python
model = NASNet(classes=2, input_shape=(224, 224, 3), weights=None,
               penultimate_filters=4032, nb_blocks=6,
               use_auxiliary_branch=True, skip_reduction=False,
               weight_decay=0)
```
You can view the logs here https://www.floydhub.com/ptanikon2/projects/n-net/3
Why not try fine-tuning one of the pretrained NASNet blocks rather than training from scratch? Since you seem to have significant compute available, I suggest using NASNet Mobile as a base and adding layers on top to build your final classifier.
Also, use the -1 to 1 preprocessing for NASNets, especially when fine-tuning with pretrained weights.
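For reference, the -1 to 1 scaling is just a linear rescale of the uint8 pixel range. A minimal sketch in plain NumPy (the function name is mine, not from the repo):

```python
import numpy as np

def scale_minus_one_to_one(x):
    """Map uint8 pixel values in [0, 255] to float32 values in [-1, 1]."""
    return x.astype("float32") / 127.5 - 1.0

pixels = np.array([0, 127, 255], dtype=np.uint8)
scaled = scale_minus_one_to_one(pixels)  # endpoints map to exactly -1.0 and 1.0
```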
Okay, but weights are not available for NASNET_LARGE_WEIGHT_PATH_WITH_auxiliary_NO_TOP = "https://github.com/titu1994/Keras-NASNet/releases/download/v1.1/NASNet-auxiliary-large-no-top.h5". And when I don't use the auxiliary branch and use NASNet-large-no-top.h5 instead, it gives me the following error:
Use NASNet Mobile, not Large. Weights for all models are available. Refer to the Keras blog post on fine-tuning to see how to use no-top models for training.
I have tried NASNet Mobile, but its results are also not good (attached). Kindly suggest what else I can try. Secondly, there is a typo in the auxiliary weights path due to which the weights were not loading earlier: change 'auxiliary' to 'auxilary' in the weights path.
Did you find a solution for your problem? I've encountered a similar problem, and here is my post on Stack Overflow.
@maystroh No, not yet. Let me know too if you find any solution
Have you tried training it with the auxiliary loss for both the mobile and large models ? It is a strong regularizer which is required for training NASNet-A Model from scratch (not as important when just fine-tuning the Dense layers that you add).
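When a model has an auxiliary head, training minimizes a weighted sum of the two losses, with the auxiliary loss down-weighted so it regularizes without dominating. A tiny sketch of that arithmetic (the 0.4 weight follows the Inception/NASNet convention; treat the exact value as an assumption here):

```python
def total_loss(main_loss, aux_loss, aux_weight=0.4):
    """Weighted sum used when training with an auxiliary classifier.

    The auxiliary head's loss is scaled down (0.4 here, an assumed
    convention) so it acts as a regularizer on the shared features
    rather than as a second full objective.
    """
    return main_loss + aux_weight * aux_loss

# Example with arbitrary loss values for the two heads
loss = total_loss(main_loss=0.7452, aux_loss=0.1686)
```

In Keras this corresponds to compiling a two-output model with `loss_weights` (e.g. `loss_weights=[1.0, 0.4]`), so the framework computes this sum for you.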
Yes @titu1994, I have tried the auxiliary branch with both, but it's not making any difference either.
I don't really have an answer then. It's either a problem with the regularizer strength, or with the auxiliary branch not working, or perhaps the model is so large that it's overfitting.
The fact that the Mobile version doesn't do better either points to some other problem, though. Can you give more information about the dataset: number of samples, size of images, the task you are performing (classification, regression, or bounding box regression), and so on?
@titu1994 I experienced the same issue. You can see the training history in the middle of my Jupyter notebook.
@Agent007 An 87-million-parameter model for a dataset of 6600 images. I would say that is cause for concern for any large model, not just NASNet. Use the NASNet Mobile version and see if that reduces the error rate.
With your validation and test set scores being so low compared to training, I think there is something wrong with the data itself.
In cases from this thread, the discrepancy isn't that vast. The reduced performance is a concern, but your case is different.
My network learned from data. I used CIFAR 768 model but
I was reading through the paper again. They used a lot of tricks to get such high performance. Just skimming through a few:
I think the two major regularizers are DropPath and the auxiliary branch. Auxiliary branches are significant regularizers, and you don't usually graft such an unusual branch onto a model unless it significantly impacts the learning process.
Also, Keras in general does not seem to be able to exactly match the performance of PyTorch or base TensorFlow models (even when using TensorFlow as the backend) for some reason. Random initialization differs, and that's fine, but a 2-4% difference is a little weird. I try to match the papers almost word for word in Keras, down to the decay, initializers, and biases, but the result always seems to be 2-4% below what the paper claimed.
To date, I have never been able to get a basic ResNet-50 to the level of performance claimed in the original paper using Keras. TF manages to get close enough, though (with a 0.035% absolute difference, which is probably due to random initialization).
That is weird. Not sure why that would be. In my example I used Cutout and some other augmentation techniques and was still that far off. If their model is that dependent on an annealed schedule and the auxiliary branch, then I wouldn't say there is anything special about the network itself.
Also, what is the topology of an auxiliary branch? I read an updated paper that used NASNet, and all it talked about was Normal and Reduction cells. It would be cool to see a plot_model output or something similar to get a feel for what it is actually doing.
Thanks for the insights, by the way.
I'm having a hard time interpreting this cloud with the dots in the middle of it, and I can't find what it means in the paper. As a result, I don't understand the difference between h_i and h_{i-1}. Is the cloud performing some type of operation?
@pGit1 It's meant to show how multiple layers of Normal Cells should be connected when stacked on top of each other. In this case, there are skip connections from the input of the previous Normal Cell to the convolutions within the current Normal Cell.
@Agent007 I am not sure I am following. So h_i is the concatenated output of h_{i-1}, and the dotted lines represent the concatenated output of h_{i-2}?
H_i is the output of your current cell. H_i-1 is the input to your current cell. H_i-2 is the input to the cell prior to your current cell.
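The wiring described above can be sketched with stand-in operations. This is a toy NumPy illustration only: the branch ops are placeholders, not NASNet's actual separable convolutions or poolings.

```python
import numpy as np

def normal_cell(h_prev, h_prev_prev):
    """Toy NASNet-style cell: it receives BOTH the previous cell's output
    (h_{i-1}) and the output of the cell before that (h_{i-2}, the dotted
    skip connection), applies a few branches, and concatenates the branch
    outputs along the channel axis to form h_i."""
    branch_a = 0.5 * (h_prev + h_prev_prev)  # placeholder for a conv branch
    branch_b = np.maximum(h_prev, 0.0)       # placeholder for another branch
    return np.concatenate([branch_a, branch_b], axis=-1)

h_im2 = np.zeros((1, 8, 8, 4), dtype="float32")  # h_{i-2}
h_im1 = np.ones((1, 8, 8, 4), dtype="float32")   # h_{i-1}
h_i = normal_cell(h_im1, h_im2)                  # becomes h_{i-1} for the next cell
```

Because the branch outputs are concatenated, the cell output has more channels than either input, which is why the real cells include pointwise convolutions to keep the channel count in check.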
@titu1994 What layer would you freeze from when fine-tuning ImageNet weights with a (200000, 224, 224, 3) dataset across 128 classes?
@alkari I'd suggest holding off on using the Keras version of NASNet for fine-tuning purposes for the moment. There have been multiple independent reports suggesting that the weight loading mechanism was not perfect.
You can try fine-tuning the tensorflow models repository directly, for best results.
Thanks @titu1994, but wouldn't that be resolved by freezing deep enough into the network? Which is why I was wondering if there's an optimal layer to freeze from. Here's where I have it:
```python
model.trainable = True
set_trainable = False
for layer in model.layers:
    # Unfreeze everything from this layer onwards
    if layer.name == 'activation_253':
        set_trainable = True
    if set_trainable:
        layer.trainable = True
    else:
        layer.trainable = False
    print("layer {} is {}".format(layer.name, '+++trainable' if layer.trainable else '---frozen'))
```
Current results so far in training without augmentation:
```
Epoch 48/378
100/100 [==============================] - 538s 5s/step - loss: 2.9667 - predictions_loss: 0.7452 - aux_predictions_loss: 0.1686 - predictions_acc: 0.7764 - aux_predictions_acc: 0.9573 - val_loss: 4.2562 - val_predictions_loss: 1.5506 - val_aux_predictions_loss: 1.3832 - val_predictions_acc: 0.5697 - val_aux_predictions_acc: 0.6290
```
Usually all layers up to the last convolution layer are frozen, and then you add a new classifier on the end. I don't know how, but it's learning something. I'm guessing that even though the weights aren't loading correctly, the trainable portion of the network is still learning something useful.
Try establishing a baseline with a small batch-norm VGG network first.
Is weight loading completely fixed after the latest commit? @titu1994
Yes, though I haven't tested the auxiliary branches yet; the weights have been ported. If they still don't work, it's gonna be a problem.
It seems like overfitting. Try a simpler model.
> @pGit1 It's meant to show how multiple layers of Normal Cells should be connected when stacked on top of each other. In this case, there are skip connections from the input of the previous Normal Cell to the convolutions within the current Normal Cell.
@Agent007 Wait, years later your answer is CRYSTAL clear. Not sure why I was overcomplicating things.
Thanks for the amazing work. I am having an issue with network learning: my NASNet model isn't learning. Training accuracy is improving, but validation accuracy isn't changing and is stuck at 0.4194. Training data = 600 images, testing data = 62 images, image shape = (224, 224, 3), epochs = 10-15.