Open Apollo-XI opened 4 years ago
@Apollo-XI Learning rate of 4.0 seems too high. I recommend trying to lower it by 1-2 orders of magnitude.
@tombstone I tried. I used 4.0 because the ssd_mv2_mnas_fpn config uses it to train on the COCO dataset. Since I saw the gradients exploding before the error became NaN, I lowered the LR to 0.4 and 0.04. At 0.4, the gradients still exploded. At 0.04, I could train longer, but the net didn't learn anything :/
I used a smaller learning rate, but I think the model didn't train on my custom dataset. The first loss is around 2 (previously, when I used MobileNetV3, the first loss was higher, around 30), and after 5000 steps the loss is still around 2.
Learning rate setup:
cosine_decay_learning_rate {
learning_rate_base: .001
total_steps: 50000
warmup_learning_rate: .00026666
warmup_steps: 5000
}
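For anyone who wants to see what learning rate these settings give at a particular step, here is a minimal pure-Python sketch. It assumes the schedule is a linear warmup from warmup_learning_rate to learning_rate_base over warmup_steps, followed by a half-cosine decay to zero at total_steps, with no hold period. The function name and the formula are my own approximation, not the Object Detection API implementation, so treat the numbers as a sanity check only.

```python
import math


def cosine_decay_with_warmup(step,
                             learning_rate_base=0.001,
                             total_steps=50000,
                             warmup_learning_rate=0.00026666,
                             warmup_steps=5000):
    """Approximate learning rate at `step` (hypothetical helper, assumed formula)."""
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must be larger than warmup_steps")
    if step < warmup_steps:
        # Linear ramp from warmup_learning_rate up to learning_rate_base.
        slope = (learning_rate_base - warmup_learning_rate) / warmup_steps
        return warmup_learning_rate + slope * step
    if step > total_steps:
        # Assumption: the rate is held at zero once the schedule has finished.
        return 0.0
    # Half-cosine decay from learning_rate_base down to zero.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * learning_rate_base * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 2500, 5000, 27500, 50000):
        print(s, cosine_decay_with_warmup(s))
```

Under these assumptions the rate never exceeds 0.001, which is more than three orders of magnitude below the 4.0 in the original COCO config.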
It also happened to me when trying to train with this model. The starting loss is around 2 and stays around 2 seemingly forever, then it suddenly explodes and the training stops after a few thousand steps. I still can't figure out why. Really hope to see updates from others on this as well. Cheers,
Prerequisites
Please answer the following questions for yourself before submitting an issue.
1. The entire URL of the file you are using
Model: http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v2_mnasfpn_shared_box_predictor_320x320_coco_sync_2020_05_18.tar.gz
2. Describe the bug
When I load the ssd_mobilenet_v2_mnasfpn_coco checkpoint to train on the Oxford-IIIT Pets dataset, TF throws several errors, although training starts. However, the model doesn't learn anything and sometimes the error goes to NaN, as reported in #8549. The same errors appear when the train script loads the model to perform evaluation.
3. Steps to reproduce
Fine-tune on the Oxford Pets dataset, changing the COCO config to adapt to the new dataset:
Code to reproduce the issue:
4. Expected behavior
Model loads without errors and learns something.
5. Additional context
Install TF Object Detection: https://github.com/tensorflow/models/tree/master/research/object_detection
Message logs: log.txt
6. System information
Config file: