Open RomStriker opened 3 years ago
The issue arises from line 446 in model_lib.py:
load_pretrained = hparams.load_pretrained if hparams else False;
because one of the previous commits changed hparams to None, as mentioned in #8695, so load_pretrained is always False. Setting it to True and reinstalling the object_detection library fixes the problem. However, as mentioned in #8695, there should be a better fix for this.
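For anyone following along, here is a minimal standalone sketch of why that expression can never be True once hparams is None. It is not the library code: `resolve_load_pretrained` and `resolve_with_default` are hypothetical helper names, and `SimpleNamespace` only stands in for the real hparams object.

```python
from types import SimpleNamespace


def resolve_load_pretrained(hparams):
    # Same expression as the line quoted from model_lib.py.
    # When hparams is None the condition is falsy, so the result is False
    # and the attribute is never read.
    return hparams.load_pretrained if hparams else False


def resolve_with_default(hparams, default=True):
    # Hypothetical alternative (an assumption, not an upstream fix):
    # fall back to a caller-supplied default when hparams is missing.
    if hparams is None:
        return default
    return getattr(hparams, "load_pretrained", default)


print(resolve_load_pretrained(None))                                   # False
print(resolve_load_pretrained(SimpleNamespace(load_pretrained=True)))  # True
print(resolve_with_default(None))                                      # True
```

The second helper just mirrors the "set it to True" workaround in a form that does not require editing the expression by hand. Whether True is the right default is a separate question for the maintainers.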
@RomStriker I ran into a somewhat similar issue while fine-tuning from checkpoints downloaded from the Model Zoo. Since the Model Zoo provides a trained checkpoint in a checkpoint directory, passing this directory gives an error. The training script throws warnings that the checkpoint directory is ignored, tries to create a checkpoint.tmpXXXXX directory, and errors out.
I'm guessing it's for the same reason, if it's still not fixed. Any thoughts?
However, training starts fine from scratch if I don't give the checkpoint directory.
People, myself included, trust TensorFlow V1 over V2. I am so close to switching libraries. Any suggestions? Does PyTorch work well with Horovod or scale easily to multi-GPU training?
Prerequisites
Please answer the following questions for yourself before submitting an issue.
1. The entire URL of the file you are using
https://github.com/tensorflow/models/blob/master/research/object_detection/model_main.py
2. Describe the bug
I am fine-tuning SSD-MobileNetV3 Large and SSD-MobileDet-CPU on the COCO 2017 dataset, but with only book classes. I have created a new dataset for this, inspected it, and it looks good. I have also modified the config file to my needs. When I start the training, it ignores the fine_tune_checkpoint provided in the config file and starts from scratch. However, if I repeat the process with the checkpoint placed in the model_dir directory instead, it tries to restore it, but since I have a different number of classes, it gives an error. How can I make the training process restore the checkpoint properly? I also tried the normal COCO dataset with all 90 classes: when I start the training, fine_tune_checkpoint is ignored, but if I put the checkpoint in model_dir, it is restored properly.
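To make the class-count error concrete, here is a rough sketch of the difference between a full restore and a fine-tune style restore that keeps only variables whose name and shape both match. It uses plain dictionaries instead of real checkpoints. The variable names and shapes are invented for illustration, and the assumption that a fine-tune restore filters this way is my understanding rather than something confirmed in this thread.

```python
def restorable_variables(model_shapes, checkpoint_shapes):
    """Keep only variables present in both mappings with identical shapes."""
    return {
        name: shape
        for name, shape in model_shapes.items()
        if checkpoint_shapes.get(name) == shape
    }


# Hypothetical shapes: the checkpoint was trained with 90 classes + background,
# while the new model has 1 class + background.
checkpoint_shapes = {
    "FeatureExtractor/conv/weights": (3, 3, 3, 16),
    "BoxPredictor/ClassPredictor/weights": (1, 1, 16, 91),
}
model_shapes = {
    "FeatureExtractor/conv/weights": (3, 3, 3, 16),
    "BoxPredictor/ClassPredictor/weights": (1, 1, 16, 2),
}

# A full restore needs every shape to match, so the class predictor is a problem.
mismatched = [
    name for name, shape in model_shapes.items()
    if checkpoint_shapes.get(name) != shape
]
print("mismatched:", mismatched)

# A filtered restore keeps the backbone and leaves the class predictor
# at its fresh initialization.
print("restorable:", sorted(restorable_variables(model_shapes, checkpoint_shapes)))
```

With all 90 classes the shapes line up everywhere, which would be consistent with the checkpoint in model_dir restoring cleanly in that case.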
3. Steps to reproduce
Config file:
4. Expected behavior
N/A
5. Additional context
N/A
6. System information