google / automl

Error while training on custom dataset #425

Closed shreymohan closed 4 years ago

shreymohan commented 4 years ago

I get the following error message while training EfficientDet-d0:

tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found. (0) Invalid argument: Incompatible shapes: [12,180,64,64] vs. [12,36,64,64] [[node focal_loss/logistic_loss/GreaterEqual (defined at /home/shrey/work_safety/vehicle_detection/train_val/automl/efficientdet/det_model_fn.py:165) ]] [[strided_slice_3/_8013]] (1) Invalid argument: Incompatible shapes: [12,180,64,64] vs. [12,36,64,64] [[node focal_loss/logistic_loss/GreaterEqual (defined at /home/shrey/work_safety/vehicle_detection/train_val/automl/efficientdet/det_model_fn.py:165) ]] 0 successful operations. 0 derived errors ignored.

@mingxingtan any suggestions? Using latest code and TF version 2.2.0-rc4

mingxingtan commented 4 years ago

Hi @shreymohan, it looks like your data_format is wrong: one is channels_first, the other is channels_last. Could you double-check your command line?

shreymohan commented 4 years ago

Hey! Thank you so much for replying. data_format is set to channels_last in hparams.

This is the cls_output I get from the utils.build_model_with_precision method. I am fine-tuning on the UA-Detrac dataset, which has 4 classes:

cls outputs {
3: <tf.Tensor 'class_net/class-predict/BiasAdd:0' shape=(12, 64, 64, 36) dtype=float32>,
4: <tf.Tensor 'class_net/class-predict_1/BiasAdd:0' shape=(12, 32, 32, 36) dtype=float32>,
5: <tf.Tensor 'class_net/class-predict_2/BiasAdd:0' shape=(12, 16, 16, 36) dtype=float32>,
6: <tf.Tensor 'class_net/class-predict_3/BiasAdd:0' shape=(12, 8, 8, 36) dtype=float32>,
7: <tf.Tensor 'class_net/class-predict_4/BiasAdd:0' shape=(12, 4, 4, 36) dtype=float32>}

These class outputs are passed to the detection_loss method in the det_model_fn.py file.

This is the cls_output I get while fine-tuning on VOC:

cls outputs {
3: <tf.Tensor 'class_net/class-predict/BiasAdd:0' shape=(12, 64, 64, 180) dtype=float32>,
4: <tf.Tensor 'class_net/class-predict_1/BiasAdd:0' shape=(12, 32, 32, 180) dtype=float32>,
5: <tf.Tensor 'class_net/class-predict_2/BiasAdd:0' shape=(12, 16, 16, 180) dtype=float32>,
6: <tf.Tensor 'class_net/class-predict_3/BiasAdd:0' shape=(12, 8, 8, 180) dtype=float32>,
7: <tf.Tensor 'class_net/class-predict_4/BiasAdd:0' shape=(12, 4, 4, 180) dtype=float32>}

Any kind of help would be much appreciated. Any thoughts, @fsx950223?
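
For reference, the two channel counts above are consistent with a class head that emits one score per class per anchor. The sketch below assumes 9 anchors per feature-map location (3 scales times 3 aspect ratios, which is believed to be the EfficientDet default but is not stated in this thread). The helper names are hypothetical and are not part of the automl code. Under that assumption, 36 corresponds to 4 classes and 180 to the 20 Pascal VOC classes, which suggests the mismatch is about the number of classes, not the data format.

```python
# Assumption: 9 anchors per location (3 scales x 3 aspect ratios).
# These helpers are illustrative only; they are not functions from the automl repo.
NUM_ANCHORS = 9


def class_head_channels(num_classes, num_anchors=NUM_ANCHORS):
    """Channel count of the class head: one logit per (anchor, class) pair."""
    return num_classes * num_anchors


def classes_from_channels(channels, num_anchors=NUM_ANCHORS):
    """Recover the number of classes implied by a class-head channel count."""
    if channels % num_anchors != 0:
        raise ValueError("channel count is not a multiple of num_anchors")
    return channels // num_anchors


print(class_head_channels(4))      # 36, the UA-Detrac shape in this thread
print(class_head_channels(20))     # 180, the VOC shape in this thread
print(classes_from_channels(180))  # 20
```

If the error reports 180 channels while the current model is built with 36, something in the run is probably still tied to the 20-class setup.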

shreymohan commented 4 years ago

These are the shapes of logits and targets in the focal loss method when the model is initialized:

logits: Tensor("class_net/class-predict/BiasAdd:0", shape=(12, 64, 64, 36), dtype=float32) targets: Tensor("Reshape:0", shape=(12, 64, 64, 36), dtype=float32)

logits: Tensor("class_net/class-predict_1/BiasAdd:0", shape=(12, 32, 32, 36), dtype=float32) targets: Tensor("Reshape_2:0", shape=(12, 32, 32, 36), dtype=float32)

logits: Tensor("class_net/class-predict_2/BiasAdd:0", shape=(12, 16, 16, 36), dtype=float32) targets: Tensor("Reshape_4:0", shape=(12, 16, 16, 36), dtype=float32)

logits: Tensor("class_net/class-predict_3/BiasAdd:0", shape=(12, 8, 8, 36), dtype=float32) targets: Tensor("Reshape_6:0", shape=(12, 8, 8, 36), dtype=float32)

logits: Tensor("class_net/class-predict_4/BiasAdd:0", shape=(12, 4, 4, 36), dtype=float32) targets: Tensor("Reshape_8:0", shape=(12, 4, 4, 36), dtype=float32)

But when actual data is loaded for training, it gives this error: tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found. (0) Invalid argument: Incompatible shapes: [12,180,64,64] vs. [12,36,64,64] [[{{node focal_loss/logistic_loss/GreaterEqual}}]] [[strided_slice_3/_8013]] (1) Invalid argument: Incompatible shapes: [12,180,64,64] vs. [12,36,64,64] [[{{node focal_loss/logistic_loss/GreaterEqual}}]]

Maybe there is something wrong in the way I generated tfrecords?

fsx950223 commented 4 years ago

It's a very basic problem: you should use a new ckpt path before training a new model.

shreymohan commented 4 years ago

@fsx950223 I am using the COCO ckpt for efficientdet-d0 from this repo.

fsx950223 commented 4 years ago

Where is your ckpt stored?

shreymohan commented 4 years ago

Thank you so much! I was pointing model_dir to an old path. It starts to train now, thanks again for the cue.
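
A quick way to catch this before launching a run is to check whether the directory passed as model_dir already holds a checkpoint. The sketch below is a minimal guard using only the standard library. It relies on TensorFlow writing a small index file literally named "checkpoint" into a directory when it saves there. The function names are hypothetical and not part of the automl code. The thread does not spell out why the stale directory caused the shape mismatch, so treat the explanation in the comments as an assumption.

```python
import os


def has_existing_checkpoint(model_dir):
    """Return True if model_dir contains TensorFlow's "checkpoint" index file."""
    return os.path.isfile(os.path.join(model_dir, "checkpoint"))


def require_fresh_model_dir(model_dir):
    """Raise if model_dir looks like it was used by an earlier training run.

    Assumption: a leftover checkpoint from a run with a different num_classes
    (for example a 20-class VOC run) can clash with the new model's shapes.
    """
    if has_existing_checkpoint(model_dir):
        raise RuntimeError(
            "model_dir %r already contains a checkpoint; "
            "use a new, empty directory for a new model" % model_dir
        )
    return model_dir
```

Calling require_fresh_model_dir on the intended path before starting training turns a confusing shape error into an explicit message.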