tensorflow / models

Models and examples built with TensorFlow

Quantized SSD-MobileNet Checkpoints Missing Min/Max? #4783

Closed parvizp closed 6 years ago

parvizp commented 6 years ago

System information

Describe the problem

@achowdhery Are the checkpoints of the "quantized" models in the zoo supposed to contain the FakeQuant (Min/Max) variables?

I tried ssd_mobilenet_v1_0.75_depth_quantized_coco and ssd_mobilenet_v1_quantized_coco.

I used the command from the tutorial to export a quantized TF-Lite model.
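
For reference, a minimal way to check whether a checkpoint carries the act_quant min/max variables the export script expects; this is only a sketch assuming TF 1.x, using the checkpoint prefix from the command below:

import tensorflow as tf  # TF 1.x

ckpt_prefix = ("object_detection/graphs/"
               "ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync_2018_07_03/"
               "model.ckpt")
# tf.train.list_variables returns (name, shape) pairs for every variable in the checkpoint.
quant_vars = [name for name, _ in tf.train.list_variables(ckpt_prefix)
              if name.endswith("act_quant/min") or name.endswith("act_quant/max")]
print("act_quant min/max variables found:", len(quant_vars))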

Source code / logs

parvizp@cent-nano-0:~/Git/tensorflow.new/models/research$ CUDA_VISIBLE_DEVICES=-1 python object_detection/export_tflite_ssd_graph.py --pipeline_config_path=object_detection/graphs/ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync_2018_07_03/pipeline.config --trained_checkpoint_prefix=object_detection/graphs/ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync_2018_07_03/model.ckpt --output_directory=/tmp/ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync_2018_07_03/  --add_postprocessing_op=true
2018-07-16 10:47:11.378546: E tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2018-07-16 10:47:11.378601: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] retrieving CUDA diagnostic information for host: cent-nano-0
2018-07-16 10:47:11.378610: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:170] hostname: cent-nano-0
2018-07-16 10:47:11.378641: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:194] libcuda reported version is: 396.26.0
2018-07-16 10:47:11.378670: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 396.26.0
2018-07-16 10:47:11.378678: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:305] kernel version seems to match DSO: 396.26.0
2018-07-16 10:47:13.255871: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key BoxPredictor_0/BoxEncodingPredictor/act_quant/max not found in checkpoint
Traceback (most recent call last):
  File "object_detection/export_tflite_ssd_graph.py", line 137, in <module>
    tf.app.run(main)
  File "/home/parvizp/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "object_detection/export_tflite_ssd_graph.py", line 133, in main
    FLAGS.max_classes_per_detection)
  File "/home/parvizp/Git/tensorflow.new/models/research/object_detection/export_tflite_ssd_graph_lib.py", line 261, in export_tflite_graph
    initializer_nodes='')
  File "/home/parvizp/Git/tensorflow.new/models/research/object_detection/exporter.py", line 72, in freeze_graph_with_def_protos
    saver.restore(sess, input_checkpoint)
  File "/home/parvizp/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1743, in restore
    err, "a Variable name or other graph key that is missing")
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key BoxPredictor_0/BoxEncodingPredictor/act_quant/max not found in checkpoint
         [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Caused by op u'save/RestoreV2', defined at:
  File "object_detection/export_tflite_ssd_graph.py", line 137, in <module>
    tf.app.run(main)
  File "/home/parvizp/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "object_detection/export_tflite_ssd_graph.py", line 133, in main
    FLAGS.max_classes_per_detection)
  File "/home/parvizp/Git/tensorflow.new/models/research/object_detection/export_tflite_ssd_graph_lib.py", line 261, in export_tflite_graph
    initializer_nodes='')
  File "/home/parvizp/Git/tensorflow.new/models/research/object_detection/exporter.py", line 67, in freeze_graph_with_def_protos
    tf.import_graph_def(input_graph_def, name='')
  File "/home/parvizp/.local/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
    return func(*args, **kwargs)
  File "/home/parvizp/.local/lib/python2.7/site-packages/tensorflow/python/framework/importer.py", line 442, in import_graph_def
    _ProcessNewOps(graph)
  File "/home/parvizp/.local/lib/python2.7/site-packages/tensorflow/python/framework/importer.py", line 234, in _ProcessNewOps
    for new_op in graph._add_new_tf_operations(compute_devices=False):  # pylint: disable=protected-access
  File "/home/parvizp/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3360, in _add_new_tf_operations
    for c_op in c_api_util.new_tf_operations(self)
  File "/home/parvizp/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3251, in _create_op_from_tf_operation
    ret = Operation(c_op, self)
  File "/home/parvizp/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1716, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key BoxPredictor_0/BoxEncodingPredictor/act_quant/max not found in checkpoint
         [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
achowdhery commented 6 years ago

@parvizp Have you tried the checkpoint in the tutorial also copied here? https://storage.googleapis.com/download.tensorflow.org/models/tflite/ssd_mobilenet_v1_0.75_depth_300x300_quant_pets_2018_06_29.zip

We often see this error when the checkpoint has not been run through export_tflite_ssd_graph.py, so I'm double-checking that you have already passed the checkpoint through that script to get the frozen graph. I can also double-check the ones in the model zoo.

parvizp commented 6 years ago

@achowdhery Thanks, I just tried your URL and the export succeeds.

achowdhery commented 6 years ago

Thanks. I have verified that the models from the model zoo convert correctly as well.

melody-rain commented 6 years ago

@achowdhery I got similar errors when I converted models from https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md

NotFoundError (see above for traceback): Key BoxPredictor_0/BoxEncodingPredictor/act_quant/max not found in checkpoint

The model I am trying to convert is

ssd_mobilenet_v1_quantized_300x300_coco14_sync_2018_07_03
achowdhery commented 6 years ago

Please give exact instructions to reproduce; we need to make sure we are seeing the same issue.

melody-rain commented 6 years ago

@achowdhery I followed your blog https://medium.com/tensorflow/training-and-serving-a-realtime-mobile-object-detector-in-30-minutes-with-cloud-tpus-b78971cf1193. The only difference is the model file I tried to export.

Export the model with:

python object_detection/export_tflite_ssd_graph.py \
--pipeline_config_path object_detection/samples/configs/ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync.config \
--trained_checkpoint_prefix ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync_2018_07_03/model.ckpt \
--output_directory ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync_2018_07_03/tflite \
--add_postprocessing_op true
achowdhery commented 6 years ago

If you start with this checkpoint, does it work: https://storage.googleapis.com/download.tensorflow.org/models/tflite/ssd_mobilenet_v1_0.75_depth_300x300_quant_pets_2018_06_29.zip

junshanlee commented 6 years ago

Hi, I also ran into the same problem as melody-rain. The checkpoint from https://storage.googleapis.com/download.tensorflow.org/models/tflite/ssd_mobilenet_v1_0.75_depth_300x300_quant_pets_2018_06_29.zip works fine, but the export fails when starting from the checkpoint in the model zoo.

achowdhery commented 6 years ago

Thanks. The models have been updated in the model zoo now: https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md

RichardLiee commented 6 years ago

@achowdhery Hi, I want to export ssdlite_mobilenet_v2 and I hit a similar issue: Tensor name "BoxPredictor_0/BoxEncodingPredictor/biases" not found in checkpoint files

achowdhery commented 6 years ago

@RichardLiee What checkpoint are you using? Please provide a link.

RichardLiee commented 6 years ago

hello, here: http://download.tensorflow.org/models/object_detection/ssdlite_mobilenet_v2_coco_2018_05_09.tar.gz

averdones commented 6 years ago

@RichardLiee have you checked this?

oopsodd commented 5 years ago

Hi @achowdhery, I tried to train a quantized model for mobile devices, but when I converted the model to tflite I got this:

tensorflow/lite/toco/tooling_util.cc:1694] Array FeatureExtractor/MobilenetV1/MobilenetV1/Squeeze_excitation_Conv2d_3_depthwise/mul, which is an input to the Conv operator producing the output array FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Relu6, is lacking min/max data, which is necessary for quantization.
If accuracy matters, either target a non-quantized output format, or run quantized training with your model from a floating point checkpoint to change the input graph to contain min/max information.
If you don't care about accuracy, you can pass --default_ranges_min= and --default_ranges_max= for easy experimentation.
Aborted (core dumped)

Please help with this; does the training process need additional configuration?

When I added --default_ranges_min=0 --default_ranges_max=6, the tflite accuracy dropped badly. It does work for some cases (only decreasing the accuracy a bit).
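
For reference, the same "dummy quantization" fallback can be set through the Python converter API; this is a rough sketch assuming TF 1.x (1.13+), with the frozen-graph path as a placeholder and the standard SSD export tensor names:

import tensorflow as tf  # TF 1.x

converter = tf.lite.TFLiteConverter.from_frozen_graph(
    graph_def_file="tflite_graph.pb",  # placeholder: output of export_tflite_ssd_graph.py
    input_arrays=["normalized_input_image_tensor"],
    output_arrays=["TFLite_Detection_PostProcess", "TFLite_Detection_PostProcess:1",
                   "TFLite_Detection_PostProcess:2", "TFLite_Detection_PostProcess:3"],
    input_shapes={"normalized_input_image_tensor": [1, 300, 300, 3]})
converter.inference_type = tf.lite.constants.QUANTIZED_UINT8
converter.quantized_input_stats = {"normalized_input_image_tensor": (128, 128)}  # (mean, std_dev)
# Fallback ranges for tensors that have no recorded min/max: the programmatic
# counterpart of --default_ranges_min/--default_ranges_max, and the reason the
# accuracy drops.
converter.default_ranges_stats = (0, 6)
converter.allow_custom_ops = True  # TFLite_Detection_PostProcess is a custom op
open("model.tflite", "wb").write(converter.convert())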

NorwayLobster commented 5 years ago

I am facing the exact same problem as you, @oopsodd. Can someone give a hint or a solution to this problem?

oopsodd commented 5 years ago

I didn't solve the problem. The (default_ranges_min=0, default_ranges_max=6) option only works for some specific input image sizes of the same network.

NorwayLobster commented 5 years ago

Thank you for your reply, @oopsodd. As you said, "dummy quantization" does not perform well.

samuel1208 commented 5 years ago

I hit the same problem when training my own quantized model. How can I fix it?

raj-shah14 commented 5 years ago

I faced the same "Key BoxPredictor_0/BoxEncodingPredictor/act_quant/max not found in checkpoint" issue when I tried to train with http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v2_quantized_300x300_coco_2018_09_14.tar.gz

chrissaher commented 5 years ago

I fixed that problem by using the downloaded model as a pretrained model. Then, in the configuration file used for training, this must be added at the end before retraining:

graph_rewriter {
  quantization {
    delay: 48000
    weight_bits: 8
    activation_bits: 8
  }
}

By doing this, the model is prepared for later quantization, and the necessary min/max information is exported into the checkpoint.

raj-shah14 commented 5 years ago

> I fixed that problem by using the downloaded model as a pretrained model. Then, in the configuration file used for training, this must be added at the end before retraining: graph_rewriter { quantization { delay: 48000 weight_bits: 8 activation_bits: 8 } } By doing this, the model is prepared for later quantization, and the necessary min/max information is exported into the checkpoint.

@chrissaher I tried what you suggested but I still get the same error.

chrissaher commented 5 years ago

Can you please provide the configuration file you are using for training?

raj-shah14 commented 5 years ago

> Can you please provide the configuration file you are using for training?

@chrissaher https://github.com/tensorflow/models/blob/master/research/object_detection/samples/configs/ssd_mobilenet_v2_quantized_300x300_coco.config

This is the config file I used, with the proper paths filled in for the PATH_TO_BE_CONFIGURED placeholders.

raj-shah14 commented 5 years ago

@chrissaher I downloaded the http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v2_quantized_300x300_coco_2018_09_14.tar.gz model from the model zoo, but it doesn't have a checkpoint file in it.

chrissaher commented 5 years ago

@raj-shah14 I successfully converted that model to tflite using the following command (please adjust the paths to your files):

tflite_convert \
  --output_file="object_detection/zoo/ssd_mobilenet_v2_quantized_300x300_coco_2018_09_14/model.tflite" \
  --graph_def_file="object_detection/zoo/ssd_mobilenet_v2_quantized_300x300_coco_2018_09_14/tflite_graph.pb" \
  --inference_type=QUANTIZED_UINT8 \
  --input_arrays="normalized_input_image_tensor" \
  --output_arrays="TFLite_Detection_PostProcess","TFLite_Detection_PostProcess:1","TFLite_Detection_PostProcess:2","TFLite_Detection_PostProcess:3" \
  --mean_values=128 \
  --std_dev_values=128 \
  --input_shapes=1,300,300,3 \
  --change_concat_input_ranges=false \
  --allow_nudging_weights_to_use_fast_gemm_kernel=true \
  --allow_custom_ops

raj-shah14 commented 5 years ago

@chrissaher Thanks for your reply. I was also able to do this, but this is post-training quantization and it affects the accuracy too much. I was trying to do quantization-aware training. It would be great if you could guide me with that.

da2r-20 commented 5 years ago

Hey @raj-shah14 I have the same issue.

@chrissaher Adding graph_rewriter { quantization { delay: 48000 weight_bits: 8 activation_bits: 8 } } didn't work for me either.

I'm trying to train ssd_mobilenet_v2_quantized_300x300_coco using the legacy train.py and then freeze the checkpoint. It fails when I try to freeze it.

When I train with --num_clones=1 the freeze succeeds, but with --num_clones=4 it fails.

Did anyone solve it?

holyhao commented 5 years ago

@oopsodd @NorwayLobster I hit the same issue when training a quantized model and trying to convert it to TFLite. Do you have any ideas about this?

achowdhery commented 5 years ago

Did you try using the export script https://github.com/tensorflow/models/blob/master/research/object_detection/export_tflite_ssd_graph.py instead?

holyhao commented 5 years ago

@achowdhery Yes, the SSD model works fine for me, but when I use the weight-shared architecture (PPN) and train it with quantization, toco fails with the following error:

Array WeightSharedConvolutionalBoxPredictor/PredictionTower/conv2d_0/Conv2D, which is an input to the Mul operator producing the output array WeightSharedConvolutionalBoxPredictor/Relu6, is lacking min/max data, which is necessary for quantization.

roadcode commented 5 years ago

@doronAtuar I also met the same problem. I looked into the saved checkpoint and found that something is wrong in it; there are some nodes like

clone_1/FeatureExtractor/MobilenetV2/expanded_conv_6/expand/act_quant/clone_1/FeatureExtractor/MobilenetV2/expanded_conv_6/expand/act_quant/max/biased

but it should actually be

FeatureExtractor/MobilenetV2/expanded_conv_6/expand/act_quant/max/biased

I fixed this by rewriting the node names.
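
A rough sketch of how such a rename can be done (TF 1.x, placeholder checkpoint paths; the duplicated-prefix pattern is the one shown above): load the checkpoint, strip the clone scopes from the variable names, and re-save it.

import re
import tensorflow as tf  # TF 1.x

SRC_CKPT = "model.ckpt"        # placeholder: checkpoint written by multi-clone training
DST_CKPT = "model_fixed.ckpt"  # placeholder: where the renamed checkpoint is saved

reader = tf.train.load_checkpoint(SRC_CKPT)

with tf.Graph().as_default(), tf.Session() as sess:
    new_vars, seen = [], set()
    for old_name in reader.get_variable_to_shape_map():
        # Drop the "clone_<k>/" scopes added by multi-clone training.
        new_name = re.sub(r"clone_\d+/", "", old_name)
        # If that leaves a doubled path such as "A/act_quant/A/act_quant/max/biased",
        # keep a single copy.
        parts = new_name.split("/act_quant/")
        if len(parts) == 3 and parts[0] == parts[1]:
            new_name = parts[0] + "/act_quant/" + parts[2]
        if new_name in seen:
            continue  # several clones map to the same variable; keep the first
        seen.add(new_name)
        new_vars.append(tf.Variable(reader.get_tensor(old_name), name=new_name))
    sess.run(tf.global_variables_initializer())
    tf.train.Saver(var_list=new_vars).save(sess, DST_CKPT)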

git-hamza commented 4 years ago

> @chrissaher Thanks for your reply. I was also able to do this, but this is post-training quantization and it affects the accuracy too much. I was trying to do quantization-aware training. It would be great if you could guide me with that.

Did you find any help on quantization-aware training? I am trying to use ssd_mobilenet_v1_quantized_coco as my pretrained model for training, but I get an error. When I use ssd_mobilenet_v1_coco it works, but the loss does not converge and training is too slow.

zheyangshi commented 4 years ago

I met the same problem when I tried to train MobileNetV3 (quantization-aware), TF version 1.15.2, Ubuntu 18.04 image.

donald2016 commented 4 years ago

> @doronAtuar I also met the same problem. I looked into the saved checkpoint and found that something is wrong in it; there are some nodes like
>
> clone_1/FeatureExtractor/MobilenetV2/expanded_conv_6/expand/act_quant/clone_1/FeatureExtractor/MobilenetV2/expanded_conv_6/expand/act_quant/max/biased
>
> but it should actually be
>
> FeatureExtractor/MobilenetV2/expanded_conv_6/expand/act_quant/max/biased
>
> I fixed this by rewriting the node names.

I am facing the same problem when using multi-clone training. How did you rewrite the node names?

Thank you.

parameswaraRao-13 commented 3 years ago

Please add the following lines at the end of the config file:

graph_rewriter {
  quantization {
    delay: 48000
    weight_bits: 8
    activation_bits: 8
  }
}

These lines enable quantization-aware training for SSD models, so the min/max values are set automatically in the graph during training.
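
For context, the graph_rewriter block turns on TensorFlow's quantization rewriter, which is built on tf.contrib.quantize: FakeQuant ops are inserted into the training graph so the min/max ranges are learned and saved into the checkpoint. A minimal toy sketch of that mechanism in TF 1.x (illustrative graph only, not the actual Object Detection API code path):

import tensorflow as tf  # TF 1.x (tf.contrib is required)

g = tf.Graph()
with g.as_default():
    images = tf.placeholder(tf.float32, [None, 300, 300, 3])
    net = tf.layers.conv2d(images, 8, 3, activation=tf.nn.relu6)
    loss = tf.reduce_mean(net)
    # Rewrite the graph for quantization-aware training: FakeQuant ops are
    # inserted and their min/max variables end up in the checkpoint.
    # quant_delay mirrors the "delay: 48000" field in the config above.
    tf.contrib.quantize.create_training_graph(input_graph=g, quant_delay=48000)
    train_op = tf.train.GradientDescentOptimizer(1e-3).minimize(loss)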

alelotti96 commented 3 years ago

@parameswaraRao-13 This actually didn't solve the problem. I'm still facing the same issue while converting the ssd_mobilenet_v2_mnasfpn_coco model from the model zoo (after QAT) with export_tflite_ssd_graph.py. Any news from anybody?