microsoft / MMdnn

MMdnn is a set of tools to help users inter-operate among different deep learning frameworks, e.g. through model conversion and visualization. It converts models between Caffe, Keras, MXNet, TensorFlow, CNTK, PyTorch, ONNX and CoreML.

Tensorflow frozen pb to Caffe error. KeyError: 'Clip' and AttributeError: min #856

Closed goncz closed 4 years ago

goncz commented 4 years ago

Platform: Ubuntu 16.04
Python version: 3.6.9
Source framework with version: TensorFlow 1.12.2 with TensorFlow GPU 1.12.2

I'm trying to run the frozen graph conversion example from https://github.com/microsoft/MMdnn/tree/master/mmdnn/conversion/tensorflow . However, when I run the command

mmconvert -sf tensorflow -iw mobilenet_v1_1.0_224/frozen_graph.pb --inNodeName input --inputShape 224,224,3 --dstNodeName MobilenetV1/Predictions/Softmax -df caffe -om tf_mobilenet

I get the following errors:

mmconvert -sf tensorflow -iw mobilenet_v1_1.0_224/frozen_graph.pb --inNodeName input --inputShape 224,224,3 --dstNodeName MobilenetV1/Predictions/Softmax -df caffe -om tf_mobilenet
/home/anaconda3/envs/tf_imagerecog_new/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/anaconda3/envs/tf_imagerecog_new/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/anaconda3/envs/tf_imagerecog_new/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/anaconda3/envs/tf_imagerecog_new/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/anaconda3/envs/tf_imagerecog_new/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/anaconda3/envs/tf_imagerecog_new/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
WARNING:tensorflow:From /home/anaconda3/envs/tf_imagerecog_new/lib/python3.6/site-packages/tensorflow/python/tools/strip_unused_lib.py:86: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.compat.v1.graph_util.extract_sub_graph
2020-06-17 11:04:43.769739: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2020-06-17 11:04:43.792569: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
2020-06-17 11:04:43.793201: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55a6066fdd30 executing computations on platform Host. Devices:
2020-06-17 11:04:43.793212: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2020-06-17 11:04:43.911828: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-17 11:04:43.912778: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.62
pciBusID: 0000:01:00.0
totalMemory: 7.76GiB freeMemory: 7.02GiB
2020-06-17 11:04:43.912792: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-06-17 11:04:43.913896: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-17 11:04:43.913904: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2020-06-17 11:04:43.913908: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2020-06-17 11:04:43.914013: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6833 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-06-17 11:04:43.915077: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55a60aec70c0 executing computations on platform CUDA. Devices:
2020-06-17 11:04:43.915088: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce RTX 2070, Compute Capability 7.5
IR network structure is saved as [601e3728ecf74f1282602df5fc674983.json].
IR network structure is saved as [601e3728ecf74f1282602df5fc674983.pb].
IR weights are saved as [601e3728ecf74f1282602df5fc674983.npy].
Parse file [601e3728ecf74f1282602df5fc674983.pb] with binary format successfully.
Target network code snippet is saved as [601e3728ecf74f1282602df5fc674983.py].
Target weights are saved as [601e3728ecf74f1282602df5fc674983.npy].
Traceback (most recent call last):
  File "/home/anaconda3/envs/tf_imagerecog_new/lib/python3.6/site-packages/caffe/net_spec.py", line 160, in _to_proto
    _param_names[self.type_name] + '_param'), k, v)
KeyError: 'Clip'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/anaconda3/envs/tf_imagerecog_new/bin/mmconvert", line 8, in <module>
    sys.exit(_main())
  File "/home/anaconda3/envs/tf_imagerecog_new/lib/python3.6/site-packages/mmdnn/conversion/_script/convert.py", line 112, in _main
    dump_code(args.dstFramework, network_filename + '.py', temp_filename + '.npy', args.outputModel, args.dump_tag)
  File "/home/anaconda3/envs/tf_imagerecog_new/lib/python3.6/site-packages/mmdnn/conversion/_script/dump_code.py", line 32, in dump_code
    save_model(MainModel, network_filepath, weight_filepath, dump_filepath)
  File "/home/naconda3/envs/tf_imagerecog_new/lib/python3.6/site-packages/mmdnn/conversion/caffe/saver.py", line 9, in save_model
    MainModel.make_net(dump_net)
  File "601e3728ecf74f1282602df5fc674983.py", line 153, in make_net
    print(n.to_proto(), file=fpb)
  File "/home/anaconda3/envs/tf_imagerecog_new/lib/python3.6/site-packages/caffe/net_spec.py", line 193, in to_proto
    top._to_proto(layers, names, autonames)
  File "/home/anaconda3/envs/tf_imagerecog_new/lib/python3.6/site-packages/caffe/net_spec.py", line 97, in _to_proto
    return self.fn._to_proto(layers, names, autonames)
  File "/home/anaconda3/envs/tf_imagerecog_new/lib/python3.6/site-packages/caffe/net_spec.py", line 162, in _to_proto
    assign_proto(layer, k, v)
  File "/home/anaconda3/envs/tf_imagerecog_new/lib/python3.6/site-packages/caffe/net_spec.py", line 64, in assign_proto
    is_repeated_field = hasattr(getattr(proto, name), 'extend')
AttributeError: min
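
A quick way to check whether an installed pycaffe knows the Clip layer is to look for a `clip_param` field on its `LayerParameter` message, since `net_spec.py` looks layer parameters up by layer type. The sketch below is only a diagnostic aid under that assumption; the helper names `has_param_field` and `caffe_has_clip` are hypothetical and not part of Caffe or MMdnn.

```python
def has_param_field(layer_parameter_cls, field_name):
    """Return True if the protobuf message class declares the given field."""
    return field_name in layer_parameter_cls.DESCRIPTOR.fields_by_name


def caffe_has_clip():
    """Return True/False for the installed pycaffe, or None if it cannot be imported."""
    try:
        from caffe.proto import caffe_pb2
    except ImportError:
        return None
    return has_param_field(caffe_pb2.LayerParameter, "clip_param")


if __name__ == "__main__":
    print("Clip supported by installed Caffe:", caffe_has_clip())
```

If this prints `False`, the Caffe build predates the Clip layer, which would be consistent with the `KeyError: 'Clip'` in the traceback.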
linmajia commented 4 years ago

@goncz , thank you very much for the feedback. It seems that your Caffe does not support the Clip operator, which MMdnn uses to implement Relu6. Here is a workaround:

  1. Go to the installation directory of MMdnn, e.g. ~/.local/lib/pythonX.Y/site-packages/mmdnn or /usr/lib/python3/dist-packages/mmdnn
  2. Open the file "mmdnn/conversion/caffe/caffe_emitter.py".
  3. Locate the function "def emit_Relu6(self, IR_node)" (e.g., line 609).
  4. Replace its implementation with the following single line of code: self.emit_Relu(IR_node)
  5. Relu will then be used to approximate Relu6. Note that Relu does not cap activations at 6, so the outputs and accuracy of the converted model may not exactly match those of the source model.
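
The steps above can be supported with a small sketch: one helper that guesses where the emitter file lives, and a pure-Python comparison of Relu and Relu6 that shows when the workaround changes results. The helper name `emitter_path` is hypothetical, and the path layout is an assumption taken from the traceback in this issue (`site-packages/mmdnn/conversion/caffe/`).

```python
import importlib.util
import os


def emitter_path(package="mmdnn"):
    """Guess the path of caffe_emitter.py inside an installed package, or None."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    root = list(spec.submodule_search_locations)[0]
    return os.path.join(root, "conversion", "caffe", "caffe_emitter.py")


def relu(x):
    """Standard ReLU: negative inputs become 0, positive inputs pass through."""
    return max(0.0, x)


def relu6(x):
    """ReLU6: like ReLU, but the output is additionally capped at 6."""
    return min(max(0.0, x), 6.0)


if __name__ == "__main__":
    print("Emitter file:", emitter_path())
    # The two functions agree up to 6 and differ only above it.
    for value in [-2.0, 0.0, 3.5, 6.0, 9.0]:
        print(value, relu(value), relu6(value))
```

The two activations agree for every input up to 6, so the converted model only deviates from the source model where an activation would have exceeded 6.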
linmajia commented 4 years ago

Duplicate of #836

goncz commented 4 years ago

Thank you @linmajia, this solved my problem.

goncz commented 4 years ago

However, I now have this problem:

I0618 10:51:23.611192  5265 layer_factory.hpp:77] Creating layer MobilenetV1_MobilenetV1_Conv2d_6_pointwise_BatchNorm_batchnorm_add_scale
I0618 10:51:23.611218  5265 net.cpp:122] Setting up MobilenetV1_MobilenetV1_Conv2d_6_pointwise_BatchNorm_batchnorm_add_scale
I0618 10:51:23.611222  5265 net.cpp:129] Top shape: 1 512 14 14 (100352)
I0618 10:51:23.611225  5265 net.cpp:137] Memory required for data: 68155264
I0618 10:51:23.611229  5265 layer_factory.hpp:77] Creating layer MobilenetV1_MobilenetV1_Conv2d_6_pointwise_Relu6
I0618 10:51:23.611233  5265 net.cpp:84] Creating Layer MobilenetV1_MobilenetV1_Conv2d_6_pointwise_Relu6
I0618 10:51:23.611249  5265 net.cpp:406] MobilenetV1_MobilenetV1_Conv2d_6_pointwise_Relu6 <- MobilenetV1_MobilenetV1_Conv2d_6_pointwise_BatchNorm_batchnorm_add
I0618 10:51:23.611253  5265 net.cpp:367] MobilenetV1_MobilenetV1_Conv2d_6_pointwise_Relu6 -> MobilenetV1_MobilenetV1_Conv2d_6_pointwise_BatchNorm_batchnorm_add (in-place)
I0618 10:51:23.611537  5265 net.cpp:122] Setting up MobilenetV1_MobilenetV1_Conv2d_6_pointwise_Relu6
I0618 10:51:23.611544  5265 net.cpp:129] Top shape: 1 512 14 14 (100352)
I0618 10:51:23.611563  5265 net.cpp:137] Memory required for data: 68556672
I0618 10:51:23.611567  5265 layer_factory.hpp:77] Creating layer MobilenetV1_MobilenetV1_Conv2d_7_depthwise_depthwise
I0618 10:51:23.611572  5265 net.cpp:84] Creating Layer MobilenetV1_MobilenetV1_Conv2d_7_depthwise_depthwise
I0618 10:51:23.611575  5265 net.cpp:406] MobilenetV1_MobilenetV1_Conv2d_7_depthwise_depthwise <- MobilenetV1_MobilenetV1_Conv2d_6_pointwise_BatchNorm_batchnorm_add
I0618 10:51:23.611579  5265 net.cpp:380] MobilenetV1_MobilenetV1_Conv2d_7_depthwise_depthwise -> MobilenetV1_MobilenetV1_Conv2d_7_depthwise_depthwise
F0618 10:51:24.405036  5265 cudnn_conv_layer.cpp:53] Check failed: status == CUDNN_STATUS_SUCCESS (4 vs. 0)  CUDNN_STATUS_INTERNAL_ERROR
*** Check failure stack trace: ***
Aborted (core dumped)

I have experienced something similar before and fixed it by adding the following code before creating the tf.Session. In which script should I put it in this case?

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand instead of all at once
linmajia commented 4 years ago

@goncz , since MMdnn invokes the underlying deep learning frameworks, I suggest that you try the CPU mode to avoid out-of-GPU-memory issues by temporarily hiding the GPUs:

export CUDA_VISIBLE_DEVICES=" "
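
The same idea can be applied to a single command from Python, without changing the shell environment. This is a minimal sketch: the helper names `cpu_only_env` and `run_cpu_only` are hypothetical, and it sets the variable to an empty string, which is an assumption that differs slightly from the quoted space in the comment above.

```python
import os
import subprocess


def cpu_only_env(base_env=None):
    """Return a copy of the environment in which no CUDA device is visible."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ""  # assumption: an empty value hides all GPUs
    return env


def run_cpu_only(cmd):
    """Run a command (given as a list of arguments) with the GPUs hidden."""
    return subprocess.run(
        cmd,
        env=cpu_only_env(),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )


# Example with the command from this issue (not executed here):
# result = run_cpu_only([
#     "mmconvert", "-sf", "tensorflow",
#     "-iw", "mobilenet_v1_1.0_224/frozen_graph.pb",
#     "--inNodeName", "input", "--inputShape", "224,224,3",
#     "--dstNodeName", "MobilenetV1/Predictions/Softmax",
#     "-df", "caffe", "-om", "tf_mobilenet",
# ])
# print(result.stdout, result.stderr)
```

Because the variable is set only for the child process, later commands in the same shell still see the GPU.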