OpRegularizerManager could not handle ops

mengdong commented 5 years ago

Hello,

I have tried a few examples from tensorflow/models with MorphNet (LeNet and ResNet). A simple MNIST model (https://github.com/mengdong/morph-net/blob/master/morph_net/examples/mnist/mnist-tutorial.py) works. However, I ran into problems with some more complex models that use the TensorFlow Estimator interface.

Is there a recommended way to use MorphNet with the tf Estimator interface? I know the Estimator adds quite a lot of overhead to the graph. Detailed information below:

Regarding lenet (https://github.com/mengdong/morph-net/blob/master/morph_net/examples/mnist/mnist.py) from https://github.com/tensorflow/models/tree/master/official/mnist, I observe that:

    I0904 13:14:27.240477 140031449261888 op_regularizer_manager.py:125] OpRegularizerManager found 63 ops and 4 sources.
    ......
    File "/home/dongm/workspace/laptop_mapping/morph-net/morph_net/framework/op_regularizer_manager.py", line 137, in __init__
    ['%s (%s)' % (o.name, o.type) for o in self._op_deque])
    RuntimeError: OpRegularizerManager could not handle ops: ['sequential/conv2d/BiasAdd (BiasAdd)', 'sequential/max_pooling2d_1/MaxPool (MaxPool)', 'sequential/conv2d_1/BiasAdd (BiasAdd)', 'sequential/max_pooling2d/MaxPool (MaxPool)', 'sequential/conv2d/BiasAdd/ReadVariableOp (ReadVariableOp)']

Regarding ResNet (https://github.com/mengdong/morph-net/blob/master/morph_net/examples/resnet/imagenet_main.py), I observe:

    I0904 11:27:34.397989 139699288442688 op_regularizer_manager.py:125] OpRegularizerManager found 629 ops and 53 sources.
    .....
    RuntimeError: OpRegularizerManager could not handle ops: 
    ['resnet_model/batch_normalization_45/FusedBatchNormV3 (FusedBatchNormV3)', 
    'resnet_model/Pad_6 (Pad)', 'resnet_model/batch_normalization_44/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_49/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_48/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_47/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_52/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_51/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/Squeeze (Squeeze)', 'resnet_model/final_reduce_mean (Identity)', 'resnet_model/Mean (Mean)', 'resnet_model/batch_normalization_50/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_43/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_24/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_11/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_1/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/Pad (Pad)', 
    'resnet_model/batch_normalization/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_4/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_3/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/max_pooling2d/MaxPool (MaxPool)', 'resnet_model/initial_max_pool (Identity)', 'resnet_model/batch_normalization_2/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_7/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_6/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_5/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_10/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_9/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_8/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_14/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_13/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/Pad_2 (Pad)', 'resnet_model/batch_normalization_12/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_17/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_16/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_15/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_20/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_19/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_18/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_23/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_22/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_21/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_27/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_26/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/Pad_4 (Pad)', 
    'resnet_model/batch_normalization_25/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_30/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_29/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_28/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_33/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_32/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_31/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_36/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_35/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_34/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_39/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_38/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_37/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_42/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_41/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_40/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_46/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_45/ReadVariableOp (ReadVariableOp)', 'resnet_model/batch_normalization_45/ReadVariableOp_1 (ReadVariableOp)']
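
To see at a glance which op types were rejected, the list in the error message can be tallied by type. This is a minimal sketch using only the Python standard library. `summarize_unhandled_ops` is a hypothetical helper, not part of MorphNet, and it assumes the `'name (Type)'` entry format shown in the messages above.

```python
import re
from collections import Counter

# Matches entries such as 'resnet_model/Pad_6 (Pad)' inside the RuntimeError text.
_OP_ENTRY = re.compile(r"'([^'()]+?)\s*\((\w+)\)'")


def summarize_unhandled_ops(error_message):
    """Count unhandled ops by type (hypothetical helper, not a MorphNet API)."""
    return Counter(op_type for _name, op_type in _OP_ENTRY.findall(error_message))


# The LeNet error message quoted above, used as sample input.
message = (
    "RuntimeError: OpRegularizerManager could not handle ops: "
    "['sequential/conv2d/BiasAdd (BiasAdd)', "
    "'sequential/max_pooling2d_1/MaxPool (MaxPool)', "
    "'sequential/conv2d_1/BiasAdd (BiasAdd)', "
    "'sequential/max_pooling2d/MaxPool (MaxPool)', "
    "'sequential/conv2d/BiasAdd/ReadVariableOp (ReadVariableOp)']"
)

for op_type, count in summarize_unhandled_ops(message).most_common():
    print(op_type, count)
```

Applied to the ResNet message, the same tally would make it obvious that almost every rejected op is a `FusedBatchNormV3`.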

eladeban commented 5 years ago

@ayp-google

@mengdong Thanks for raising this issue.

We routinely work with ResNet and other complicated models so I don't think that the complexity is the issue.

Are you using the encoding where the channels are in dim=3 (i.e. channels_last)?

Could you reproduce this behavior with a single ResNet unit and print the entire trace?

mengdong commented 5 years ago

Hello @eladeban,

Thanks for the prompt response. I don't think complexity is the issue either, as LeNet also hits a similar error. I suspect the additional nodes/ops created by the TensorFlow Estimator interface. I will try with TensorFlow-Slim to see how it works, since it seems you have had more success with it.

eladeban commented 5 years ago

Took another look. It might be the reduce_mean in line 542. Can you apply the regularizer so that it takes inputs prior to that?

qq: are you using channels_first?
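
The suggestion to take inputs prior to `reduce_mean` relies on the manager analysing the graph backwards from the ops it is given as the output boundary, so an op downstream of the boundary is never visited. Below is a toy sketch of that idea in plain Python. The graph, the op names, and the `ops_upstream_of` helper are illustrative assumptions, not MorphNet's actual implementation.

```python
from collections import deque

# Toy graph: each op maps to the ops that feed it. The names are made up for
# illustration; the real ResNet graph is far larger.
INPUTS_OF = {
    "images": [],
    "conv": ["images"],
    "batch_norm": ["conv"],
    "relu": ["batch_norm"],
    "block_output": ["relu"],
    "reduce_mean": ["block_output"],
    "logits": ["reduce_mean"],
}


def ops_upstream_of(output_boundary, inputs_of):
    """Return every op reachable by walking backwards from the boundary ops."""
    seen = set(output_boundary)
    queue = deque(output_boundary)
    while queue:
        op = queue.popleft()
        for producer in inputs_of.get(op, []):
            if producer not in seen:
                seen.add(producer)
                queue.append(producer)
    return seen


from_logits = ops_upstream_of(["logits"], INPUTS_OF)
from_block = ops_upstream_of(["block_output"], INPUTS_OF)

# Ops that drop out of the analysis when the boundary moves before reduce_mean.
print(sorted(from_logits - from_block))
```

Moving the boundary only removes downstream ops from the analysis. Everything upstream of it is still visited.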

mengdong commented 5 years ago

Sorry for the late reply. Yes, I am using channels_first. Let me modify the regularizer and give it a try.

mengdong commented 5 years ago

Hello, thank you for looking into the code. I have tried to modify the output_boundary to:

    name: "resnet_model/block_layer4"
    op: "Identity"
    input: "resnet_model/Relu_48"
    device: "/replica:0/task:0/device:GPU:0"
    attr {
      key: "T"
      value {
        type: DT_FLOAT
      }
    }

The entire trace is here:

I0910 17:29:56.384171 139780389660480 op_regularizer_manager.py:122] OpRegularizerManager starting analysis from: [<tf.Operation 'resnet_model/block_layer4' type=Identity>].
I0910 17:29:56.385807 139780389660480 op_regularizer_manager.py:125] OpRegularizerManager found 618 ops and 53 sources.
Traceback (most recent call last):
  File "imagenet_main.py", line 391, in <module>
    absl_app.run(main)
  File "/home/dongm/python-virtual-env/tftot/lib/python3.6/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/home/dongm/python-virtual-env/tftot/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "imagenet_main.py", line 385, in main
    run_imagenet(flags.FLAGS)
  File "imagenet_main.py", line 378, in run_imagenet
    shape=[DEFAULT_IMAGE_SIZE, DEFAULT_IMAGE_SIZE, NUM_CHANNELS])
  File "/home/dongm/workspace/laptop_mapping/morph-net/morph_net/examples/resnet/resnet_run_loop.py", line 705, in resnet_main
    max_steps=flags_obj.max_train_steps)
  File "/home/dongm/python-virtual-env/tftot/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/dongm/python-virtual-env/tftot/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1156, in _train_model
    return self._train_model_distributed(input_fn, hooks, saving_listeners)
  File "/home/dongm/python-virtual-env/tftot/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1219, in _train_model_distributed
    self._config._train_distribute, input_fn, hooks, saving_listeners)
  File "/home/dongm/python-virtual-env/tftot/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1299, in _actual_train_model_distributed
    self.config))
  File "/home/dongm/python-virtual-env/tftot/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1810, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/home/dongm/python-virtual-env/tftot/lib/python3.6/site-packages/tensorflow_core/python/distribute/one_device_strategy.py", line 356, in _call_for_each_replica
    return fn(*args, **kwargs)
  File "/home/dongm/python-virtual-env/tftot/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1146, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "imagenet_main.py", line 347, in imagenet_model_fn
    label_smoothing=flags.FLAGS.label_smoothing
  File "/home/dongm/workspace/laptop_mapping/morph-net/morph_net/examples/resnet/resnet_run_loop.py", line 398, in resnet_model_fn
    gamma_threshold=1e-3
  File "/home/dongm/workspace/laptop_mapping/morph-net/morph_net/network_regularizers/flop_regularizer.py", line 72, in __init__
    regularizer_blacklist=regularizer_blacklist)
  File "/home/dongm/workspace/laptop_mapping/morph-net/morph_net/framework/op_regularizer_manager.py", line 137, in __init__
    ['%s (%s)' % (o.name, o.type) for o in self._op_deque])
RuntimeError: OpRegularizerManager could not handle ops: ['resnet_model/batch_normalization_31/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_36/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_35/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_34/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_39/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_38/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_37/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_42/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_41/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_40/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_46/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_45/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/Pad_6 (Pad)', 'resnet_model/batch_normalization_44/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_49/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_48/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_47/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_52/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_51/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_50/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_43/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_24/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_11/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_1/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/Pad (Pad)', 'resnet_model/batch_normalization/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_4/FusedBatchNormV3 (FusedBatchNormV3)', 
'resnet_model/batch_normalization_3/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/max_pooling2d/MaxPool (MaxPool)', 'resnet_model/initial_max_pool (Identity)', 'resnet_model/batch_normalization_2/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_7/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_6/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_5/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_10/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_9/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_8/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_14/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_13/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/Pad_2 (Pad)', 'resnet_model/batch_normalization_12/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_17/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_16/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_15/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_20/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_19/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_18/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_23/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_22/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_21/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_27/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_26/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/Pad_4 (Pad)', 'resnet_model/batch_normalization_25/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_30/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_29/FusedBatchNormV3 
(FusedBatchNormV3)', 'resnet_model/batch_normalization_28/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_33/FusedBatchNormV3 (FusedBatchNormV3)', 'resnet_model/batch_normalization_32/FusedBatchNormV3 (FusedBatchNormV3)']

eladeban commented 5 years ago

channels_first is the problem. We assume channels_last... Note that you need to use channels_last only during structure learning; later you could revert to (faster?) channels_first.
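
For reference, the two layouts differ only in where the channel axis sits: `channels_first` (NCHW) has channels at axis 1, while `channels_last` (NHWC) has them at axis 3. Here is a small sketch in plain Python of the shape conversion and of choosing a layout per phase. The helper names, the phase label, and the example shape are hypothetical. In a real model the layout is normally selected through the layers' `data_format` argument rather than by permuting shapes by hand.

```python
# Axis permutations between the two 4-D image layouts.
NCHW_TO_NHWC = (0, 2, 3, 1)  # channels_first -> channels_last
NHWC_TO_NCHW = (0, 3, 1, 2)  # channels_last -> channels_first


def permute_shape(shape, perm):
    """Reorder a shape tuple the way a tensor transpose with `perm` would."""
    if len(shape) != len(perm):
        raise ValueError("shape and perm must have the same rank")
    return tuple(shape[axis] for axis in perm)


def data_format_for(phase):
    """Pick a layout per phase, following the advice in this thread.

    `phase` is a hypothetical label: 'structure_learning' or anything else.
    """
    return "channels_last" if phase == "structure_learning" else "channels_first"


nchw = (32, 3, 224, 224)  # example values: batch, channels, height, width
nhwc = permute_shape(nchw, NCHW_TO_NHWC)
print(nhwc)  # channels now sit at axis 3
print(permute_shape(nhwc, NHWC_TO_NCHW) == nchw)
print(data_format_for("structure_learning"))
```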

mengdong commented 5 years ago

I see. Let me try this again. Thanks for clarifying.

google-research / morph-net

OpRegularizerManager could not handle ops #111