tensorpack / tensorpack

A Neural Net Training Interface on TensorFlow, with focus on speed + flexibility
Apache License 2.0

problem with restoring IRv2 checkpoint #148

Closed · a-maci closed this issue 7 years ago

a-maci commented 7 years ago

Not an issue specifically with tensorpack.

I was trying to load a checkpoint model for Inception-ResNet-v2 (uploaded here). I re-used the tf.slim model for this network mentioned here and used tensorpack as the front-end. Attached are both of these files (in text format, since I can't upload .py files).

I am running into problems restoring the graph variables. Here is the last portion of the error log:

W tensorflow/core/framework/op_kernel.cc:975] Invalid argument: Expected to restore a tensor of type int32, got a tensor of type int64 instead: tensor_name = global_step
         [[Node: 140222649036424/RestoreV2_908 = RestoreV2[dtypes=[DT_INT32], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_140222649036424/Const_0, 140222649036424/RestoreV2_908/tensor_names, 140222649036424/RestoreV2_908/shape_and_slices)]]
(the same warning is repeated several more times)

Traceback (most recent call last):
  File "tpInception_resnet_v2.py", line 204, in <module>
    SyncMultiGPUTrainer(config).train()
  File "/home/akm/TF/tpNew/tensorpack/tensorpack/train/base.py", line 64, in train
    self.setup()
  File "/home/akm/TF/tpNew/tensorpack/tensorpack/train/base.py", line 146, in setup
    self.config.session_init.init(self.sess)
  File "/home/akm/TF/tpNew/tensorpack/tensorpack/tfutils/sessinit.py", line 35, in init
    self._init(sess)
  File "/home/akm/TF/tpNew/tensorpack/tensorpack/tfutils/sessinit.py", line 82, in _init
    saver.restore(sess, self.path)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1388, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 766, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 964, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1014, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1034, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Expected to restore a tensor of type int32, got a tensor of type int64 instead: tensor_name = global_step
         [[Node: 140222649036424/RestoreV2_908 = RestoreV2[dtypes=[DT_INT32], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_140222649036424/Const_0, 140222649036424/RestoreV2_908/tensor_names, 140222649036424/RestoreV2_908/shape_and_slices)]]
         [[Node: 140222649036424/RestoreV2_688/_965 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_5427_140222649036424/RestoreV2_688", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

Caused by op u'140222649036424/RestoreV2_908', defined at:
  File "tpInception_resnet_v2.py", line 204, in <module>
    SyncMultiGPUTrainer(config).train()
  File "/home/akm/TF/tpNew/tensorpack/tensorpack/train/base.py", line 64, in train
    self.setup()
  File "/home/akm/TF/tpNew/tensorpack/tensorpack/train/base.py", line 146, in setup
    self.config.session_init.init(self.sess)
  File "/home/akm/TF/tpNew/tensorpack/tensorpack/tfutils/sessinit.py", line 35, in init
    self._init(sess)
  File "/home/akm/TF/tpNew/tensorpack/tensorpack/tfutils/sessinit.py", line 81, in _init
    saver = tf.train.Saver(var_list=dic, name=str(id(dic)), write_version=2)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1000, in __init__
    self.build()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1030, in build
    restore_sequentially=self._restore_sequentially)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 624, in build
    restore_sequentially, reshape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 361, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 200, in restore_op
    [spec.tensor.dtype])[0])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 441, in restore_v2
    dtypes=dtypes, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Expected to restore a tensor of type int32, got a tensor of type int64 instead: tensor_name = global_step
         [[Node: 140222649036424/RestoreV2_908 = RestoreV2[dtypes=[DT_INT32], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_140222649036424/Const_0, 140222649036424/RestoreV2_908/tensor_names, 140222649036424/RestoreV2_908/shape_and_slices)]]
         [[Node: 140222649036424/RestoreV2_688/_965 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_5427_140222649036424/RestoreV2_688", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

Prefetch process exited.
Prefetch process exited.

This error is strange since I don't have a tensor called global_step. I even tried to change the dtype of GLOBAL_STEP_OP_NAME in tfutils/common.py to tf.int64, but I get some other error with it.

Any suggestions/pointers as to what I am doing wrong and/or how to fix this? BTW, I am still using TF version 0.12.0rc1. Not sure if going to 1.0.0 would fix this problem.

run command: python tpInception_resnet_v2.py --gpu 0,1,2,3,4,5,6,7 --data /tank/imagenet-tensorpack-data --load googleModel/inception_resnet_v2_2016_08_30.ckpt

a-maci commented 7 years ago

I updated to TF 1.0.0. Getting a different error this time.

Traceback (most recent call last):
  File "tpInception_resnet_v2.py", line 204, in <module>
    SyncMultiGPUTrainer(config).train()
  File "/home/akm/TF/tpNew/tensorpack/tensorpack/train/base.py", line 64, in train
    self.setup()
  File "/home/akm/TF/tpNew/tensorpack/tensorpack/train/base.py", line 130, in setup
    self._setup()   # subclass will setup the graph
  File "/home/akm/TF/tpNew/tensorpack/tensorpack/train/multigpu.py", line 133, in _setup
    self.config.tower, lambda: self._get_cost_and_grad()[1])
  File "/home/akm/TF/tpNew/tensorpack/tensorpack/train/multigpu.py", line 41, in _multi_tower_grads
    grad_list.append(get_tower_grad_func())
  File "/home/akm/TF/tpNew/tensorpack/tensorpack/train/multigpu.py", line 133, in <lambda>
    self.config.tower, lambda: self._get_cost_and_grad()[1])
  File "/home/akm/TF/tpNew/tensorpack/tensorpack/train/feedfree.py", line 54, in _get_cost_and_grad
    self.build_train_tower()
  File "/home/akm/TF/tpNew/tensorpack/tensorpack/train/feedfree.py", line 43, in build_train_tower
    f()
  File "/home/akm/TF/tpNew/tensorpack/tensorpack/train/feedfree.py", line 36, in f
    self.model.build_graph(inputs)
  File "/home/akm/TF/tpNew/tensorpack/tensorpack/models/model_desc.py", line 113, in build_graph
    self._build_graph(model_inputs)
  File "tpInception_resnet_v2.py", line 53, in _build_graph
    logits, end_points = inception_resnet_v2(image, is_training=is_training)
  File "/home/akm/TF/tpNew/tensorpack/examples/IRv2/inception_resnet_v2.py", line 169, in inception_resnet_v2
    net = tf.concat(3, [tower_conv, tower_conv1_1, tower_conv2_2, tower_pool_1])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 1047, in concat
    dtype=dtypes.int32).get_shape(
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 651, in convert_to_tensor
    as_ref=False)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 716, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py", line 176, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py", line 165, in constant
    tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/tensor_util.py", line 367, in make_tensor_proto
    _AssertCompatible(values, dtype)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/tensor_util.py", line 302, in _AssertCompatible
    (dtype.name, repr(mismatch), type(mismatch).__name__))
TypeError: Expected int32, got list containing Tensors of type '_Message' instead.
Prefetch process exited.
Prefetch process exited.

I guess it has something to do with the concat API. I only made minor modifications to the file.

ppwwyyxx commented 7 years ago

The second problem is because TF 1.0 changed the API of tf.concat. You should swap the argument order of tf.concat in inception_resnet_v2.py.
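
For reference, that's the call at line 169 of inception_resnet_v2.py in the traceback; a minimal before/after sketch (the tensors here are dummies with made-up shapes, just to keep the snippet self-contained):

```python
import tensorflow as tf

# Dummy branch outputs, standing in for the real tower tensors.
tower_conv = tf.zeros([1, 35, 35, 96])
tower_conv1_1 = tf.zeros([1, 35, 35, 64])
tower_conv2_2 = tf.zeros([1, 35, 35, 96])
tower_pool_1 = tf.zeros([1, 35, 35, 64])

# TF <= 0.12 signature: tf.concat(axis, values)
# net = tf.concat(3, [tower_conv, tower_conv1_1, tower_conv2_2, tower_pool_1])

# TF 1.0+ signature: tf.concat(values, axis) -- swap the two arguments:
net = tf.concat([tower_conv, tower_conv1_1, tower_conv2_2, tower_pool_1], 3)
```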

The first is because the checkpoint contains an unused variable "global_step:0", which happens to conflict with a variable that tensorpack defines. In general there is no good way to solve this, and it's best to remove unused variables from a checkpoint. But I'll push a change later to make it behave less strictly in this case (print a warning instead of crashing).
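
If you do want to strip such a variable, a minimal sketch is to read everything else with TF's checkpoint reader and re-save it through a fresh Saver (the file names below are placeholders):

```python
import tensorflow as tf

# Recreate every tensor from the checkpoint except global_step, then re-save.
reader = tf.train.NewCheckpointReader('inception_resnet_v2_2016_08_30.ckpt')
keep = {name: tf.Variable(reader.get_tensor(name), name=name)
        for name in reader.get_variable_to_shape_map()
        if name != 'global_step'}

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    tf.train.Saver(var_list=keep).save(sess, './inception_resnet_v2_stripped.ckpt')
```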

a-maci commented 7 years ago

Thanks. How does one remove a variable from a checkpoint? Do I simply delete that line, or is it more involved?

ppwwyyxx commented 7 years ago

You should be able to load the model with the current HEAD now (don't need to remove the variable).

a-maci commented 7 years ago

I am able to load the checkpoint. I see WRN messages like these for every variable in the graph and in the checkpoint:

...
...
Variable InceptionResnetV2/Conv2d_1a_3x3/weights in the graph not found in checkpoint!
Variable InceptionResnetV2/Conv2d_1a_3x3/BatchNorm/beta in the graph not found in checkpoint!
Variable InceptionResnetV2/Conv2d_1a_3x3/BatchNorm/moving_mean in the graph not found in checkpoint!
Variable InceptionResnetV2/Conv2d_1a_3x3/BatchNorm/moving_variance in the graph not found in checkpoint!
Variable InceptionResnetV2/Conv2d_2a_3x3/weights in the graph not found in checkpoint!
Variable InceptionResnetV2/Conv2d_2a_3x3/BatchNorm/beta in the graph not found in checkpoint!
Variable InceptionResnetV2/Conv2d_2a_3x3/BatchNorm/moving_mean in the graph not found in checkpoint!
Variable InceptionResnetV2/Conv2d_2a_3x3/BatchNorm/moving_variance in the graph not found in checkpoint!

...
...

Variable InceptionResnetV2/Repeat_2/block8_9/Branch_1/Conv2d_0b_1x3/BatchNorm/moving_mean in checkpoint not found in the graph!
Variable InceptionResnetV2/Repeat_2/block8_9/Branch_1/Conv2d_0b_1x3/BatchNorm/moving_variance in checkpoint not found in the graph!
Variable InceptionResnetV2/Repeat_2/block8_9/Branch_1/Conv2d_0b_1x3/weights in checkpoint not found in the graph!
Variable InceptionResnetV2/Repeat_2/block8_9/Branch_1/Conv2d_0c_3x1/BatchNorm/beta in checkpoint not found in the graph!
Variable InceptionResnetV2/Repeat_2/block8_9/Branch_1/Conv2d_0c_3x1/BatchNorm/moving_mean in checkpoint not found in the graph!
Variable InceptionResnetV2/Repeat_2/block8_9/Branch_1/Conv2d_0c_3x1/BatchNorm/moving_variance in checkpoint not found in the graph!
Variable InceptionResnetV2/Repeat_2/block8_9/Branch_1/Conv2d_0c_3x1/weights in checkpoint not found in the graph!
Variable InceptionResnetV2/Repeat_2/block8_9/Conv2d_1x1/biases in checkpoint not found in the graph!
Variable InceptionResnetV2/Repeat_2/block8_9/Conv2d_1x1/weights in checkpoint not found in the graph!
Variable global_step in checkpoint not found in the graph!
...

Is this the right behavior? I don't know whether the checkpoint variables are being loaded/restored properly. Could you comment?

Attached is the log file up to the point where the epochs start.

log.log.txt

ppwwyyxx commented 7 years ago

I took a look at the checkpoint. It seems to be a very old checkpoint format (before TF 0.8, maybe?) where the variable names don't contain the :0 at the end, so there is a mismatch. I'll fix some code to be compatible with that format.

ppwwyyxx commented 7 years ago

After 6640f9bbaa71bf81ecd9c7042d4ce, it can start training with multiple GPUs. You should only see the following variables reported as not found (they are just summaries):

[0216 14:52:07 @sessinit.py:110] WRN Variable learning_rate in the graph not found in checkpoint!
[0216 14:52:08 @sessinit.py:110] WRN Variable tower0/train-error-top1/EMA in the graph not found in checkpoint!
[0216 14:52:08 @sessinit.py:110] WRN Variable tower0/cost/EMA in the graph not found in checkpoint!
[0216 14:52:08 @sessinit.py:110] WRN Variable tower0/total_cost/EMA in the graph not found in checkpoint!
[0216 14:52:08 @sessinit.py:110] WRN Variable input_queue_size/EMA in the graph not found in checkpoint!
[0216 14:52:08 @sessinit.py:110] WRN Variable tower0/train-error-top5/EMA in the graph not found in checkpoint!
[0216 14:52:08 @sessinit.py:110] WRN Variable tower0/regularize_loss/EMA in the graph not found in checkpoint!

Loading checkpoints with type checks (and casting) seems to slow it down a lot, because the data now goes from the checkpoint into Python and then into TF, instead of going to TF directly. I'll do some more tests on the speed.

a-maci commented 7 years ago

Thanks a lot for looking into this and helping out.

On Feb 15, 2017, at 11:15 PM, Yuxin Wu notifications@github.com wrote:

One thing that may cause some training slow-down: Google's training code handles the batch norm updates manually, by only using UPDATE_OPS from the first GPU.

With tensorpack models this was done automatically. But with slim models, UPDATE_OPS from all GPUs will currently be executed. This came from PR #81. Now it looks like applying the UPDATE_OPS blindly is not always what a user would want. @PatWie for comments.

One possible solution is to use slim.arg_scope to change the updates_collections option of slim.batch_norm. But I'm not familiar with slim, so I'm not sure whether it works.
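
A minimal sketch of that idea (standard tf.contrib.slim API; whether the model's own arg_scope and batch_norm parameters override it is exactly the open question above):

```python
import tensorflow as tf
import tensorflow.contrib.slim as slim
from inception_resnet_v2 import inception_resnet_v2  # the slim model file attached in this thread

# Hypothetical usage: updates_collections=None makes slim.batch_norm update its
# moving averages in place instead of adding ops to the UPDATE_OPS collection.
image = tf.placeholder(tf.float32, [None, 299, 299, 3], name='input')
with slim.arg_scope([slim.batch_norm], updates_collections=None):
    logits, end_points = inception_resnet_v2(image, is_training=True)
```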


a-maci commented 7 years ago

What is the batch size that you are using and how many GPUs?

ppwwyyxx commented 7 years ago

Inception consumes a lot of memory. I can run it with a batch size of 32 per GPU but not 64; Google also used this number in their recent papers.

a-maci commented 7 years ago

Closing this.

  1. If you could look into why the checkpoint loading is slow, that would be helpful. This is a performance issue, not a functionality one; I am able to get the experiment running.

  2. I had to change num_classes to 1001 (line 94/95) to get this running. I thought ImageNet has 1000 classes, and the train folder has 1000 folders. Where does the additional class/folder come from?

ppwwyyxx commented 7 years ago

https://github.com/tensorflow/models/tree/master/slim#the-resnet-and-vgg-models-have-1000-classes-but-the-imagenet-dataset-has-1001

a-maci commented 7 years ago

@ppwwyyxx Did you continue the training process for this network?

I was re-training, starting from the checkpoint, and (1) the process is very slow (0.70 it/sec on 8 GPUs) and (2) the error rate is quite bad: 0.98 top-1 and 0.84 top-5 after 2 epochs. The only change I made to the code since the files I attached last was changing 1000 to 1001. The learning rate is 1e-6.

Any idea what is going on?

ppwwyyxx commented 7 years ago

No. Have you made sure that the network takes the same input format (RGB/BGR, value range, etc.) and outputs the same class IDs? They may not be the same as what's used by the code here.

You'd better evaluate the model before training on it.

a-maci commented 7 years ago

I went with your advice and ran an eval on the model. The eval is giving me very bad results: top-1 error 0.98718, top-5 error 0.83158.

I don't know what is going wrong or where. I added the BGR-to-RGB conversion and followed the preprocessing steps that tf/slim does, but it's not helping.

Some related links: Preprocessing: https://github.com/tensorflow/models/blob/master/slim/preprocessing/inception_preprocessing.py#L237-L275
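
For reference, beyond the resize and central crop, the value-range part of that slim eval preprocessing is just a rescale to [-1, 1]; a minimal sketch (assuming the image arrives as uint8 RGB):

```python
import tensorflow as tf

# Assumed input: a uint8 RGB image (resize/central-crop steps omitted here).
image = tf.placeholder(tf.uint8, [299, 299, 3], name='raw_image')

x = tf.image.convert_image_dtype(image, tf.float32)  # uint8 [0, 255] -> float [0, 1]
x = (x - 0.5) * 2.0                                   # rescale to [-1, 1]
```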

Other efforts trying to do eval on IRv2: https://github.com/kentsommer/keras-inception-resnetV2. This discussion also mentions that something is wrong somewhere: https://github.com/kentsommer/keras-inception-resnetV2/issues/1

Could you give this code a try when you get a chance? Thanks.

ppwwyyxx commented 7 years ago

As I said, they may not use the same class IDs: https://github.com/tensorflow/models/blob/master/inception/inception/data/build_imagenet_data.py#L119

a-maci commented 7 years ago

I see now what you are saying. What should I do to fix this?

ppwwyyxx commented 7 years ago

You can convert the label either in the dataflow (you can use MapDataComponent) or inside the model (you can use tf.gather or tf.gather_nd).

The class order I'm using is the same as Caffe's. You can see it at $TENSORPACK_DATASET/ilsvrc_metadata/synset*.txt.
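
A minimal sketch of both routes (the lookup table below is a placeholder built as a simple +1 offset; the real one has to be built by matching the Caffe and slim synset orderings, and index=1 assumes the label is the second component of each datapoint):

```python
import numpy as np
import tensorflow as tf
from tensorpack.dataflow import DataFromList, MapDataComponent

# Placeholder mapping from Caffe-order class ids to the checkpoint's class ids.
# The +1 only accounts for slim's extra background class, not any reordering.
label_map = np.arange(1000) + 1

# Dataflow route: remap the label component (index 1) of each datapoint.
df = DataFromList([[np.zeros((299, 299, 3), dtype='float32'), 42]])
df = MapDataComponent(df, lambda lbl: int(label_map[lbl]), index=1)

# In-graph route: the same remap done with tf.gather inside the model.
label = tf.placeholder(tf.int32, [None], name='label')
mapped_label = tf.gather(tf.constant(label_map, dtype=tf.int32), label)
```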