a-maci closed this issue 7 years ago
I updated to TF 1.0.0 and am getting a different error this time.
Traceback (most recent call last):
File "tpInception_resnet_v2.py", line 204, in <module>
SyncMultiGPUTrainer(config).train()
File "/home/akm/TF/tpNew/tensorpack/tensorpack/train/base.py", line 64, in train
self.setup()
File "/home/akm/TF/tpNew/tensorpack/tensorpack/train/base.py", line 130, in setup
self._setup() # subclass will setup the graph
File "/home/akm/TF/tpNew/tensorpack/tensorpack/train/multigpu.py", line 133, in _setup
self.config.tower, lambda: self._get_cost_and_grad()[1])
File "/home/akm/TF/tpNew/tensorpack/tensorpack/train/multigpu.py", line 41, in _multi_tower_grads
grad_list.append(get_tower_grad_func())
File "/home/akm/TF/tpNew/tensorpack/tensorpack/train/multigpu.py", line 133, in <lambda>
self.config.tower, lambda: self._get_cost_and_grad()[1])
File "/home/akm/TF/tpNew/tensorpack/tensorpack/train/feedfree.py", line 54, in _get_cost_and_grad
self.build_train_tower()
File "/home/akm/TF/tpNew/tensorpack/tensorpack/train/feedfree.py", line 43, in build_train_tower
f()
File "/home/akm/TF/tpNew/tensorpack/tensorpack/train/feedfree.py", line 36, in f
self.model.build_graph(inputs)
File "/home/akm/TF/tpNew/tensorpack/tensorpack/models/model_desc.py", line 113, in build_graph
self._build_graph(model_inputs)
File "tpInception_resnet_v2.py", line 53, in _build_graph
logits, end_points = inception_resnet_v2(image, is_training=is_training)
File "/home/akm/TF/tpNew/tensorpack/examples/IRv2/inception_resnet_v2.py", line 169, in inception_resnet_v2
net = tf.concat(3, [tower_conv, tower_conv1_1, tower_conv2_2, tower_pool_1])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 1047, in concat
dtype=dtypes.int32).get_shape(
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 651, in convert_to_tensor
as_ref=False)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 716, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py", line 176, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py", line 165, in constant
tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/tensor_util.py", line 367, in make_tensor_proto
_AssertCompatible(values, dtype)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/tensor_util.py", line 302, in _AssertCompatible
(dtype.name, repr(mismatch), type(mismatch).__name__))
TypeError: Expected int32, got list containing Tensors of type '_Message' instead.
Prefetch process exited.
Prefetch process exited.
I guess it has something to do with the concat API. I only made minor modifications to the file.
The second problem is because TF 1.0 changed the API of tf.concat: you should swap the argument order of tf.concat in inception_resnet_v2.py.
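For reference, a minimal before/after for the line shown in the traceback (tensor names taken from the traceback above; only the argument order changes):

```python
# Pre-1.0 API: tf.concat(axis, values)
net = tf.concat(3, [tower_conv, tower_conv1_1, tower_conv2_2, tower_pool_1])

# TF 1.0 API: tf.concat(values, axis)
net = tf.concat([tower_conv, tower_conv1_1, tower_conv2_2, tower_pool_1], 3)
```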
The first is because the checkpoint contains an unused variable "global_step:0", which happens to conflict with a variable tensorpack defines. In general there is no good way to solve this, and it's best to remove unused variables from a checkpoint. But I'll push a change later to make it behave less strictly in this case (print a warning instead of crashing).
Thanks. How does one remove a variable from a checkpoint? Do I simply delete that line, or is it more involved?
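(For reference, one generic way to do this with TF 1.x, as a hedged sketch rather than anything tensorpack provides: copy the checkpoint while skipping the unwanted variable.)

```python
# Hypothetical sketch (TF 1.x): write a copy of a checkpoint without certain
# variables, e.g. "global_step". Paths and the drop list are placeholders.
import tensorflow as tf

def strip_variables(src_ckpt, dst_ckpt, drop=("global_step",)):
    reader = tf.train.NewCheckpointReader(src_ckpt)
    with tf.Graph().as_default(), tf.Session() as sess:
        kept = {}
        for name in reader.get_variable_to_shape_map():
            if name in drop:
                continue
            # Re-create each remaining variable with its stored value.
            kept[name] = tf.Variable(reader.get_tensor(name), name=name)
        sess.run(tf.global_variables_initializer())
        tf.train.Saver(var_list=kept).save(sess, dst_ckpt)
```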
You should be able to load the model with the current HEAD now (don't need to remove the variable).
I am able to load the checkpoint. I see WRN messages like these for every variable in the graph:
...
...
Variable InceptionResnetV2/Conv2d_1a_3x3/weights in the graph not found in checkpoint!
Variable InceptionResnetV2/Conv2d_1a_3x3/BatchNorm/beta in the graph not found in checkpoint!
Variable InceptionResnetV2/Conv2d_1a_3x3/BatchNorm/moving_mean in the graph not found in checkpoint!
Variable InceptionResnetV2/Conv2d_1a_3x3/BatchNorm/moving_variance in the graph not found in checkpoint!
Variable InceptionResnetV2/Conv2d_2a_3x3/weights in the graph not found in checkpoint!
Variable InceptionResnetV2/Conv2d_2a_3x3/BatchNorm/beta in the graph not found in checkpoint!
Variable InceptionResnetV2/Conv2d_2a_3x3/BatchNorm/moving_mean in the graph not found in checkpoint!
Variable InceptionResnetV2/Conv2d_2a_3x3/BatchNorm/moving_variance in the graph not found in checkpoint!
...
...
Variable InceptionResnetV2/Repeat_2/block8_9/Branch_1/Conv2d_0b_1x3/BatchNorm/moving_mean in checkpoint not found in the graph!
Variable InceptionResnetV2/Repeat_2/block8_9/Branch_1/Conv2d_0b_1x3/BatchNorm/moving_variance in checkpoint not found in the graph!
Variable InceptionResnetV2/Repeat_2/block8_9/Branch_1/Conv2d_0b_1x3/weights in checkpoint not found in the graph!
Variable InceptionResnetV2/Repeat_2/block8_9/Branch_1/Conv2d_0c_3x1/BatchNorm/beta in checkpoint not found in the graph!
Variable InceptionResnetV2/Repeat_2/block8_9/Branch_1/Conv2d_0c_3x1/BatchNorm/moving_mean in checkpoint not found in the graph!
Variable InceptionResnetV2/Repeat_2/block8_9/Branch_1/Conv2d_0c_3x1/BatchNorm/moving_variance in checkpoint not found in the graph!
Variable InceptionResnetV2/Repeat_2/block8_9/Branch_1/Conv2d_0c_3x1/weights in checkpoint not found in the graph!
Variable InceptionResnetV2/Repeat_2/block8_9/Conv2d_1x1/biases in checkpoint not found in the graph!
Variable InceptionResnetV2/Repeat_2/block8_9/Conv2d_1x1/weights in checkpoint not found in the graph!
Variable global_step in checkpoint not found in the graph!
...
Is this the right behavior? I don't know whether the checkpoint variables are being loaded/restored properly. Could you comment?
Attached is the log file till the point the epochs start.
I took a look at the checkpoint. It seems to be a very old checkpoint format (before TF 0.8, maybe?) where the variable names don't contain the :0 at the end, so there is a mismatch.
I'll fix some code to be compatible with that format.
After 6640f9bbaa71bf81ecd9c7042d4ce, it can start multi-GPU training. You should only see the following variables reported as not found (just some summaries):
[0216 14:52:07 @sessinit.py:110] WRN Variable learning_rate in the graph not found in checkpoint!
[0216 14:52:08 @sessinit.py:110] WRN Variable tower0/train-error-top1/EMA in the graph not found in checkpoint!
[0216 14:52:08 @sessinit.py:110] WRN Variable tower0/cost/EMA in the graph not found in checkpoint!
[0216 14:52:08 @sessinit.py:110] WRN Variable tower0/total_cost/EMA in the graph not found in checkpoint!
[0216 14:52:08 @sessinit.py:110] WRN Variable input_queue_size/EMA in the graph not found in checkpoint!
[0216 14:52:08 @sessinit.py:110] WRN Variable tower0/train-error-top5/EMA in the graph not found in checkpoint!
[0216 14:52:08 @sessinit.py:110] WRN Variable tower0/regularize_loss/EMA in the graph not found in checkpoint!
Loading checkpoints with type checks (and casting) seems to slow it down a lot, because it now loads data from the checkpoint into Python and then into TF, instead of going to TF directly. I'll do some other tests on the speed.
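A generic illustration of the two loading paths described above (this is not tensorpack's actual sessinit code):

```python
import tensorflow as tf

# Path 1: restore directly inside TF; the checkpoint data never leaves the TF runtime.
#   saver = tf.train.Saver()
#   saver.restore(sess, ckpt_path)

# Path 2: pull each tensor into Python first (to check/cast its dtype),
# then push it back into TF -- one extra round trip per variable.
def restore_via_python(sess, ckpt_path, variables):
    reader = tf.train.NewCheckpointReader(ckpt_path)
    for var in variables:
        name = var.op.name
        if reader.has_tensor(name):
            value = reader.get_tensor(name)  # checkpoint -> numpy (Python side)
            value = value.astype(var.dtype.base_dtype.as_numpy_dtype)  # cast if needed
            sess.run(var.assign(value))      # numpy -> TF
```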
Thanks a lot for looking into this and helping out.
On Feb 15, 2017, at 11:15 PM, Yuxin Wu notifications@github.com wrote:
One thing that may cause some training slow-down: Google's training code handles the batch-norm updates manually, by only using UPDATE_OPS from the first GPU.
With tensorpack models this was done automatically. But with slim models, UPDATE_OPS from all GPUs will currently be executed. This came from PR #81. Now it looks like "applying the UPDATE_OPS blindly" is not always what a user would want. @PatWie for comments.
One possible solution is to use slim.arg_scope to change the updates_collections option of slim.batch_norm. But I'm not familiar with slim, so I'm not sure whether it works.
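A minimal sketch of that suggestion, assuming the tower-building code calls the inception_resnet_v2 function from the attached slim model file. With updates_collections=None, slim.batch_norm applies its moving-average updates in place instead of adding ops to tf.GraphKeys.UPDATE_OPS, so no per-tower update ops accumulate:

```python
import tensorflow as tf
import tensorflow.contrib.slim as slim
from inception_resnet_v2 import inception_resnet_v2  # the attached slim model

def build_tower(image, is_training):
    # Force in-place moving-average updates for every batch_norm in this scope.
    with slim.arg_scope([slim.batch_norm], updates_collections=None):
        logits, end_points = inception_resnet_v2(image, is_training=is_training)
    return logits, end_points
```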
What is the batch size that you are using and how many GPUs?
Inception consumes a lot of memory. I can run it with batch 32 per GPU but not 64; Google also used this number in their recent papers.
Closing this.
If you could look into why the checkpoint loading is slow, that would be helpful. This is a performance issue, not a functional one. I am able to get the experiment running.
I had to change num_classes to 1001 (line 94/95) to get this running. I thought ImageNet has 1000 classes, and the train folder has 1000 folders. Where does the additional class/folder come from?
@ppwwyyxx Did you continue the training process for this network?
I was re-training, starting from the checkpoint, and (1) the process is very slow, 0.70 it/sec on 8 GPUs, and (2) the error rate is quite bad: 0.98 top-1 and 0.84 top-5 after 2 epochs. The only change I made to the code since the files I attached last time was changing 1000 to 1001. The learning rate is 1e-6.
Any idea what is going on?
No. Have you made sure that the network takes the same input format (RGB/BGR, value range, etc.) and output class IDs? They may not be the same as what's used by the code here.
You'd better eval the model before training on it.
I went with your advice of doing an eval on the model. The eval is giving me very bad results: Top-1 error 0.98718, Top-5 error 0.83158.
I don't know what is going wrong, or where. I added the BGR-to-RGB conversion and followed the preprocessing steps that tf/slim does, but it's not helping.
Some related links:
Preprocessing: https://github.com/tensorflow/models/blob/master/slim/preprocessing/inception_preprocessing.py#L237-L275
Other efforts trying to do eval on IRv2: https://github.com/kentsommer/keras-inception-resnetV2
This discussion also mentions that something is wrong somewhere: https://github.com/kentsommer/keras-inception-resnetV2/issues/1
Could you give this code a try when you get a chance? Thanks.
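For reference, the eval-time path in the linked inception_preprocessing.py boils down to roughly the following (a sketch, not copied verbatim from that file; it assumes a uint8 RGB image tensor):

```python
import tensorflow as tf

def inception_eval_preprocess(image, height=299, width=299):
    # uint8 [0, 255] -> float32 [0, 1]
    image = tf.image.convert_image_dtype(image, dtype=tf.float32)
    # keep the central 87.5% of the image, then resize to the network input size
    image = tf.image.central_crop(image, central_fraction=0.875)
    image = tf.expand_dims(image, 0)
    image = tf.image.resize_bilinear(image, [height, width], align_corners=False)
    image = tf.squeeze(image, [0])
    # scale to [-1, 1], which is what the Inception checkpoints expect
    image = tf.subtract(image, 0.5)
    image = tf.multiply(image, 2.0)
    return image
```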
As I said, they may not use the same class IDs: https://github.com/tensorflow/models/blob/master/inception/inception/data/build_imagenet_data.py#L119
I see now what you are saying. What should I do to fix this?
You can convert the label either in the dataflow (you can use MapDataComponent) or inside the model (you can use tf.gather or tf.gather_nd).
The class order I'm using is the same as Caffe's. You can see it at $TENSORPACK_DATASET/ilsvrc_metadata/synset*.txt
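A rough sketch of both options, where caffe_to_slim is a hypothetical lookup table; the real values have to be derived by matching the synset order in $TENSORPACK_DATASET/ilsvrc_metadata/synset*.txt against the ordering used by the slim checkpoint (which reserves class 0 for background):

```python
import numpy as np
import tensorflow as tf
from tensorpack.dataflow import MapDataComponent

# Placeholder mapping only; see the note above for how the real one must be built.
caffe_to_slim = np.arange(1, 1001, dtype=np.int64)

# Option 1: remap the label in the dataflow (component 1 of each datapoint
# from the ILSVRC12 dataflow is the label).
# ds = MapDataComponent(ds, lambda lbl: int(caffe_to_slim[lbl]), index=1)

# Option 2: remap inside the model with tf.gather.
def remap_label(label):
    table = tf.constant(caffe_to_slim)
    return tf.gather(table, label)
```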
Not an issue specifically with tensorpack.
I was trying to load a checkpoint model for Inception-ResNet-v2 (uploaded here). I re-used the tf.slim model for this network mentioned here and used tensorpack as the front end. Attached are both of these files (in text format, since I can't upload .py files).
I am running into problems restoring the graph variables. Here is the last portion of the error log:
This error is strange, since I don't have a tensor called global_step. I even tried changing the dtype of GLOBAL_STEP_OP_NAME in tfutils/common.py to tf.int64, but I get some other error with it.
Any suggestions/pointers as to what I am doing wrong and/or how to fix this? By the way, I am still using TF version 0.12.0rc1; I'm not sure whether going to 1.0.0 would fix this problem.
run command: python tpInception_resnet_v2.py --gpu 0,1,2,3,4,5,6,7 --data /tank/imagenet-tensorpack-data --load googleModel/inception_resnet_v2_2016_08_30.ckpt