Closed kcheng999 closed 7 years ago
There are some variable-reuse problems with the multi-GPU support. I fixed them before, but the performance didn't increase (it even decreased with multiple GPUs), so I just left it like this. I think the better choice is to use a single GPU for computing. If you really want to use multiple GPUs, you need to fix the variable-reuse problem first and then figure out the performance problem. Check my fork of this project; maybe it could help you better understand the project. Cheers. @kcheng999
I think when you mention weight decay you are talking about this. Sorry for the confusing word "performance"; what I mean here is the training speed, not the accuracy. The training speed decreases when I apply multiple GPUs. @kcheng999
Thank you. But I am still confused about the speed. For example, when I use a single GPU, I set batch_size=20. If I use 2 GPUs, I can process 40 clips together. In theory, the speed should double, but in fact it is only about 1.8 times that of a single GPU. Do you mean that the speed is even slower than twice the single-GPU speed? @frankgu
Splitting the batch across GPUs is feasible; however, we have to think about how to merge the training results together. For example, while GPU 1 is training on its 20 samples and modifying our model, GPU 2 is also changing our model. How can we handle this conflict?
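One common answer to this conflict is synchronous data parallelism: every tower computes gradients on its own slice of the batch, the gradients are averaged, and a single update is applied, so no two GPUs ever write conflicting updates. Here is a minimal plain-Python sketch of that bookkeeping (function names are illustrative, not from the repository):

```python
def split_batch(batch, num_towers):
    """Give each tower an equal, contiguous slice of the batch."""
    per_tower = len(batch) // num_towers
    return [batch[i * per_tower:(i + 1) * per_tower]
            for i in range(num_towers)]

def average_gradients(tower_gradients):
    """Average one gradient value per variable across all towers.
    tower_gradients[t][v] is tower t's gradient for variable v."""
    num_towers = len(tower_gradients)
    return [sum(grads) / num_towers for grads in zip(*tower_gradients)]
```

TensorFlow's official multi-GPU CIFAR-10 example uses the same pattern: the averaged gradients are computed on the CPU and applied once per step, so the towers never race on the shared weights.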
Refer to this structure: https://www.tensorflow.org/images/Parallelism.png I think your code follows the same structure. I have read the code at https://github.com/hx173149/C3D-tensorflow and have not found any mistakes. I wonder about the variable-reuse problems you mentioned before, because I didn't find them. Could you please tell me how the variable-reuse problem occurs in that code? @frankgu
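For context on how a variable-reuse problem shows up at all: `tf.get_variable` keeps one registry of variables per scope, and every tower after the first must enter the scope with reuse enabled to fetch the existing weights instead of raising an error. A toy plain-Python analogue of that registry (this is an illustration of the semantics, not the actual TF implementation):

```python
# One shared registry, like a variable scope's variable store.
_variables = {}

def get_variable(name, initializer, reuse):
    """Mimic tf.get_variable: create on first use, reuse afterwards.
    With reuse=False a second creation fails; with reuse=True a lookup
    of a nonexistent variable fails -- the two errors seen in practice."""
    if name in _variables:
        if not reuse:
            raise ValueError("Variable %s already exists" % name)
        return _variables[name]
    if reuse:
        raise ValueError("Variable %s does not exist" % name)
    _variables[name] = initializer()
    return _variables[name]
```

If the second tower builds its layers without reuse enabled, the first branch fires; if reuse is enabled but a variable was created under a different scope name, the second branch fires.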
I will try to fix the reuse problem in my fork and notify you when I am done. Cheers. @kcheng999
Have a try with my fork. @kcheng999
Thank you so much! I will try your code on my datasets. @frankgu
I tested your new code and found a restore bug.
In your code, you write `conv3 = tf.concat((pool2, conv3), 4)`, so here conv3 has 384 channels and conv4 has 768 channels. But in the sports1M pretrained model, conv3 has 256 channels and conv4 has 512 channels. They don't match.
Therefore, I get an error:
InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [768] rhs shape= [512] [[Node: save_1/Assign_4 = Assign[T=DT_FLOAT, _class=["loc:@c3d_var/conv4/biases_a"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/cpu:0"](c3d_var/conv4/biases_a, save_1/RestoreV2_4)]]
So could you please tell me how you use the pretrained model? @frankgu
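When a model diverges from the pretrained graph like this, a common workaround is to restore only the variables whose shapes still match, e.g. by building a `var_list` for `tf.train.Saver` from the shapes reported by `tf.train.NewCheckpointReader(...).get_variable_to_shape_map()`. A hedged sketch of just the shape-matching step, on plain name-to-shape dicts (names and shapes below are illustrative):

```python
def restorable_variables(model_shapes, checkpoint_shapes):
    """Return the names present in both the model and the checkpoint
    with identical shapes; mismatched layers (e.g. a widened conv3/conv4)
    are left out and must be trained from scratch."""
    return sorted(name for name, shape in model_shapes.items()
                  if checkpoint_shapes.get(name) == shape)
```

The resulting names would then be mapped back to the actual `tf.Variable` objects and passed as `tf.train.Saver(var_list=...)`, so the restore no longer hits the `lhs shape= [768] rhs shape= [512]` mismatch.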
I deleted the concatenate operation and ran your code, but I found another bug. @frankgu The code works well when gpu_num = 1 or 2, but when I set gpu_num > 2, I get an error. For example, with gpu_num = 8 the error is:
NotFoundError (see above for traceback): Key tower_7/c3d_var/conv2/weight_loss/loss not found in checkpoint [[Node: save/RestoreV2_183 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_183/tensor_names, save/RestoreV2_183/shape_and_slices)]] [[Node: save/RestoreV2_91/_207 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_1021_save/RestoreV2_91", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
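The `tower_7/...` key in that error suggests per-tower variable names leaked into the save/restore path; shared weights are normally saved once under their tower-free name. One illustrative fix is to map each tower-prefixed name back to the shared checkpoint key when building the restore `var_list` (the helper name here is mine, not from the repository):

```python
def strip_tower_prefix(name):
    """Map a per-tower variable name back to the shared checkpoint key,
    e.g. 'tower_7/c3d_var/conv2/w' -> 'c3d_var/conv2/w'."""
    parts = name.split("/")
    if parts and parts[0].startswith("tower_"):
        return "/".join(parts[1:])
    return name
```

With a name map like this passed to `tf.train.Saver({strip_tower_prefix(v.op.name): v ...})`, every tower restores from the same checkpoint entry instead of looking for keys like `tower_7/c3d_var/conv2/weight_loss/loss` that were never saved.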
@frankgu Hi Frank, thank you for providing the fork. I tried to run train_c3d.py in your fork, but I hit the following error. Could you please check it?
F tensorflow/core/framework/tensor_shape.cc:172] Check failed: size >= 0 (-9619537920 vs. 0)
Thanks a lot.
Hi @joefang66, you can try the latest code once more; it supports TF 1.2 now.
I used your code on my own dataset. It works well when gpu_num=1, but when I set gpu_num=2, I get an error:
Traceback (most recent call last):
File "train_c3d_ucf101.py", line 344, in <module>
tf.app.run()
File "/mnt/xfs1/home/zhangyifan/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "train_c3d_ucf101.py", line 341, in main
run_training()
File "train_c3d_ucf101.py", line 202, in run_training
labels_placeholder[gpu_index * FLAGS.batch_size:(gpu_index + 1) * FLAGS.batch_size]
File "train_c3d_ucf101.py", line 97, in tower_loss
loss_averages_op = loss_averages.apply(losses + [total_loss])
File "/mnt/xfs1/home/zhangyifan/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/moving_averages.py", line 375, in apply
colocate_with_primary=(var.op.type in ["Variable", "VariableV2"]))
File "/mnt/xfs1/home/zhangyifan/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/slot_creator.py", line 174, in create_zeros_slot
colocate_with_primary=colocate_with_primary)
File "/mnt/xfs1/home/zhangyifan/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/slot_creator.py", line 149, in create_slot_with_initializer
dtype)
File "/mnt/xfs1/home/zhangyifan/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/slot_creator.py", line 66, in _create_slot_var
validate_shape=validate_shape)
File "/mnt/xfs1/home/zhangyifan/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 1049, in get_variable
use_resource=use_resource, custom_getter=custom_getter)
File "/mnt/xfs1/home/zhangyifan/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 948, in get_variable
use_resource=use_resource, custom_getter=custom_getter)
File "/mnt/xfs1/home/zhangyifan/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 356, in get_variable
validate_shape=validate_shape, use_resource=use_resource)
File "/mnt/xfs1/home/zhangyifan/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 341, in _true_getter
use_resource=use_resource)
File "/mnt/xfs1/home/zhangyifan/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 671, in _get_single_variable
"VarScope?" % name)
ValueError: Variable IVA-research_1/var_name/weight_loss/loss/ does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=None in VarScope?
I am really confused; could you please help me? @hx173149 @frankgu
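The `IVA-research_1/...` prefix in the error is a hint: entering `tf.variable_scope` with the same name a second time gets a uniquified scope (`name_1`), so the second tower looks up its variables under a fresh prefix that was never created. A tiny plain-Python illustration of that uniquification behaviour (an assumption about what TF does to duplicate scope names, shown for intuition only):

```python
def unique_scope_name(base, existing):
    """Mimic scope uniquification: 'IVA-research' -> 'IVA-research_1'
    when the base name is already taken, '_2' after that, and so on."""
    if base not in existing:
        return base
    i = 1
    while "%s_%d" % (base, i) in existing:
        i += 1
    return "%s_%d" % (base, i)
```

The usual fix is to build all towers inside a single variable scope and call `tf.get_variable_scope().reuse_variables()` after the first tower, so every tower shares one prefix and the moving-average machinery can find `var_name/weight_loss/loss` again.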