hx173149 / C3D-tensorflow

C3D is a modified version of BVLC tensorflow to support 3D ConvNets.
MIT License

Error when using multi-GPU #21

Closed kcheng999 closed 7 years ago

kcheng999 commented 7 years ago

I used your code on my own dataset. It works well when gpu_num=1, but when I set gpu_num=2 I get an error:

Traceback (most recent call last):
  File "train_c3d_ucf101.py", line 344, in <module>
    tf.app.run()
  File "/mnt/xfs1/home/zhangyifan/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "train_c3d_ucf101.py", line 341, in main
    run_training()
  File "train_c3d_ucf101.py", line 202, in run_training
    labels_placeholder[gpu_index * FLAGS.batch_size:(gpu_index + 1) * FLAGS.batch_size]
  File "train_c3d_ucf101.py", line 97, in tower_loss
    loss_averages_op = loss_averages.apply(losses + [total_loss])
  File "/mnt/xfs1/home/zhangyifan/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/moving_averages.py", line 375, in apply
    colocate_with_primary=(var.op.type in ["Variable", "VariableV2"]))
  File "/mnt/xfs1/home/zhangyifan/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/slot_creator.py", line 174, in create_zeros_slot
    colocate_with_primary=colocate_with_primary)
  File "/mnt/xfs1/home/zhangyifan/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/slot_creator.py", line 149, in create_slot_with_initializer
    dtype)
  File "/mnt/xfs1/home/zhangyifan/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/slot_creator.py", line 66, in _create_slot_var
    validate_shape=validate_shape)
  File "/mnt/xfs1/home/zhangyifan/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 1049, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
  File "/mnt/xfs1/home/zhangyifan/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 948, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
  File "/mnt/xfs1/home/zhangyifan/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 356, in get_variable
    validate_shape=validate_shape, use_resource=use_resource)
  File "/mnt/xfs1/home/zhangyifan/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 341, in _true_getter
    use_resource=use_resource)
  File "/mnt/xfs1/home/zhangyifan/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 671, in _get_single_variable
    "VarScope?" % name)
ValueError: Variable IVA-research_1/var_name/weight_loss/loss/ does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=None in VarScope?

I am really confused. Could you please help me? @hx173149 @frankgu

gudongfeng commented 7 years ago

There are some variable reuse problems with the multi-GPU support. I fixed them before, but the performance didn't improve (it even decreased with multiple GPUs), so I just left it as is. I think the better choice is to use a single GPU for computing. If you really want to use multiple GPUs, you need to fix the variable reuse problem first and then figure out the performance problem. Check my fork of this project; maybe it will help you understand the project better. Cheers. @kcheng999
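For context, the ValueError above is the classic symptom of each GPU tower trying to create its own copy of variables that should be shared. A minimal sketch of the usual fix (not this repo's exact code; `tower_loss` and the placeholder slicing are assumed to resemble train_c3d_ucf101.py) is to build all towers inside one variable scope and turn on reuse after the first tower:

```python
import tensorflow as tf

def build_towers(images, labels, gpu_num, batch_size, tower_loss):
    """Build one loss per GPU while sharing a single set of weights."""
    tower_losses = []
    with tf.variable_scope(tf.get_variable_scope()):
        for i in range(gpu_num):
            with tf.device('/gpu:%d' % i), tf.name_scope('tower_%d' % i):
                start, end = i * batch_size, (i + 1) * batch_size
                loss = tower_loss(images[start:end], labels[start:end])
                # Later towers must reuse the tf.get_variable() weights created
                # by tower 0; otherwise apply()/restore look for per-tower copies.
                tf.get_variable_scope().reuse_variables()
                tower_losses.append(loss)
    return tower_losses
```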

gudongfeng commented 7 years ago

I think when you mention weight decay you are talking about this. Sorry for the confusing word "performance"; what I meant was the training speed, not the accuracy. The training speed decreases when I use multiple GPUs. @kcheng999

kcheng999 commented 7 years ago

Thank you. But I am still confused about the speed. For example, with a single GPU I set batch_size=20; with 2 GPUs I can process 40 clips together. In theory the speed should double, but in practice it is only about 1.8 times that of a single GPU. Do you mean the speed is less than twice that of a single GPU? @frankgu

gudongfeng commented 7 years ago

Splitting the batch across GPUs is feasible; however, we have to think about how to merge the training results. For example, while GPU 1 is training on its 20 clips and modifying the model, GPU 2 is also changing the model. How can we handle this conflict?
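The usual answer to this conflict is synchronous data parallelism, the pattern in TensorFlow's CIFAR-10 multi-GPU example (a sketch under that assumption, not necessarily this repo's exact code): each GPU computes gradients on its shard, the gradients are averaged, and a single optimizer step updates the shared weights, so the towers never race against each other.

```python
import tensorflow as tf

def average_gradients(tower_grads):
    """tower_grads: one list of (grad, var) pairs per GPU, same var order."""
    averaged = []
    for grads_and_vars in zip(*tower_grads):
        # Stack the per-GPU gradients for this variable and take their mean.
        grads = [tf.expand_dims(g, 0) for g, _ in grads_and_vars]
        mean_grad = tf.reduce_mean(tf.concat(grads, 0), 0)
        averaged.append((mean_grad, grads_and_vars[0][1]))  # shared variable
    return averaged

# usage sketch:
# opt = tf.train.AdamOptimizer(1e-4)
# tower_grads = [opt.compute_gradients(loss) for loss in tower_losses]
# train_op = opt.apply_gradients(average_gradients(tower_grads))
```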

kcheng999 commented 7 years ago

Refer to this structure: https://www.tensorflow.org/images/Parallelism.png I think your code follows the same structure. I have read the code at https://github.com/hx173149/C3D-tensorflow and have not found any mistakes. I am curious about the variable reuse problems you mentioned before, because I didn't find them. Could you please tell me how the variable reuse problem occurs in that code? @frankgu

gudongfeng commented 7 years ago

I will try to fix the reuse problem in my fork and notify you when I am done. Cheers. @kcheng999

gudongfeng commented 7 years ago

Have a try with my fork, @kcheng999.

kcheng999 commented 7 years ago

Thank you so much! I will try your code on my datasets. @frankgu

kcheng999 commented 7 years ago

I tested your new code and found a restore bug. In your code you write conv3 = tf.concat((pool2, conv3), 4), so conv3 has 384 channels and conv4 has 768 channels. But in the sports1M pretrained model, conv3 has 256 channels and conv4 has 512 channels. They don't match, so I get an error:

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [768] rhs shape= [512]
	 [[Node: save_1/Assign_4 = Assign[T=DT_FLOAT, _class=["loc:@c3d_var/conv4/biases_a"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/cpu:0"](c3d_var/conv4/biases_a, save_1/RestoreV2_4)]]

So could you please tell me how you use the pretrained model? @frankgu
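One common way to load pretrained weights despite such a shape mismatch is to restore only the checkpoint variables whose names and shapes still match the current graph, and train the changed layers (here conv3/conv4) from scratch. A minimal sketch, assuming TF 1.x and a standard checkpoint file:

```python
import tensorflow as tf

def restore_matching(sess, ckpt_path):
    """Restore only variables that exist in the checkpoint with the same shape."""
    reader = tf.train.NewCheckpointReader(ckpt_path)
    ckpt_shapes = reader.get_variable_to_shape_map()
    restorable = [v for v in tf.global_variables()
                  if v.op.name in ckpt_shapes
                  and v.get_shape().as_list() == ckpt_shapes[v.op.name]]
    tf.train.Saver(var_list=restorable).restore(sess, ckpt_path)
```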

kcheng999 commented 7 years ago

I deleted the concatenate operation and ran your code, but I found another bug. @frankgu The code works well when gpu_num=1 or 2, but when I set gpu_num>2 I get an error. For example, with gpu_num=8 the error is:

NotFoundError (see above for traceback): Key tower_7/c3d_var/conv2/weight_loss/loss not found in checkpoint
	 [[Node: save/RestoreV2_183 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_183/tensor_names, save/RestoreV2_183/shape_and_slices)]]
	 [[Node: save/RestoreV2_91/_207 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_1021_save/RestoreV2_91", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
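A hedged reading of this error: the graph contains per-tower variable copies (note the tower_7/ prefix) that a checkpoint saved from fewer towers never contained, which again points at variables not being shared across towers. Besides fixing the sharing itself (see the reuse sketch earlier in this thread), one illustrative workaround is to map each graph variable back to its tower-free checkpoint name when building the Saver; the helper below is hypothetical, not part of the repo:

```python
import re
import tensorflow as tf

def tower_free_saver():
    """Saver that looks up each variable under its tower-stripped name."""
    name_map = {}
    for v in tf.global_variables():
        ckpt_name = re.sub(r'^tower_\d+/', '', v.op.name)
        name_map.setdefault(ckpt_name, v)  # keep one copy per weight
    return tf.train.Saver(var_list=name_map)

# usage sketch:
# saver = tower_free_saver()
# saver.restore(sess, ckpt_path)
```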

joefang66 commented 7 years ago

@frankgu Hi Frank, thank you for providing the fork. I tried to run train_c3d.py from your fork, but met the following error. Could you please check it?

F tensorflow/core/framework/tensor_shape.cc:172] Check failed: size >= 0 (-9619537920 vs. 0)

Thanks a lot.

hx173149 commented 7 years ago

Hi @joefang66, you can try the latest code once more; it now supports TF 1.2.