Memory Cost Increase

holyseven / PSPNet-TF-Reproduce

Training PSPNet in Tensorflow. Reproduce the performance from the paper.

MIT License

125 stars, 30 forks

Memory Cost Increase #21

Closed: tabrisweapon closed this issue 5 years ago

tabrisweapon commented 5 years ago

I tried the sync BN proposed in your code and found that the memory cost increases tremendously. My setup has 4 Titan X GPUs, which can fit a batch size of 5 per GPU when using DeepLabv3+, but I can only fit a batch size of 1 per GPU after adopting sync BN. The running time also increases.

Could this be caused by the absence of NVLink on my servers?

holyseven commented 5 years ago

batch size 5 -> 1

I am not sure which implementation you are using. The batch normalization implemented here is tf.nn.batch_normalization, which is slower and uses more memory than tf.nn.fused_batch_norm. But a drop from 5 to 1 is still very strange.

Just for info, in case you have not already done this: call optimizer.minimize(cost, self.global_step, colocate_gradients_with_ops=True). colocate_gradients_with_ops: "If True, try colocating gradients with the corresponding op."

"Also the running time is increased also."

This is probably caused by both the absence of NVLink and the BN implementation. Communication between GPUs is in any case slower than operations within a single GPU.

tabrisweapon commented 5 years ago

Thanks for the reply!

I didn't use the minimize API, since the original official DeepLab code collects gradients manually. I have put my code on GitHub to show my solution in detail:

https://github.com/tabrisweapon/official_deeplab_sync

The code is modified from the official DeepLab implementation. For your convenience: sync_train.py is the entry point, and the three other files in the sync folder are helpers. Could you please tell me what the problem is? Is it because my server does not have nccl2, as mentioned in your README?

BTW, I am trying another style of solution, as shown in my README, which uses tf.contrib.nccl as the reduce function.

holyseven commented 5 years ago

I don't use slim, so I am not sure whether something is different there.

But what I found is this: here, the current method computes the loss of each sample and averages the gradients. That works well if the samples are independent. However, if you use the sync solution with a list input, the samples become dependent on each other through BN. This makes the gradient computation very complicated and increases the memory usage.

So it is better to average the loss and then call compute_gradients(colocate_gradients_with_ops=True) once to obtain the gradients directly.
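To illustrate why sync BN couples the samples, here is a minimal pure-Python sketch. It is not the code from this repository: the function names (`local_sums`, `synced_moments`, `normalize`) and the toy activations are assumptions made for the example. It combines per-device sums into one global mean and variance, which is the basic idea behind synchronized batch statistics.

```python
from typing import List, Tuple


def local_sums(values: List[float]) -> Tuple[float, float, int]:
    """Return (sum, sum of squares, count) for one device's activations."""
    return sum(values), sum(v * v for v in values), len(values)


def synced_moments(per_device: List[List[float]]) -> Tuple[float, float]:
    """Combine per-device sums into one global mean and biased variance."""
    total = 0.0
    total_sq = 0.0
    count = 0
    for values in per_device:
        s, sq, n = local_sums(values)
        total += s
        total_sq += sq
        count += n
    mean = total / count
    variance = total_sq / count - mean * mean  # E[x^2] - (E[x])^2
    return mean, variance


def normalize(values: List[float], mean: float, variance: float,
              eps: float = 1e-5) -> List[float]:
    """Normalize one device's activations with the shared statistics."""
    return [(v - mean) / (variance + eps) ** 0.5 for v in values]


# Hypothetical activations on two devices.
gpu0 = [1.0, 2.0, 3.0]
gpu1 = [5.0, 7.0, 9.0]

mean, var = synced_moments([gpu0, gpu1])
out_a = normalize(gpu0, mean, var)

# Change only the data on the second device and normalize gpu0 again.
mean_b, var_b = synced_moments([gpu0, [50.0, 70.0, 90.0]])
out_b = normalize(gpu0, mean_b, var_b)

# gpu0's outputs differ although gpu0's inputs did not change.
print(out_a != out_b)
```

Because the output on one device depends on the data of every device, a loss computed on one device has gradients that flow back through the activations of all the others, which is consistent with the extra memory use described above.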
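For the independent case, the two strategies give the same result, which a toy example can check. The sketch below is pure Python and assumes a hypothetical one-parameter model y = w * x with a squared-error loss; none of these names come from this repository or from DeepLab. It compares "one gradient per tower, then average" with "average the losses, then differentiate once".

```python
from typing import Callable, List, Tuple

Batch = List[Tuple[float, float]]  # list of (x, y) pairs


def tower_loss(w: float, batch: Batch) -> float:
    """Mean squared error of y = w * x on one tower's batch."""
    return sum((w * x - y) ** 2 for x, y in batch) / len(batch)


def tower_grad(w: float, batch: Batch) -> float:
    """Analytic derivative of tower_loss with respect to w."""
    return sum(2.0 * x * (w * x - y) for x, y in batch) / len(batch)


def numeric_grad(f: Callable[[float], float], w: float,
                 h: float = 1e-6) -> float:
    """Central finite difference, used only as an independent check."""
    return (f(w + h) - f(w - h)) / (2.0 * h)


# Two hypothetical towers with independent data.
towers: List[Batch] = [
    [(1.0, 2.0), (2.0, 3.0)],
    [(3.0, 7.0), (4.0, 9.0)],
]
w = 0.5

# Option 1: one gradient per tower, then average the gradients.
avg_of_grads = sum(tower_grad(w, b) for b in towers) / len(towers)


# Option 2: average the losses, then differentiate once.
def mean_loss(w_: float) -> float:
    return sum(tower_loss(w_, b) for b in towers) / len(towers)


grad_of_avg = numeric_grad(mean_loss, w)

# The two values agree up to finite-difference error.
print(abs(avg_of_grads - grad_of_avg) < 1e-4)
```

With independent towers the two options are mathematically identical, so nothing is lost by differentiating the averaged loss once. With sync BN each tower's loss also depends on the other towers' inputs, so computing a separate gradient per tower repeats backpropagation through the shared statistics, while a single gradient of the averaged loss does that work only once.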

tabrisweapon commented 5 years ago

Thank you again for your suggestion! I'll give it a try in a few days. Currently the GPUs are occupied by others, sad story.

GWwangshuo commented 5 years ago

Thanks for your work. I have also tried your code here, and it does seem to be memory-hungry. When training on ADE, I can only run this code with batch_size=2 (I have two Nvidia 1080Ti 11 GB GPUs). When I train deeplabv3+ on the Pascal VOC 2012 dataset, it runs normally with batch_size=8. Could you please give me some hints, since a small batch size will also affect the final result?

holyseven commented 5 years ago

For ADE and deeplabv3, which code were you trying? Are the image crop sizes the same? Did deeplabv3+ use sync BN?

Just for info about the code here:

The optimizer is called as optimizer.minimize(cost, self.global_step, colocate_gradients_with_ops=True). colocate_gradients_with_ops: "If True, try colocating gradients with the corresponding op."
The batch normalization implemented here is tf.nn.batch_normalization, which is slower and uses more memory than tf.nn.fused_batch_norm.
Effective batch size = FLAGS.batch_size * gpu_num.
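As a quick worked example of the last point, the helper below just applies that formula. The function name `effective_batch_size` is hypothetical, and it assumes FLAGS.batch_size is the per-GPU batch size, as the formula implies.

```python
def effective_batch_size(batch_size_per_gpu: int, gpu_num: int) -> int:
    """Apply: effective batch size = FLAGS.batch_size * gpu_num."""
    if batch_size_per_gpu < 1 or gpu_num < 1:
        raise ValueError("batch size and GPU count must both be positive")
    return batch_size_per_gpu * gpu_num


# A per-GPU batch size of 2 on two GPUs gives an effective batch size of 4.
print(effective_batch_size(2, 2))

# A per-GPU batch size of 1 on four GPUs gives the same effective size.
print(effective_batch_size(1, 4))
```

So a small per-GPU value does not necessarily mean a small effective batch when sync BN is active across several GPUs.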