Closed chengxuz closed 7 years ago
Steps to make it work in the normalnet case:
In addition, it should be possible to run the validation on a different GPU and a different CPU thread.
@qbilius, just to let you know, @chengxuz is going to work on this issue ASAP since it's important for his work. However, once he has a draft of the modifications in his new branch, I'd like you to go over the code and review it for him, as you have good taste in code cleanliness. (I'll be in Japan.)
Sure. I'm curious to see his solution to this issue.
On Thu, Jan 5, 2017, 19:51 Dan Yamins notifications@github.com wrote:
How's progress on this issue going? I'd prioritize this above some of the more cosmetic stuff...
I was working on building the same model for both validation and training and ran into some bugs. Still trying to fix them.
I implemented one version of this and pushed my modifications to the "valAlloc" branch. However, I found that allocating the validation model as the same model did not decrease memory usage, so this implementation does not seem necessary.

We also tested how TensorFlow allocates GPU memory within a single GPU. With `config.gpu_options.allow_growth` set to true, an AlexNet model with a batch size of 256 occupies roughly 4.5GB. Doubling the batch size to 512 raises that to 8.4GB, but going to 1024 only raises it to 11.7GB (the maximum for the GPU we used), while the speed stays close to 4x that of batch size 256. So within one GPU there appears to be a hierarchy of allocation strategies: the fastest strategy takes more memory, and in this situation the strategies do not differ much.

We also tested memory allocation across 2 GPUs. Again with `config.gpu_options.allow_growth` set to true, TensorFlow will not occupy more than 115MB on the second GPU, no matter how large the model is (even when allocation on the first GPU fails). Starting a second training run with the same configuration does not make TensorFlow use the second GPU either. To use the second GPU, one has to place ops on it explicitly in the model definition with something like `tf.device('/gpu:1')`. In short, TensorFlow is clever about allocating memory within one GPU, but it needs explicit instructions before it will allocate on a second one.

I will keep the changes in the "valAlloc" branch. Feel free to delete it if other things have changed too much for those changes to remain useful.
Currently, validation allocates a new model on the GPU, which occupies extra memory. If the input data sources have the same shape, we could share the model between validation and training and save that GPU memory.