neuroailab / tfutils

Utilities for working with tensorflow
MIT License

Better allocating models during validation #33

Closed chengxuz closed 7 years ago

chengxuz commented 7 years ago

Currently, validation allocates a new model on the GPU, which occupies additional memory. If the input data sources have the same shape, we could share the model between validation and training and save that GPU memory.
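One way this sharing could be done is with TensorFlow's variable scoping, so the second call to the model function reuses the existing weights instead of allocating a new copy. A minimal sketch, independent of tfutils/normalnet (the model and shapes below are hypothetical; TF1 graph API via the `tf.compat.v1` shim):

```python
import tensorflow.compat.v1 as tf  # TF1-style graph API (via the compat shim)
tf.disable_v2_behavior()

def model(inputs, reuse=False):
    # Toy model: with reuse=True, tf.get_variable returns the existing
    # weights instead of allocating a fresh copy on the GPU.
    with tf.variable_scope('model', reuse=reuse):
        w = tf.get_variable('w', shape=[4, 2],
                            initializer=tf.ones_initializer())
        return tf.matmul(inputs, w)

train_in = tf.placeholder(tf.float32, [None, 4])
val_in = tf.placeholder(tf.float32, [None, 4])

train_out = model(train_in)           # first call allocates the variables
val_out = model(val_in, reuse=True)   # second call shares them

# Only one copy of 'model/w' exists in the graph.
print(len(tf.trainable_variables()))  # 1
```

This requires the validation and training towers to be built from the same model function, which is what the rest of this issue works toward.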

chengxuz commented 7 years ago

Steps to make it work in the normalnet case:

  1. Put a TensorFlow switch operation in normalnet so the model can run in different phases.
  2. Allow a composite data provider that can switch between different phases and batches (modifying the queue class if needed): have separate data providers for training and for each validation, then give the input node a "cond" so it can switch between phases.
  3. In tfutils: add an option that lets the validation and training processes pass in a shared model that can switch between phases (the model function should return a third node for the phase).

In addition, it should be possible to run validation on a different GPU and a different CPU thread.
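The switching in steps 1 and 2 can be sketched with `tf.cond` selecting between two same-shaped input pipelines; the data providers below are hypothetical stand-ins, not tfutils code (TF1 graph API via the `tf.compat.v1` shim):

```python
import tensorflow.compat.v1 as tf  # TF1-style graph API (via the compat shim)
tf.disable_v2_behavior()

# A boolean "phase" placeholder picks which input pipeline feeds the
# shared model; both providers must yield tensors of the same shape.
is_training = tf.placeholder(tf.bool, shape=[], name='phase')

train_batch = tf.fill([8, 4], 1.0)  # stand-in for the training data provider
val_batch = tf.fill([8, 4], 0.0)    # stand-in for a validation data provider

# tf.cond routes exactly one of the two pipelines into the model input,
# so a single model instance serves both phases.
inputs = tf.cond(is_training, lambda: train_batch, lambda: val_batch)

with tf.Session() as sess:
    print(sess.run(inputs, {is_training: True}).mean())   # 1.0
    print(sess.run(inputs, {is_training: False}).mean())  # 0.0
```

The `phase` placeholder is the "third node" step 3 asks the model function to return, so tfutils can flip it between training and validation runs.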

yamins81 commented 7 years ago

@qbilius, just to let you know, @chengxuz is going to work on this issue ASAP since it's important for his work. However, once he has a draft of the modifications in his new branch, I'd like you to go over the code and review it for him, since you have good taste in code cleanliness. (I'll be in Japan.)

qbilius commented 7 years ago

Sure. I'm curious to see his solution to this issue.


yamins81 commented 7 years ago

How's progress on this issue going? I'd prioritize this above some of the more cosmetic stuff...

chengxuz commented 7 years ago

I was working on building the same model for both validation and training and ran into some bugs. I'm still trying to fix them.

chengxuz commented 7 years ago

I implemented a version of this and pushed my modifications to the branch "valAlloc". However, I found that allocating the validation model as the same model did not decrease memory usage, so there seems to be no need for this implementation.

We also tested some scenarios for TensorFlow GPU memory allocation. With `config.gpu_options.allow_growth` set to True, an AlexNet model with a batch size of 256 occupies roughly 4.5GB. Changing the batch size to 512 raises that to 8.4GB, but changing it to 1024 only takes it to 11.7GB (the maximum for the GPU used), and the speed is very similar to 4 times that under 256. So for memory allocation on one GPU, there seems to be a hierarchy of allocation strategies: the fastest strategy takes more memory, and in this situation those strategies do not differ much.

We also ran some tests of memory allocation across 2 GPUs. Again with `config.gpu_options.allow_growth` set to True, no matter how large the model is, TensorFlow will not occupy more than 115MB on the second GPU (even when memory allocation on the first GPU fails). Starting a second training run with the same configuration cannot make TensorFlow use the second GPU either. To use the second GPU, one needs to state it explicitly in the model definition with something like `tf.device('/gpu:1')`.

Basically, TensorFlow seems clever about allocating memory within one GPU but needs explicit instructions to begin allocating on a second GPU. I will keep the changes in the branch "valAlloc"; feel free to delete it if other things have changed too much for those changes to stay valuable.