I'm sorry that I don't have enough time to work on this project right now, but I will come back to it and finish the TODO list as soon as possible.

I'm not sure that I understand your idea correctly. Does it mean one forward/backward pass with an N*M-sized data blob, or M forward/backward passes with an N-sized data blob? In the former case, we have to consider the memory limit: for example, 20 mini-batches of size 100 are roughly equivalent, in terms of memory usage, to a single 2000-sized mini-batch in a feedforward network. In the latter case, we have to compute the sum of gradients over several forward/backward passes. The main problem is that layers do not preserve their gradients (diffs) after a backward pass, because one backward pass always involves one weight update in Caffe. I hope we will find a clever way to deal with this issue. Thank you for sharing your idea!
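To make the memory comparison concrete, here is a rough back-of-the-envelope sketch. The hidden size, float width, and single-layer scope are assumptions for illustration only, not measurements of any particular Caffe layer:

```python
# Rough activation-memory estimate for one layer, assuming 4-byte floats and
# a hypothetical hidden size of 256; numbers are illustrative only.
def activation_bytes(batch_size, timesteps, hidden_size, bytes_per_value=4):
    # An unrolled RNN keeps one activation blob per time step,
    # so memory grows with batch_size * timesteps.
    return batch_size * timesteps * hidden_size * bytes_per_value

# A mini-batch of 100 unrolled over 20 time steps ...
rnn = activation_bytes(batch_size=100, timesteps=20, hidden_size=256)
# ... versus a feedforward mini-batch of 2000 (a single "time step").
ff = activation_bytes(batch_size=2000, timesteps=1, hidden_size=256)

print(rnn, ff)  # both ~2 MB per layer: the memory cost is comparable
```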
Thank you for your detailed explanation. My question was about the former approach, as you said. Some research seems to use a mini-batch size of 4 (quite small) for RNN updates. If 4 is not so bad as a mini-batch size, the memory requirement doesn't seem to be a big problem. For the latter, wouldn't M forward passes and one backward pass be sufficient for BP? 1) accumulate the gradient for each weight during the grouped forward passes, and 2) back-propagate, to achieve a large effective mini-batch with a 'physically' small batch.
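If I understand the accumulation idea correctly, the usual way to realize it is to sum gradients from several small forward/backward passes and apply a single weight update at the end. A minimal framework-agnostic sketch of that (the linear model, loss, and helper names are hypothetical stand-ins, not Caffe code):

```python
import numpy as np

def forward_backward(w, x, y):
    """One forward/backward pass on a small physical batch; returns the loss
    gradient w.r.t. w for a linear model with squared loss (no update here)."""
    pred = x @ w
    return x.T @ (pred - y) / len(x)

def accumulate_and_update(w, batches, lr=0.01):
    # 1) accumulate the gradient over the grouped (physical) batches ...
    total_grad = np.zeros_like(w)
    for x, y in batches:
        total_grad += forward_backward(w, x, y)
    # 2) ... then apply a single update, as if one large mini-batch was used.
    return w - lr * total_grad / len(batches)

# Example: four physical batches of size 4 behave like one mini-batch of 16.
rng = np.random.default_rng(0)
w = np.zeros((8, 1))
batches = [(rng.normal(size=(4, 8)), rng.normal(size=(4, 1))) for _ in range(4)]
w = accumulate_and_update(w, batches)
```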
The former approach seems to be a simple and good option. Thank you for your comment!
This implementation now supports mini-batch updates.
As you mentioned in the official Caffe repo, your implementation doesn't support mini-batches. What is your plan for extending it? To support N-step truncated BPTT with a mini-batch of M sequences, would introducing M sequential data layers be good enough?
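For context, one common way to feed a mini-batch of M sequences for N-step truncated BPTT (regardless of how this repository ends up implementing it) is to keep M parallel streams and slice them into consecutive N-step windows, carrying the hidden state across windows. A minimal sketch; the chunking helper and blob shapes are assumptions for illustration, not this repo's data layer:

```python
import numpy as np

def tbptt_chunks(sequences, n_steps):
    """sequences: list of M arrays, each (T, feature). Yields (N, M, feature)
    blobs; the hidden state would be carried across consecutive chunks."""
    min_len = min(len(s) for s in sequences)
    stacked = np.stack([s[:min_len] for s in sequences], axis=1)  # (T, M, F)
    for t in range(0, min_len - n_steps + 1, n_steps):
        yield stacked[t:t + n_steps]

# Example: M=4 streams of length 100, unrolled 20 steps at a time.
streams = [np.random.randn(100, 8) for _ in range(4)]
for chunk in tbptt_chunks(streams, n_steps=20):
    pass  # chunk.shape == (20, 4, 8); one forward/backward per chunk
```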