tensorflow / fold

Deep learning with dynamic computation graphs in TensorFlow
Apache License 2.0

Is it possible to do conditional branching during evaluation? #8

Open MycChiu opened 7 years ago

MycChiu commented 7 years ago

Hi, first of all, thanks for sharing this library. It's really awesome.

I was trying to implement the Hierarchical Multiscale Recurrent Neural Network (HM-RNN, https://arxiv.org/pdf/1609.01704.pdf), where the network needs to decide which path to take depending on the boundary gate from the previous time step and that of the lower layer.

At first, I thought this library was a perfect fit for the task, but after reading the examples and the source code for loom and weaver, I got the impression that you need to decide the structure of the graph for each input prior to evaluation, so that weaver can figure out an efficient way to batch the inputs at each depth.

This means weaver can't do much in the case of the HM-RNN, where the path each input takes depends on intermediate values computed during evaluation. Is this understanding correct? Or is there a good way to implement this in Fold efficiently?

delesley commented 7 years ago

Your understanding is correct. This is an issue that we are very much aware of, but TensorFlow has some fundamental limitations that make it difficult. The problem essentially boils down to the fact that the only way to get a result back from TensorFlow is to call session.run(), but you can't backpropagate across multiple calls to run(). There is thus no way for control flow to depend on intermediate results. We have been discussing several possible strategies for solving the problem, but there's no easy fix.

-DeLesley


MycChiu commented 7 years ago

I see, thanks for the answer. I actually managed to achieve this with tf.dynamic_partition on the batch index and then use the results to gather different branches' tensors, but since this requires a lot of communication between CPU and GPU, it was painfully slow. I wonder if it is possible to implement a GPU kernel for dynamic_partition, so the process could be more efficient.
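For reference, the core of what I'm doing looks roughly like this (a stripped-down sketch; the gate and the two branch ops are just stand-ins for the real HM-RNN copy/update paths):

```python
import tensorflow as tf

hidden = 128
states = tf.placeholder(tf.float32, [None, hidden])  # [batch, hidden]
gate = tf.placeholder(tf.int32, [None])              # 0/1 boundary decision per example

w_copy = tf.get_variable("w_copy", [hidden, hidden])
w_update = tf.get_variable("w_update", [hidden, hidden])

# Split the batch (and its indices) by which branch each example should take.
batch_idx = tf.range(tf.shape(states)[0])
part_states = tf.dynamic_partition(states, gate, 2)
part_idx = tf.dynamic_partition(batch_idx, gate, 2)

# Each partition runs through its own branch.
branch0 = tf.matmul(part_states[0], w_copy)
branch1 = tf.tanh(tf.matmul(part_states[1], w_update))

# Stitch the results back into the original batch order.
next_states = tf.dynamic_stitch(part_idx, [branch0, branch1])
```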

I also looked into the new GPU queue, StagingArea (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/data_flow_ops.py#L1403). I thought we might be able to distribute the data to different queues according to their branch in one while_loop, then flush the queues and process them differently in the next loop. However, as of now, StagingArea is quite rudimentary and doesn't support batching, and I think it would also be impossible to back-prop gradients through the queue, so this seemed like a dead end to me as well.

I am also really keen on solving this problem, so if there is any way I can help, please let me know.

delesley commented 7 years ago

I think the queue is a dead end for the reasons you mention.

TensorFlow does have its own control-flow operators, so if you can confine the control-flow to within a single LoomOp then you can do branching and loops that way. What you can't do is use intermediate tensor results to drive control-flow (recursion, etc.) inside python. In addition, TensorFlow Fold's dynamic batching only applies at the granularity of LoomOps (i.e. in python); it won't work on control-flow within a LoomOp. For hierarchical multiscale networks, you may be able to loop over the characters in a single word using tf.while, and pad out a batch of words to the same length using masks of some kind -- e.g. lock the input/forget gates to 0/1 at end-of-word. Pretty clunky, I know.
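To make the masking idea a bit more concrete, here's a toy sketch in plain TensorFlow (not the Fold API; the shapes and the simple tanh update are made up purely for illustration):

```python
import tensorflow as tf

hidden = 64
chars = tf.placeholder(tf.float32, [None, None, hidden])  # [batch, max_len, hidden]
mask = tf.placeholder(tf.float32, [None, None])           # 1.0 = real char, 0.0 = padding
w = tf.get_variable("w", [2 * hidden, hidden])

max_len = tf.shape(chars)[1]
init_state = tf.zeros([tf.shape(chars)[0], hidden])

def step(t, state):
    x = chars[:, t, :]
    new_state = tf.tanh(tf.matmul(tf.concat([x, state], 1), w))
    m = mask[:, t:t + 1]
    # Padded positions keep their old state -- the "lock the gates" trick.
    return t + 1, m * new_state + (1.0 - m) * state

_, final_state = tf.while_loop(
    lambda t, s: t < max_len, step, [tf.constant(0), init_state],
    shape_invariants=[tf.TensorShape([]), tf.TensorShape([None, hidden])])
```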

As I said before, there's not a quick fix here -- doing it properly is going to involve some significant changes to the underlying framework, so I don't have a good ETA for this feature as of yet. If you're in a hurry, Dynet, Chainer, and PyTorch also support dynamic computation graphs, although they'll run significantly slower without dynamic batching.


MycChiu commented 7 years ago

I see, thanks for the suggestions. Should I close this issue for now, then?

bhack commented 7 years ago

We could leave this open and labeled for tracking.

ronghanghu commented 7 years ago

A tricky workaround I'm using is to split the forward and backward passes into multiple sess.run calls and glue them together with some auxiliary variables. For example, the first sess.run can store its results into some tf.Variable, the second sess.run continues the execution using the results stored in those variables, and so on.

For the backward pass, you also need some auxiliary variables to glue the gradients together.
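In toy form, the forward glue looks something like this (fixed shapes and made-up names, just to show the pattern):

```python
import numpy as np
import tensorflow as tf

batch, hidden = 16, 32
x = tf.placeholder(tf.float32, [batch, hidden])
w1 = tf.get_variable("w1", [hidden, hidden])
w2 = tf.get_variable("w2", [hidden, hidden])

# Segment 1: compute an intermediate result and park it in a variable.
h = tf.tanh(tf.matmul(x, w1))
stash = tf.get_variable("stash", [batch, hidden], trainable=False)
store_h = tf.assign(stash, h)

# Segment 2: continue from the stored value in a separate sess.run call.
y = tf.matmul(stash, w2)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
some_batch = np.zeros([batch, hidden], np.float32)
h_val = sess.run(store_h, feed_dict={x: some_batch})  # first run: forward + store
# ...branch in Python based on h_val...
y_val = sess.run(y)                                    # second run: continue from stash
```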

MycChiu commented 7 years ago

@ronghanghu Yeah, that's essentially what I am trying to do now, but adding the requirement of BPTT to the mix makes this a lot more complicated. EDIT: also, do you use automatic differentiation for your backward pass, or do you do the backward pass manually? Since TensorFlow doesn't store intermediate values between sess.run calls, it seems like if you use the ops generated by tf.gradients and the intermediate activations stored in tf.Variables, TensorFlow would have to do another forward pass to calculate the gradients.

ronghanghu commented 7 years ago

@MycChiu Yes, that was a big headache for me. My tricky implementation involves another forward pass when calculating the gradients, and to make it work I had to remove all the stochastic operations (e.g. dropout).

ronghanghu commented 7 years ago

Also, sess.partial_run may be a solution, but somehow I couldn't get it to work.

MycChiu commented 7 years ago

@ronghanghu Have you tried putting the dropout mask in a variable as well? Use that variable as the mask throughout the process, then just assign a new one at the start of each new batch's forward pass. It wouldn't be too much trouble if you use the "theoretically grounded" dropout method by Gal and Ghahramani, since you are supposed to reuse the mask throughout the whole sequence anyway, so you would only need to keep track of a handful of masks, depending on how many layers you have.
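Something along these lines (a quick sketch; keep_prob and the shapes are just placeholders):

```python
import tensorflow as tf

batch, hidden = 16, 128
keep_prob = 0.8

# One Bernoulli mask per batch, kept in a non-trainable variable and reused
# for every timestep / every sess.run segment until the next batch starts.
mask_var = tf.get_variable("dropout_mask", [batch, hidden], trainable=False)
new_mask = tf.floor(keep_prob + tf.random_uniform([batch, hidden])) / keep_prob
resample_mask = tf.assign(mask_var, new_mask)  # run this op once per new batch

h = tf.placeholder(tf.float32, [batch, hidden])  # some hidden state
h_dropped = h * mask_var                         # same mask across all timesteps
```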

I have always had sess.partial_run in the back of my mind, but never actually gave it a shot. I guess now would be a great time.

MycChiu commented 7 years ago

Okay, I have played a bit with sess.partial_run. The main advantage is that we can do conditional branching in the middle with Python code, but every tensor can only be fetched once, so we would have to flatten the whole path into a really long feed-forward network for this to work. Additionally, since it hands control back to Python, it would also require a lot of CPU-GPU communication, so I guess it's not much better than the tf.dynamic_partition / tf.dynamic_stitch approach I was using.
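For reference, the pattern I tried looks roughly like this (a toy example, not the actual HM-RNN code):

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, [])
gate = x > 0.0            # "boundary" decision, computed inside the graph
branch_a = x * 2.0
branch_b = x * -1.0

sess = tf.Session()
handle = sess.partial_run_setup([gate, branch_a, branch_b], [x])

# Fetch just the decision first, keeping the rest of the graph "in flight"...
take_a = sess.partial_run(handle, gate, feed_dict={x: 3.0})
# ...then branch in Python and fetch only the tensor we actually need.
if take_a:
    result = sess.partial_run(handle, branch_a)
else:
    result = sess.partial_run(handle, branch_b)
```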

MycChiu commented 7 years ago

Well... after some profiling, I realized that tf.dynamic_stitch actually doesn't have a GPU implementation, so the slowness was due to tf.dynamic_stitch constantly transferring all states back and forth between CPU and GPU at every timestep (profiled with nvvp). The observation is confirmed by this issue on TensorFlow, so if I write a CUDA kernel for tf.dynamic_stitch, my current approach could actually be viable, although I don't really know how it could be incorporated into the Fold library.

delesley commented 7 years ago

sess.partial_run is the only reasonable way that TensorFlow currently provides for this sort of task. It's possible to trap the gradients yourself and feed them back into another session.run, but that's both difficult and error-prone -- not really something that you want to be doing by hand.
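For the record, "trapping" the gradients by hand looks something like the following toy sketch (made-up shapes; note that the last run has to recompute the first segment's forward pass, which is part of why I don't recommend it):

```python
import numpy as np
import tensorflow as tf

batch, hidden = 8, 16
x = tf.placeholder(tf.float32, [batch, hidden])
w1 = tf.get_variable("w1", [hidden, hidden])
w2 = tf.get_variable("w2", [hidden, hidden])

h = tf.tanh(tf.matmul(x, w1))              # segment 1 (forward)
h_in = tf.placeholder(tf.float32, [batch, hidden])
loss = tf.reduce_sum(tf.matmul(h_in, w2))  # segment 2, fed with the value of h

grad_h, grad_w2 = tf.gradients(loss, [h_in, w2])        # backward for segment 2
grad_h_feed = tf.placeholder(tf.float32, [batch, hidden])
grad_w1 = tf.gradients(h, w1, grad_ys=grad_h_feed)[0]   # inject the trapped gradient

sess = tf.Session()
sess.run(tf.global_variables_initializer())
x_batch = np.zeros([batch, hidden], np.float32)
h_val = sess.run(h, {x: x_batch})                        # run 1: segment 1 forward
gh, gw2 = sess.run([grad_h, grad_w2], {h_in: h_val})     # run 2: segment 2 + its backward
gw1 = sess.run(grad_w1, {x: x_batch, grad_h_feed: gh})   # run 3: segment 1 backward (recomputes h)
```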

What's currently missing is any support for sess.partial_run in the TensorFlow Fold APIs. It will take somewhat substantial changes to get that working smoothly with the loom and blocks libraries.

IMO, the CPU/GPU communication issue shouldn't be too bad. If you keep the computation about which branch to take on the GPU, then you only have to transfer a single boolean to the CPU. I'm more worried about the interpreter overhead -- you have to exit the main TensorFlow interpreter loop, go all the way back to python, and then re-enter the loop; I haven't measured the cost of that overhead. Moreover, chopping the computation up into pieces like this eliminates opportunities for dynamic batching, which can also slow things down rather dramatically.

-DeLesley


bhack commented 7 years ago

In Caffe we introduced a filter-layer trick some time ago to get a conditional path, but I don't think that approach would cover your cases.

ronghanghu commented 7 years ago

My model first generates a parse tree with a neural parser (implemented in TensorFlow), then runs a TreeLSTM over that parse tree (also implemented in TensorFlow, using Fold). My current workaround for tying them together end to end is sess.partial_run. (I found partial_run didn't work at first because I'm using an interactive session; see https://github.com/tensorflow/tensorflow/issues/3233.)

Btw, regarding Caffe, there is an extension developed by Stanford people called ApolloCaffe (http://apollocaffe.com) which, of course, does not support dynamic batching.

bhack commented 7 years ago

The filter layer creates a dynamic tensor along the batch dimension.

zwep commented 4 years ago

Just a quick question... the first answer said:

"The problem essentially boils down to the fact that the only way to get a result back from TensorFlow is to call session.run(), but you can't backpropagate across multiple calls to run()."

Since in TensorFlow 2.0 there are no sessions anymore, does this mean that this issue can be resolved?

delesley commented 4 years ago

In theory, yes, but the current code still uses the session API.
