vlad17 / mve

MVE: model-based value estimation
Apache License 2.0

Accelerate MVE DDPG #351

Closed vlad17 closed 6 years ago

vlad17 commented 6 years ago

Currently the MVE pass in DDPG is very inefficient; based on initial benchmarks, it seems like all the time is spent inside the TensorFlow sess.run(optimize) call.

This is likely not due to the input pipeline: the feed dict is fine, since DDPG without MVE transfers just as much data between the CPU and GPU yet is much faster.

Instead, this is due to the additional computation that MVE requires for the model-based unrolling of the H-step horizon. My guess is that TF generates poor default CUDA kernels for this two-step process:

1. For every item in the batch, simulate H steps ahead with the model (currently a Python for loop creating H TensorFlow nodes; this manually unrolled loop already improves over tf.while_loop).

2. In reverse, compute the losses for each of the timesteps. If TF has a smart XLA optimizer, this isn't going to get any faster. One immediate improvement that comes to mind is automatic batch-size propagation, since the batch size is always fixed; then TF can pre-allocate tensors. But I didn't get this to work when I tried it.
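For concreteness, step 1 above can be sketched as below. This is a minimal numpy illustration of the manually unrolled rollout, not the actual TF graph; `step_model`, `policy`, and all shapes are hypothetical stand-ins for the learned dynamics model and the DDPG actor.

```python
import numpy as np

H, B, obs_dim, act_dim = 5, 32, 3, 2  # horizon and batch size (hypothetical)

def step_model(obs, acts):
    """Hypothetical learned dynamics: next observation from (obs, action)."""
    return obs + 0.1 * acts.sum(axis=1, keepdims=True)

def policy(obs):
    """Hypothetical deterministic policy."""
    return np.tanh(obs[:, :act_dim])

rng = np.random.default_rng(0)
obs = rng.standard_normal((B, obs_dim))

# Manually unrolled H-step rollout: one "node" per step, mirroring the
# Python for loop that builds H TensorFlow ops in the graph.
rollout = []
for _ in range(H):
    acts = policy(obs)
    obs = step_model(obs, acts)
    rollout.append(obs)

rollout = np.stack(rollout)  # shape (H, B, obs_dim)
```

Each loop iteration corresponds to one set of graph nodes, which is why the graph (and its kernels) grows linearly with H.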

vlad17 commented 6 years ago

In other words, up to a 2x speedup (likely closer to 1.5x) is possible by making a single call to critic.tf_critic (and then the mean squared error) over the entire horizon, rather than one reverse-mode call per timestep.
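A sketch of that idea, assuming the per-step rollout states have already been stacked into an (H, B, obs_dim) tensor: fold the horizon into the batch dimension so the critic and the MSE run in one call instead of H. The `critic` function and `targets` here are hypothetical stand-ins for critic.tf_critic and the TD targets.

```python
import numpy as np

H, B, obs_dim = 5, 32, 3

def critic(obs):
    """Hypothetical critic: one scalar Q-value per input row."""
    return obs.sum(axis=1)

rng = np.random.default_rng(1)
rollout = rng.standard_normal((H, B, obs_dim))   # stacked H-step states
targets = rng.standard_normal((H, B))            # hypothetical TD targets

# Instead of H separate critic calls (one per timestep), fold the horizon
# into the batch dimension and make a single call over H*B rows.
q_flat = critic(rollout.reshape(H * B, obs_dim)).reshape(H, B)
loss = np.mean((q_flat - targets) ** 2)

# Equivalent per-timestep computation, for comparison:
loss_loop = np.mean([np.mean((critic(rollout[t]) - targets[t]) ** 2)
                     for t in range(H)])
assert np.isclose(loss, loss_loop)
```

The two losses are identical; the batched form just gives TF one large matmul-shaped op instead of H small ones, which is where the speedup would come from.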