tensorflow / agents

TF-Agents: A reliable, scalable and easy to use TensorFlow library for Contextual Bandits and Reinforcement Learning.
Apache License 2.0

Does TF-Agents not support XLA? #760

Open connor-create opened 2 years ago

connor-create commented 2 years ago

I have built a dynamic step driver but cannot seem to get it to work with jit_compile=True.

from tf_agents.utils import common as tfa_common

driver = Driver()
# Driver setup elided #

driver.run = tfa_common.function(driver.run, jit_compile=True)

This leads to the following error when training executes:

W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at xla_ops.cc:287 : INVALID_ARGUMENT: Trying to access resource  (defined @ /home/connorjaynes/.pyenv/versions/3.9.0/lib/python3.9/site-packages/tf_agents/replay_buffers/tf_uniform_replay_buffer.py:155) located in device /job:localhost/replica:0/task:0/device:CPU:0 from device /job:localhost/replica:0/task:0/device:GPU:0

I'd imagine this is because the agent creates a tensor with dtype int32, which TensorFlow then places on the CPU. This is the first known issue listed here.

I have also explored the utility functions in xla.py, but they don't seem to work either: the dynamic step driver creates objects that it passes to the compiled function as kwargs, which XLA does not allow.

from tf_agents.utils import xla

driver = Driver()
# Driver setup elided #

driver.run = xla.compile_in_graph_mode(driver.run)

This leads to the following error when training executes:

kwargs are not supported for functions that are XLA-compiled, but saw kwargs: {'time_step': TimeStep(
{'discount': <tf.Tensor: shape=(32,), dtype=float32, numpy=
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
      dtype=float32)>,
 'observation': <tf.Tensor: shape=(32, 4), dtype=float32, numpy=
array([blah],
      dtype=float32)>,
 'reward': <tf.Tensor: shape=(32,), dtype=float32, numpy=
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
      dtype=float32)>,
 'step_type': <tf.Tensor: shape=(32,), dtype=int32, numpy=
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)>}), 'policy_state': ()}
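For what it's worth, the kwargs restriction itself can be worked around generically by rebinding keyword arguments to positional ones before the call reaches the compiled function. A framework-free sketch of the idea (the `positional_only` helper is invented here for illustration and is not part of tf_agents):

```python
import functools
import inspect

def positional_only(fn):
    """Wrap fn so callers may still pass kwargs, but fn itself is always
    invoked with positional arguments only -- the calling convention
    XLA-compiled functions require."""
    sig = inspect.signature(fn)

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # Bind whatever mix of args/kwargs the caller used, fill in
        # defaults, then forward everything positionally.
        bound = sig.bind(*args, **kwargs)
        bound.apply_defaults()
        return fn(*bound.args)

    return wrapper

@positional_only
def run(time_step, policy_state=()):
    return time_step, policy_state

# Keyword call sites keep working, but `run` only ever sees positionals.
result = run(time_step=1, policy_state=(2,))
```

Binding through `inspect.signature` keeps existing call sites unchanged while guaranteeing the wrapped function never receives keyword arguments.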

Is there any way to apply XLA to these functions via some sort of trickle-down configuration option, or was this never implemented or supported? I don't see anything about this usage in the documentation either.

Thanks.

sguada commented 2 years ago

You can try overriding the dtype of the step_type in the Policy given to the Driver.
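Sketched without the framework, the override amounts to rebuilding the spec with a wider step_type dtype while leaving every other leaf alone (the string dtypes below stand in for real tf.TensorSpec objects, and `TimeStepSpec` here is an illustrative stand-in, not the tf_agents class):

```python
from typing import NamedTuple

class TimeStepSpec(NamedTuple):
    step_type: str
    reward: str
    discount: str
    observation: str

def widen_step_type(spec: TimeStepSpec) -> TimeStepSpec:
    # int32 is the dtype TensorFlow keeps on the host by default, so
    # only the step_type leaf is widened; everything else is untouched.
    return spec._replace(step_type="int64")

spec = TimeStepSpec(step_type="int32", reward="float32",
                    discount="float32", observation="float32")
patched = widen_step_type(spec)
```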

sikanrong commented 2 years ago

@sguada Hello! I am also on @connor-create's team.

We tried this yesterday and found that, indeed, making step_type an int64 places it correctly on the GPU. However, XLA still raises similar errors, since some extraneous TF variables of enum dtype remain on the CPU. We will keep looking for a similar fix for this related issue.

Is there some working example of an XLA-enabled RL training loop with tf_agents? That would be very helpful to us.

These are the remaining vars that seem to be giving problems:

-------------------------------------------------------------------------------------------------
tf.Tensor(<ResourceHandle(name="Resource-24-at-0x55eb9c29f270", device="/job:localhost/replica:0/task:0/device:CPU:0", container="Anonymous", type="tensorflow::Var", dtype and shapes : "[ DType enum: 9, Shape: [32000] ]")>, shape=(), dtype=resource)
-------------------------------------------------------------------------------------------------
tf.Tensor(<ResourceHandle(name="Resource-18-at-0x55eb9c24feb0", device="/job:localhost/replica:0/task:0/device:CPU:0", container="Anonymous", type="tensorflow::Var", dtype and shapes : "[ DType enum: 9, Shape: [32000] ]")>, shape=(), dtype=resource)
-------------------------------------------------------------------------------------------------
tf.Tensor(<ResourceHandle(name="Resource-19-at-0x55eb9c254f00", device="/job:localhost/replica:0/task:0/device:CPU:0", container="Anonymous", type="tensorflow::Var", dtype and shapes : "[ DType enum: 1, Shape: [32000,4] ]")>, shape=(), dtype=resource)
-------------------------------------------------------------------------------------------------
tf.Tensor(<ResourceHandle(name="Resource-20-at-0x55eb9c29a3e0", device="/job:localhost/replica:0/task:0/device:CPU:0", container="Anonymous", type="tensorflow::Var", dtype and shapes : "[ DType enum: 9, Shape: [32000] ]")>, shape=(), dtype=resource)
-------------------------------------------------------------------------------------------------
tf.Tensor(<ResourceHandle(name="Resource-21-at-0x55eb9c29c1b0", device="/job:localhost/replica:0/task:0/device:CPU:0", container="Anonymous", type="tensorflow::Var", dtype and shapes : "[ DType enum: 9, Shape: [32000] ]")>, shape=(), dtype=resource)

chazzmoney commented 2 years ago

I also have questions about using XLA with tf_agents.

Is there some working example of an XLA-enabled RL training loop with tf_agents?

This would be amazing. Anyone have any info here at all?

sguada commented 2 years ago

Unfortunately the DynamicDriver has dynamic shapes and doesn't allow jit-compilation; you can, however, compile the Network or the Policy.
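A minimal sketch of that approach, jit-compiling only a network's forward pass while leaving the driver uncompiled (the Keras model and shapes are placeholders chosen for illustration, not taken from this thread):

```python
import tensorflow as tf

# Stand-in for the agent's Network: any callable with static input
# shapes can be jit-compiled even when the surrounding driver cannot.
net = tf.keras.Sequential([tf.keras.layers.Dense(2)])

@tf.function(jit_compile=True)
def act(observation):
    # The whole forward pass plus the argmax compiles as one XLA cluster.
    logits = net(observation)
    return tf.argmax(logits, axis=-1, output_type=tf.int64)

actions = act(tf.zeros([32, 4], tf.float32))  # one action per batch row
```

In tf_agents the analogous move would be wrapping the policy's action function (rather than driver.run) with common.function(..., jit_compile=True).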

dalalkrish commented 9 months ago

I'm getting a similar error while training through model.fit() with MirroredStrategy on multiple GPUs. Is there a solution or workaround to avoid XLA compilation? Also, what is the cause of this kind of error? In my case, it looks like there is a communication issue between GPU:0 and the other three GPUs. Any help is much appreciated. Thank you!