connor-create opened this issue 2 years ago (status: Open)
You can try overwriting the dtype of the step_type
in the Policy given to the Driver.
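Since a tf_agents `TimeStep` (and its spec) is a namedtuple, one way to override the `step_type` dtype is to rebuild the spec with `_replace`. The sketch below shows only the pattern with stdlib stand-in types; with tf_agents you would pass something like `tf.TensorSpec([], tf.int64, 'step_type')` instead of the placeholder strings.

```python
# Sketch of the spec-override pattern, using a stand-in namedtuple in
# place of tf_agents' TimeStep. Field values are placeholder strings;
# in real code they would be tf.TensorSpec objects.
from collections import namedtuple

TimeStep = namedtuple('TimeStep', ['step_type', 'reward', 'discount', 'observation'])

spec = TimeStep(step_type='int32', reward='float32',
                discount='float32', observation='float32')

# Override just the step_type dtype; all other fields are kept as-is.
patched = spec._replace(step_type='int64')
print(patched.step_type)  # int64
```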
@sguada Hello! I am also on @connor-create's team.
We tried this yesterday and found that, indeed, by making step_type
an int64 it is correctly placed on the GPU device. However, XLA still gives similar errors, as there are still some extraneous TF variables of enum
type that are placed on the CPU. We will continue to see whether a similar fix can be found for this related issue.
Is there some working example of an XLA-enabled RL training loop with tf_agents? That would be very helpful to us.
These are the remaining vars that seem to be giving problems:

```
tf.Tensor(<ResourceHandle(name="Resource-24-at-0x55eb9c29f270", device="/job:localhost/replica:0/task:0/device:CPU:0", container="Anonymous", type="tensorflow::Var", dtype and shapes : "[ DType enum: 9, Shape: [32000] ]")>, shape=(), dtype=resource)
tf.Tensor(<ResourceHandle(name="Resource-18-at-0x55eb9c24feb0", device="/job:localhost/replica:0/task:0/device:CPU:0", container="Anonymous", type="tensorflow::Var", dtype and shapes : "[ DType enum: 9, Shape: [32000] ]")>, shape=(), dtype=resource)
tf.Tensor(<ResourceHandle(name="Resource-19-at-0x55eb9c254f00", device="/job:localhost/replica:0/task:0/device:CPU:0", container="Anonymous", type="tensorflow::Var", dtype and shapes : "[ DType enum: 1, Shape: [32000,4] ]")>, shape=(), dtype=resource)
tf.Tensor(<ResourceHandle(name="Resource-20-at-0x55eb9c29a3e0", device="/job:localhost/replica:0/task:0/device:CPU:0", container="Anonymous", type="tensorflow::Var", dtype and shapes : "[ DType enum: 9, Shape: [32000] ]")>, shape=(), dtype=resource)
tf.Tensor(<ResourceHandle(name="Resource-21-at-0x55eb9c29c1b0", device="/job:localhost/replica:0/task:0/device:CPU:0", container="Anonymous", type="tensorflow::Var", dtype and shapes : "[ DType enum: 9, Shape: [32000] ]")>, shape=(), dtype=resource)
```
I also have questions about using XLA with tf_agents.

> Is there some working example of an XLA-enabled RL training loop with tf_agents?

This would be amazing. Does anyone have any info here at all?
Unfortunately, the DynamicDriver has dynamic shapes and doesn't allow jit-compilation; you can compile the Network or the Policy, though.
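A minimal sketch of that suggestion: instead of jit-compiling the driver loop (whose shapes vary), wrap just the fixed-shape computation, such as the network's forward pass, in `tf.function(jit_compile=True)`. The `dense` layer here is a stand-in for a tf_agents Network.

```python
import tensorflow as tf

# Stand-in for a tf_agents Network: a single dense layer.
dense = tf.keras.layers.Dense(4)

# Only the fixed-shape forward pass is XLA-compiled, not the driver loop.
@tf.function(jit_compile=True)
def forward(obs):
    return dense(obs)

out = forward(tf.ones([2, 8]))
print(out.shape)  # (2, 4)
```

The driver itself then calls into this compiled function while its own outer loop stays uncompiled.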
I'm getting a similar error while training through model.fit()
with MirroredStrategy
on multiple GPUs. Is there a solution or workaround to avoid XLA compilation? Also, what is the cause of this kind of error? In my case, it looks like there is a communication issue between GPU:0 and the other three GPUs. Any help is much appreciated. Thank you!
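Two possible knobs to try here (assumptions on my part, not verified fixes for this exact error): turning off XLA auto-clustering globally, and letting TF fall back to a compatible device instead of raising a placement error.

```python
import tensorflow as tf

# Disable XLA auto-clustering globally (assumption: the error comes
# from auto-jit rather than an explicit jit_compile=True).
tf.config.optimizer.set_jit(False)

# Allow TF to silently place an op on a compatible device instead of
# erroring when the requested device can't run it.
tf.config.set_soft_device_placement(True)
```

Both calls must run before the strategy/model is built to take effect.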
I have built a dynamic step driver but cannot seem to get it to work with jit_compile=True.
This leads to the following error when training executes:
I'd imagine this is because the tf_agent creates something with a dtype of int32, and that tensor is then placed on the CPU because of it (the first known issue here).
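One way to check that hypothesis is to inspect where TF actually places a tensor via its `.device` attribute. (That int32 values are commonly pinned to the host is a known TF placement behavior, which is one reason a GPU XLA cluster may fail to consume them.)

```python
import tensorflow as tf

# Inspect TF's actual placement of an int32 tensor. On many setups the
# device string will name the host CPU, e.g.
# '/job:localhost/replica:0/task:0/device:CPU:0'.
x = tf.constant([1, 2, 3], dtype=tf.int32)
print(x.device)
```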
I have also explored the utility functions in xla.py, but they don't seem to work either: the dynamic step driver creates objects to pass into the compiled function via kwargs, which XLA doesn't allow.
This leads to the following error when training executes:
Is there any way to apply XLA to these functions using some sort of trickle-down configuration option, or was this never implemented or supported? I don't see anything in the documentation about this usage either.
Thanks.