hercule24 opened this issue 1 year ago (status: Open)
I've shared this internally, hopefully someone who can address this will get back to you soon!
Hey @ddunl , is there any update?
I still haven't heard back from anyone internally. I'd be interested to know whether this problem still occurs on a more recent version of TF; TF 2.4 is nearly 2 years old at this point.
I am considering migrating to TF 2.11. It will need some work, but I am hoping to learn more about the root cause before I migrate.
I think it'll be easier for me to find someone who can help if you can provide a minimal reproducible example; it's quite difficult to debug without seeing the code.
Well... I don't think I can share the code, since company policy won't allow it. That said, I believe the error is thrown from this line: https://github.com/tensorflow/tensorflow/blob/v2.4.0/tensorflow/compiler/jit/xla_device_ops.cc#L66
Also, decorating with @tf.function(experimental_compile=True) throws an "unsupported op: No registered 'SparseTensorDenseMatMul' OpKernel for XLA_GPU_JIT devices compatible with node" error, while auto clustering throws the error above.
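For context, a common way around a missing XLA kernel like this is to keep the unsupported op outside the compiled function and let XLA compile only the dense remainder. A minimal sketch (the tensors and the dense_part function are illustrative stand-ins, not from the original model):

```python
import tensorflow as tf

# SparseTensorDenseMatMul has no XLA_GPU_JIT kernel, so run the sparse
# matmul eagerly / un-compiled, and compile only the dense computation.
sp = tf.sparse.from_dense(tf.constant([[1.0, 0.0],
                                       [0.0, 2.0]]))
dense = tf.constant([[1.0], [1.0]])

@tf.function(experimental_compile=True)  # renamed jit_compile in TF >= 2.5
def dense_part(x):
    return x * 2.0  # stand-in for the rest of the model

# The sparse op stays outside the XLA cluster; only dense_part is compiled.
y = dense_part(tf.sparse.sparse_dense_matmul(sp, dense))
```

This avoids the unsupported-op error at the cost of a host/device boundary around the sparse op.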
Hey @ddunl, thank you for taking the time to look into this.
After much trial and error, removing layers or swapping them out with dummy ones, I discovered that tf.keras.layers.Embedding was the problematic layer; specifically, it's the tf.gather used internally by the Embedding layer that throws the above error.
Do you know why tf.gather is not supported? Will upgrading to TF 2.11 fix it?
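For what it's worth, a minimal standalone repro along the lines described above might look like the sketch below (layer sizes are made up; on TF 2.4 the flag is experimental_compile, later renamed jit_compile). Whether the tf.gather error actually fires depends on the device and TF version:

```python
import tensorflow as tf

# Hypothetical minimal repro: Embedding performs a tf.gather on its weight
# table, which the report says fails under XLA on GPU in TF 2.4.
emb = tf.keras.layers.Embedding(input_dim=1000, output_dim=16)

@tf.function(experimental_compile=True)  # jit_compile=True on TF >= 2.5
def lookup(ids):
    return emb(ids)  # internally tf.gather(emb.embeddings, ids)

ids = tf.constant([[1, 2, 3], [4, 5, 6]])
out = lookup(ids)  # shape (2, 3, 16)
```

A self-contained script like this is usually enough for maintainers to bisect the problem across TF versions without access to the proprietary model.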
Unfortunately I don't know; I think you may have better luck filing an issue on the tensorflow repo instead. Sorry I can't give a more helpful answer!
Hi XLA Experts,
We are using TensorFlow (2.4) together with Horovod (0.23) to do distributed training. We turned on auto clustering via tf.config.optimizer.set_jit(True). However, it throws the following error:
I am not sure if this is the right place for me to ask this question, but it would greatly help if you could take a quick look and suggest how I can further debug. Thank you in advance!
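For reference, the auto-clustering setup described above boils down to a single global switch (the same effect can be had via the TF_XLA_FLAGS=--tf_xla_auto_jit=2 environment variable):

```python
import tensorflow as tf

# Turn on XLA auto clustering globally, as in the report above; XLA will
# then try to fuse eligible ops into compiled clusters automatically.
tf.config.optimizer.set_jit(True)

# get_jit() reports the current setting; a non-empty value means
# auto clustering is enabled.
print(tf.config.optimizer.get_jit())
```

Note that with auto clustering, any op lacking an XLA kernel on the target device (like the SparseTensorDenseMatMul mentioned later in the thread) can surface as a clustering error.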