Closed mhbassel closed 1 year ago
There is a related issue here: https://github.com/google/jax/issues/11190
EDIT: Running the server with TensorFlow V1, the error message changes slightly and becomes a warning:
2022-08-04 16:08:23.881529: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1820] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
2022-08-04 16:08:24.029538: W tensorflow/core/common_runtime/process_function_library_runtime.cc:688] Ignoring multi-device function optimization failure: Invalid argument: Node '_arg_image_tensor_0_0_0_arg': Node name contains invalid characters
2022-08-04 16:08:24.313556: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2022-08-04 16:08:24.581572: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: couldn't get temp CUBIN file name
Relying on driver to perform ptx compilation. This message will be only logged once.
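The XLA warning in the log above is informational; as the message itself says, XLA:CPU can be enabled explicitly via an environment variable. A minimal sketch of the setup the warning describes, with the flag values taken verbatim from the message (whether enabling XLA is actually desirable here is a separate question):

```shell
# Enable XLA:CPU clustering before launching the server,
# as suggested by the warning above.
export TF_XLA_FLAGS=--tf_xla_cpu_global_jit
# One way the warning suggests to confirm XLA is active.
export XLA_FLAGS=--xla_hlo_profile
echo "TF_XLA_FLAGS=$TF_XLA_FLAGS"
```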
Sorry, I am not sure whether this issue is even related to Triton anymore!
The error is indeed coming within the model.
2022-08-04 16:08:24.581572: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: couldn't get temp CUBIN file name
Relying on driver to perform ptx compilation. This message will be only logged once.
Does inference run successfully on the model after this one-time message?
Can you load and run inference on the model outside Triton?
Hi Tanmay, thanks for your reply!
Does inference run successfully on the model after this one-time message?
Yes, it does.
Can you load and run inference on the model outside Triton?
I did, and it worked too; I tried both CPU and GPU.
FYI, on my host machine I have CUDA 11.2 and cuDNN 8.1, and I had a TF object detection training running while I was working with Triton. Unfortunately, I started to see the problem there as well, though only as a warning; I am using TF 2.8. The earlier training logs did not show those warning messages, however. I am still trying to find out the cause; maybe it is my machine (?).
Hi again @tanmayv25. I wanted to tell you that the problem was somehow resolved after a system reboot! I rebooted the machine because of GPU problems, such as Unable to determine the device handle for GPU 0000:82:00.0: Unknown Error
when running nvidia-smi, and Xid 79, GPU has fallen off the bus
in the system logs (maybe an overheating problem, since I was running many things at once?). After the reboot, the error did not occur, the Triton server worked normally, and I ran inference against it successfully.
I don't really know what the exact cause was.
Interesting. Thanks for the update, and glad you were able to resolve it. I don't know exactly what went wrong, but the problem appears to be the GPU state. Closing this issue, as it appears to be related to the model and environment rather than Triton. Please open a new issue if you have reason to believe that Triton is the cause.
Hi everyone, I am really struggling to find a solution for this problem. It happens when I run the server with a TensorFlow model on the GPUs; I get this error (full log):
config:
command:
in case needed:
It looks like it is trying to get the CUBIN path but cannot find it (?). Reference
In addition, when commenting out the optimization part in the config, or when running only on the CPU (KIND_CPU), the error does not occur. Does anyone have any idea how to solve this, please?
Thank you in advance!
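For context on the KIND_CPU workaround mentioned above: the instance kind is set in the model's config.pbtxt instance_group block. This is a hypothetical excerpt (count and layout are placeholders, not the actual config from this issue):

```
instance_group [
  {
    count: 1
    kind: KIND_CPU   # the error occurred with KIND_GPU; KIND_CPU avoided it
  }
]
```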