tensorflow / serving

A flexible, high-performance serving system for machine learning models
https://www.tensorflow.org/serving
Apache License 2.0

OP_REQUIRES failed at xla_ops : UNIMPLEMENTED: Could not find compiler for platform CUDA: NOT_FOUND #2214

Open kmkolasinski opened 6 months ago

kmkolasinski commented 6 months ago

Bug Report

Does TensorFlow Serving support XLA-compiled SavedModels, or am I doing something wrong?

System information

Describe the problem

Hi, I'm trying to run XLA-compiled models via TensorFlow Serving; however, it does not seem to work for me.

Here is the notebook I used to create XLA/AMP-compiled SavedModels of very simple classifiers such as ResNet50: https://github.com/kmkolasinski/triton-saved-model/blob/main/notebooks/export-classifier.ipynb
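
Roughly, such an export wraps the prediction function in a tf.function with jit_compile=True (and enables mixed precision for the AMP variant) and saves it as a SavedModel signature. Below is a minimal sketch of that approach; the model, path, and preprocessing are illustrative placeholders, and the linked notebook is the authoritative export script.

import tensorflow as tf

# AMP: compute in float16 where possible (illustrative; the notebook may differ).
tf.keras.mixed_precision.set_global_policy("mixed_float16")
model = tf.keras.applications.ResNet50(weights=None)

@tf.function(
    jit_compile=True,  # request XLA compilation of this signature
    input_signature=[tf.TensorSpec([None, 224, 224, 3], tf.uint8, name="images")],
)
def predict_images(images):
    images = tf.keras.applications.resnet50.preprocess_input(tf.cast(images, tf.float32))
    return {"probs": model(images, training=False)}

tf.saved_model.save(
    model,
    "models/resnet50-xla-amp/1",  # placeholder export directory
    signatures={"predict_images": predict_images},
)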

When running the TF Serving server, I see the following warning in the console:

tf_serving_server-1  | 2024-04-07 07:57:23.842168: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1828] OP_REQUIRES failed at xla_ops.cc:503 : UNIMPLEMENTED: Could not find compiler for platform CUDA: NOT_FOUND: could not find registered compiler for platform CUDA -- was support for that platform linked in?

I get a similar message on the client side:

_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNIMPLEMENTED
    details = "2 root error(s) found.
  (0) UNIMPLEMENTED: Could not find compiler for platform CUDA: NOT_FOUND: could not find registered compiler for platform CUDA -- was support for that platform linked in?
     [[{{function_node __inference_predict_images_90382}}{{node StatefulPartitionedCall}}]]
     [[cluster_6_1/merge_oidx_0/_655]]
  (1) UNIMPLEMENTED: Could not find compiler for platform CUDA: NOT_FOUND: could not find registered compiler for platform CUDA -- was support for that platform linked in?
     [[{{function_node __inference_predict_images_90382}}{{node StatefulPartitionedCall}}]]
0 successful operations.
0 derived errors ignored."
    debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B::1%5D:8500 {grpc_message:"2 root error(s) found.\n  (0) UNIMPLEMENTED: Could not find compiler for platform CUDA: NOT_FOUND: could not find registered compiler for platform CUDA -- was support for that platform linked in?\n\t [[{{function_node __inference_predict_images_90382}}{{node StatefulPartitionedCall}}]]\n\t [[cluster_6_1/merge_oidx_0/_655]]\n  (1) UNIMPLEMENTED: Could not find compiler for platform CUDA: NOT_FOUND: could not find registered compiler for platform CUDA -- was support for that platform linked in?\n\t [[{{function_node __inference_predict_images_90382}}{{node StatefulPartitionedCall}}]]\n0 successful operations.\n0 derived errors ignored.", grpc_status:12, created_time:"2024-04-07T09:57:23.84265796+02:00"}"
>

Exact Steps to Reproduce

You can find my repo, where I compare Triton Server (Python backend) with TF Serving, here: https://github.com/kmkolasinski/triton-saved-model. In the notebooks directory you will find the client code used below:


from dataclasses import dataclass

import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc


@dataclass(frozen=True)
class TFServingGRPCClient:
    model_name: str
    url: str = "localhost:8500"

    def predict(
        self, signature: str, *, images: np.ndarray, timeout: int = 60
    ) -> predict_pb2.PredictResponse:
        # Allow large responses (up to 1 GiB).
        options = [("grpc.max_receive_message_length", 1 << 30)]
        channel = grpc.insecure_channel(self.url, options=options)
        stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

        # Build the PredictRequest for the given model and signature.
        grpc_request = predict_pb2.PredictRequest()
        grpc_request.model_spec.name = self.model_name
        grpc_request.model_spec.signature_name = signature
        grpc_request.inputs["images"].CopyFrom(tf.make_tensor_proto(images))

        predict_response = stub.Predict(grpc_request, timeout)
        channel.close()
        return predict_response

images = tf.random.uniform((100, 224, 224, 3), minval=0, maxval=255, dtype=tf.int32)
images = tf.cast(images, tf.uint8)

client = TFServingGRPCClient("resnet50-no-opt")
result = client.predict("predict_images", images=images)
# works correctly

client = TFServingGRPCClient("resnet50-xla-amp")
result = client.predict("predict_images", images=images)
# fails with UNIMPLEMENTED: Could not find compiler for platform CUDA: NOT_FOUND:

Is this expected behavior? I am aware of this flag:

 --xla_cpu_compilation_enabled=false     bool    EXPERIMENTAL; CAN BE REMOVED ANYTIME! Enable XLA:CPU JIT (default is disabled). With XLA:CPU JIT disabled, models utilizing this feature will return bad Status on first compilation request.

however, I was not able to find any useful resources on how to use it to test my case.

singhniraj08 commented 5 months ago

@kmkolasinski, the --xla_cpu_compilation_enabled=true parameter should be passed as an additional argument to enable XLA:CPU JIT (it is disabled by default). Can you try creating a TF Serving Docker container with the additional parameter, as shown in the example, and see if model inference works? Thank you!
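
For reference, flags appended after the image name are forwarded to the tensorflow_model_server binary, so a run along these lines should pick it up (image tag, ports, volume, and model name are placeholders, not taken from your setup):

docker run --gpus all -p 8500:8500 -p 8501:8501 \
  -v "$(pwd)/models:/models" -e MODEL_NAME=resnet50-xla-amp \
  tensorflow/serving:latest-gpu --xla_cpu_compilation_enabled=true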

kmkolasinski commented 5 months ago

Hi @singhniraj08, yes, I tried this approach. Please check out the Dockerfile (see the CMD command) I used for testing: https://github.com/kmkolasinski/triton-saved-model/blob/main/tf_serving/Dockerfile

Here is the Docker Compose file I used to run the server: https://github.com/kmkolasinski/triton-saved-model/blob/main/docker-compose.yml

# First, use https://github.com/kmkolasinski/triton-saved-model/blob/main/notebooks/export-classifier.ipynb to export various classifiers
docker compose up tf_serving_server

I prepared this repository, which reproduces the issue: https://github.com/kmkolasinski/triton-saved-model/tree/main

kmkolasinski commented 5 months ago

Hi @YanghuaHuang, did you have time to take a look at this issue?

YanghuaHuang commented 4 months ago

Sorry for the late reply. Assigning to @guanxinq to triage, who has better knowledge of this.

kmkolasinski commented 4 months ago

Hi @YanghuaHuang, thanks. I just wonder whether we can use XLA-compiled models in TF Serving or not. If yes, how can we achieve it? I couldn't find any information about this.

YanghuaHuang commented 4 months ago

I think TF Serving does support XLA on CPU but not on GPU, but I could be wrong.

@gharibian Hey Dero, can you help on this? Thanks!

kmkolasinski commented 4 months ago

Thanks for the answer. If this is true, the message

Could not find compiler for platform CUDA: NOT_FOUND

makes perfect sense to me now, and that's a pity. I assumed that TF Serving uses the same C++ backend to run the SavedModel graph as the TF libraries, so any SavedModel I can run via Python code I can also run via TF Serving. Let's wait for confirmation from @gharibian.
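
For completeness, the signature does run when the SavedModel is loaded directly through the Python API, roughly along these lines (the path is a placeholder for the export directory):

import numpy as np
import tensorflow as tf

# Load the exported model directly, bypassing TF Serving, to confirm the
# XLA-compiled signature itself is runnable under plain TensorFlow.
loaded = tf.saved_model.load("models/resnet50-xla-amp/1")  # placeholder path
predict_images = loaded.signatures["predict_images"]

images = np.random.randint(0, 255, size=(8, 224, 224, 3), dtype=np.uint8)
outputs = predict_images(images=tf.constant(images))
print({name: t.shape for name, t in outputs.items()})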

kmkolasinski commented 3 months ago

Hey @gharibian, did you have time to take a look at this thread?