tensorflow / serving

A flexible, high-performance serving system for machine learning models
https://www.tensorflow.org/serving
Apache License 2.0

OP_REQUIRES failed at xla_compile_on_demand_op.cc:290 : UNIMPLEMENTED: Could not find compiler for platform CUDA: NOT_FOUND #2217

Open rb-23 opened 3 months ago

rb-23 commented 3 months ago

Bug Report

If this is a bug report, please fill out the following form in full:

System information

Describe the problem

Although CUDA and all other relevant libraries were linked in, the CUDA compiler is not found when running inference on the model server. This does not happen if I run other models with the same containers.

Exact Steps to Reproduce

  1. Download and build the relevant Singularity containers from Docker Hub with sudo singularity build. The following are the container definition files for the tensorflow-serving container and for the base tensorflow container that inference is run from.

base_tensorflow_container.def:

    Bootstrap: docker
    From: tensorflow/tensorflow:2.14.0-gpu

    %environment
        export PATH=${PATH}:/cm/local/apps/cuda/libs/current/bin
        export MODEL_NAME=model
        export MODEL_BASE_PATH=/models

    %files
        model /models/model

tensorflow_container.def:

    Bootstrap: docker
    From: tensorflow/serving:2.14.0-gpu

    %environment
        export PATH=${PATH}:/cm/local/apps/cuda/libs/current/bin
        export MODEL_NAME=model
        export MODEL_BASE_PATH=/models

    %files
        model /models/model

  2. Build the model file to be served, save_model.py:

    import tensorflow as tf
    from transformers import TFBartForConditionalGeneration, BartTokenizer
    import numpy as np

    class MyOwnModel(tf.Module):
        def __init__(self, model_path="facebook/bart-large-cnn"):
            super(MyOwnModel, self).__init__()
            self.model = TFBartForConditionalGeneration.from_pretrained(model_path, no_repeat_ngram_size=None)

        @tf.function(input_signature=[tf.TensorSpec(shape=[1, 1024], dtype=tf.int32, name="input_ids")])
        def serving(self, input_ids):
            return self.model.generate(input_ids=input_ids)

    model = MyOwnModel()
    export_dir = "./shaped_input_model"

    tf.saved_model.save(model, export_dir, signatures={"serving_default": model.serving})


3. Run the tensorflow singularity container with `singularity run --nv -B shaped_input_model:/models/model/1 -B /usr/local/cuda-11.8:/usr/local/cuda-11.8 tensorflow_container.sif --per_process_gpu_memory_fraction=0.5`
4. Enter the base tensorflow container with `singularity run --nv base_tensorflow_container.sif` and run inference using the Python script

**infer.py:**

    import tensorflow as tf
    from transformers import BartTokenizer
    import json
    import numpy as np
    import requests

    article = "At least 14 people were killed and 60 others wounded Thursday when a bomb ripped through a crowd waiting to see Algeria's president in Batna, east of the capital of Algiers, the Algerie Presse Service reported. A wounded person gets first aid shortly after Thursday's attack in Batna, Algeria. The explosion occurred at 5 p.m. about 20 meters (65 feet) from a mosque in Batna, a town about 450 kilometers (280 miles) east of Algiers, security officials in Batna told the state-run news agency. The bomb went off 15 minutes before the expected arrival of President Abdel-Aziz Bouteflika. It wasn't clear if the bomb was caused by a suicide bomber or if it was planted, the officials said. Later Thursday, Algeria's Interior Minister Noureddine Yazid Zerhouni said \"a suspect person who was among the crowd attempted to go beyond the security cordon,\" but the person escaped \"immediately after the bomb exploded,\" the press service reported. Bouteflika made his visit to Batna as planned, adding a stop at a hospital to visit the wounded before he returned to the capital. There was no immediate claim of responsibility for the bombing. Algeria faces a continuing Islamic insurgency, according to the CIA. In July, 33 people were killed in apparent suicide bombings in Algiers that were claimed by an al Qaeda-affiliated group. Bouteflika said terrorist acts have nothing in common with the noble values of Islam, the press service reported. E-mail to a friend . CNN's Mohammed Tawfeeq contributed to this report."

    tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
    url = "http://localhost:8501/v1/models/model:predict"
    MAX_SHAPE = 1024

    inputs = tokenizer.encode(article, return_tensors="np")
    padding_length = 1024 - inputs.shape[1]

    inputs = np.pad(inputs, ((0, 0), (0, padding_length)), mode='constant')
    print(inputs.shape)

    inputs = inputs.tolist()

    inputs = inputs[0]

    json_data = json.dumps(
        {
            "signature_name": "serving_default",
            "inputs": inputs,
        }
    )

    json_response = requests.post(url, data=json_data)
    response = json.loads(json_response.text)
    print(f"Summary: {response}")


5. Error occurs at this point
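
For reference, the request body that infer.py sends can be sketched without TensorFlow or the tokenizer. This is a minimal sketch, assuming the token ids are already available as a plain list. `build_predict_payload`, `max_len`, and `pad_id` are hypothetical names, and the default `pad_id=0` only mirrors the zero padding that `np.pad` applies in the script.

```python
import json


def build_predict_payload(token_ids, max_len=1024, pad_id=0):
    """Right-pad token ids to max_len and wrap them in a predict request body."""
    if len(token_ids) > max_len:
        raise ValueError(f"got {len(token_ids)} tokens, limit is {max_len}")
    padded = list(token_ids) + [pad_id] * (max_len - len(token_ids))
    # Same top-level keys as the request built in infer.py.
    return json.dumps({"signature_name": "serving_default", "inputs": padded})


# Made-up token ids, for illustration only.
payload = build_predict_payload([0, 8332, 2])
body = json.loads(payload)
print(len(body["inputs"]), body["inputs"][:4])
```

The explicit length check is a small addition: in the script, an article longer than 1024 tokens would make `padding_length` negative and `np.pad` would reject it.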

### Source code / logs
**output of infer.py**
```
Summary: {'error': '2 root error(s) found.\n  (0) UNIMPLEMENTED: Could not find compiler for platform CUDA: NOT_FOUND: could not find registered compiler for platform CUDA -- was support for that platform linked in?\n\t [[{{function_node while_body_26758}}{{node while/XlaDynamicUpdateSlice}}]]\n\t [[StatefulPartitionedCall/StatefulPartitionedCall/while/body/_1058/while/tf_bart_for_conditional_generation/model/decoder/assert_less/Assert/Const_1/_1674]]\n  (1) UNIMPLEMENTED: Could not find compiler for platform CUDA: NOT_FOUND: could not find registered compiler for platform CUDA -- was support for that platform linked in?\n\t [[{{function_node while_body_26758}}{{node while/XlaDynamicUpdateSlice}}]]\n0 successful operations.\n0 derived errors ignored.'}
```
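
The error string packs the status code, the failing node, and the enclosing function into one line. A small sketch for pulling them apart, assuming the error text has the shape shown in the output of infer.py. `summarize_tf_error` is a hypothetical helper, not part of TF Serving, and the sample text is abbreviated from that output.

```python
import re


def summarize_tf_error(error_text):
    """Extract status codes, node names, and function names from an error string."""
    # Each root error starts with an index and a status code, e.g. "(0) UNIMPLEMENTED:".
    codes = re.findall(r"\(\d+\)\s+([A-Z_]+):", error_text)
    # Node and function names are wrapped in double braces.
    nodes = re.findall(r"\{\{node ([^}]+)\}\}", error_text)
    functions = re.findall(r"\{\{function_node ([^}]+)\}\}", error_text)
    return {
        "codes": sorted(set(codes)),
        "nodes": sorted(set(nodes)),
        "functions": sorted(set(functions)),
    }


sample = (
    "2 root error(s) found.\n"
    "  (0) UNIMPLEMENTED: Could not find compiler for platform CUDA: NOT_FOUND: "
    "could not find registered compiler for platform CUDA\n"
    "\t [[{{function_node while_body_26758}}{{node while/XlaDynamicUpdateSlice}}]]\n"
    "  (1) UNIMPLEMENTED: Could not find compiler for platform CUDA\n"
    "\t [[{{function_node while_body_26758}}{{node while/XlaDynamicUpdateSlice}}]]\n"
)
summary = summarize_tf_error(sample)
print(summary)
```

Both root errors point at the same node, `while/XlaDynamicUpdateSlice`, inside the same loop body function.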

**nvcc inside tensorflow_serving singularity container:**

    Singularity> nvcc -V
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2022 NVIDIA Corporation
    Built on Wed_Sep_21_10:33:58_PDT_2022
    Cuda compilation tools, release 11.8, V11.8.89
    Build cuda_11.8.r11.8/compiler.31833905_0

    Singularity> tensorflow_model_server --version
    TensorFlow ModelServer: 2.14.0-rc1
    TensorFlow Library: 2.14.0

kmkolasinski commented 3 months ago

Hi, the error suggests an issue with a dynamic loop inside the generate function (node while/XlaDynamicUpdateSlice). I believe generate creates dynamic tensors inside the loop, which is not supported. XLA errors can be hard to read.
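
One way to check this hypothesis is to list the XLA-specific op types in the exported graph. The following is a rough sketch, not a verified diagnosis. `collect_op_types` and `xla_only_ops` are hypothetical helpers, the `"Xla"` prefix test is a heuristic based on the node name in the error, and `./shaped_input_model` is the export directory from save_model.py.

```python
def collect_op_types(saved_model_dir, signature="serving_default"):
    """Return the set of op types used by a signature, including nested functions."""
    import tensorflow as tf  # imported lazily so the helper below works without TF

    loaded = tf.saved_model.load(saved_model_dir)
    graph_def = loaded.signatures[signature].graph.as_graph_def()
    op_types = {node.op for node in graph_def.node}
    # Loop bodies such as while_body_* are stored in the function library.
    for function in graph_def.library.function:
        op_types.update(node.op for node in function.node_def)
    return op_types


def xla_only_ops(op_types):
    """Heuristic: keep op types whose name starts with 'Xla'."""
    return sorted(op for op in op_types if op.startswith("Xla"))


# Made-up op list for illustration; on a real export, call
# xla_only_ops(collect_op_types("./shaped_input_model")) instead.
example_ops = {"MatMul", "While", "XlaDynamicUpdateSlice", "Softmax"}
print(xla_only_ops(example_ops))
```

If the real export lists `XlaDynamicUpdateSlice`, the model server needs a working XLA compiler for the device the op is placed on, which would be consistent with the error.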

singhniraj08 commented 3 months ago

@rb-23, can you try passing the --xla_cpu_compilation_enabled=true flag as an additional argument when running the TF Serving docker image, as shown here, and see if model inference works? Please let us know if you face any issues. Thank you!
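
Applied to the Singularity setup from the report rather than Docker, the suggestion amounts to appending the flag to the container arguments. A sketch, assuming the container's runscript forwards trailing arguments to tensorflow_model_server, as the --per_process_gpu_memory_fraction usage in step 3 implies. `serving_command` is a hypothetical helper; the paths and flags are copied from the report.

```python
import shlex


def serving_command(extra_flags=()):
    """Build the singularity command from step 3, plus any extra model server flags."""
    command = [
        "singularity", "run", "--nv",
        "-B", "shaped_input_model:/models/model/1",
        "-B", "/usr/local/cuda-11.8:/usr/local/cuda-11.8",
        "tensorflow_container.sif",
        "--per_process_gpu_memory_fraction=0.5",
    ]
    return command + list(extra_flags)


command = serving_command(["--xla_cpu_compilation_enabled=true"])
# Print the command for inspection; pass the list to subprocess.run to launch it.
print(shlex.join(command))
```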

github-actions[bot] commented 2 months ago

This issue has been marked stale because it has no recent activity since 7 days. It will be closed if no further activity occurs. Thank you.

rb-23 commented 2 months ago

> @rb-23, Can you try passing --xla_cpu_compilation_enabled=true parameter as additional argument while running TF Serving docker image as shown here and see if model inference works. Please let us know if you face any issues, Thank you!

Hi @singhniraj08, I tried doing as you suggested. Unfortunately, the same error occurs:

    2024-04-30 14:37:45.277728: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at xla_compile_on_demand_op.cc:290 : UNIMPLEMENTED: Could not find compiler for platform CUDA: NOT_FOUND: could not find registered compiler for platform CUDA -- was support for that platform linked in?