kserve / modelmesh-serving

Controller for ModelMesh
Apache License 2.0

ONNX model serving and Python gRPC client #333

Open MLHafizur opened 1 year ago

MLHafizur commented 1 year ago

I exported a PyTorch model (model.pt) to ONNX:

import os

import joblib
import torch


def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()
torch_model = torch.load(os.path.join(input_path, "<>.pt"), map_location="cpu")
torch_model.eval()
dataloader = joblib.load(os.path.join(input_path, "eval_dataloader.pkl"))
# pick a sample batch as sample input
batches = [batch for batch in dataloader]
_batch = batches[0]
# inputs needed for the model
b_input_ids = _batch[0].to('cpu')
b_input_mask = _batch[1].to('cpu')
# create features dict
features = dict(input_ids = b_input_ids, attention_mask = b_input_mask)
torch.onnx.export(torch_model, args=(features), f=os.path.join(input_path, "saved_onnx_model.onnx"), opset_version=15)

Deployed the model on ModelMesh successfully. Now I am trying to build a Python gRPC client:

import numpy as np
import pandas as pd
import tritonclient.grpc as grpcclient

df = pd.read_json(data, lines=True)
df = df.fillna('')
df_excluded = df[df['excluded'] != 0]
df = df[df['excluded'] == 0]
df = df[df['full_text'].apply(lambda x: isinstance(x, str))].reset_index(drop=True)
sentences = list(df['full_text'].values)

model_in = grpcclient.InferInput("TEXT", [len(sentences)], "BYTES")
model_in.set_data_from_numpy(np.array(sentences, dtype=object).reshape(len(sentences)))
inputs = [model_in]

print(inputs)

outputs = [grpcclient.InferRequestedOutput("top1_label"), grpcclient.InferRequestedOutput("top1_probas"), grpcclient.InferRequestedOutput("top2_label"), 
           grpcclient.InferRequestedOutput("top2_probas"), grpcclient.InferRequestedOutput("adverse"), grpcclient.InferRequestedOutput("adverse_probas")
]

print(outputs)

response = grpcclient.InferenceServerClient(url="localhost:8033").infer(
     "pipeline-poc-inference", inputs, request_id="1", outputs=outputs)

print(response.get_response())

With this script I am struggling to get the output names right, so I am getting the following error:

Traceback (most recent call last):
  File "inferencing.py", line 40, in <module>
    response = grpcclient.InferenceServerClient(url="localhost:8033").infer(
  File "/home/hafizur/miniconda3/envs/onnx/lib/python3.8/site-packages/tritonclient/grpc/__init__.py", line 1446, in infer
    raise_error_grpc(rpc_error)
  File "/home/hafizur/miniconda3/envs/onnx/lib/python3.8/site-packages/tritonclient/grpc/__init__.py", line 76, in raise_error_grpc
    raise get_error_grpc(rpc_error) from None
tritonclient.utils.InferenceServerException: [StatusCode.INVALID_ARGUMENT] inference.GRPCInferenceService/ModelInfer: INVALID_ARGUMENT: unexpected inference output 'adverse' for model 'pipeline-poc-inference__isvc-211152d1e7'

Am I heading in the right direction? Is there a way to get the correct output names the model is expecting? Thanks for your help!!

MLHafizur commented 1 year ago

@tjohnson31415 @njhill It is mentioned here https://github.com/kserve/modelmesh-serving/blob/main/docs/model-formats/onnx.md that the inputs and outputs of the model can be inferred from the model data. How can I infer them to use in the client script?

tjohnson31415 commented 1 year ago

There are a couple of different ways you can determine the inputs and outputs of the exported ONNX model:

  1. Use the ModelMetadata gRPC API to query Triton for info about the loaded model (you could use get_model_metadata("pipeline-poc-inference") from the Triton client to send the request; see the sketch after this list)

  2. Inspect the ONNX model directly by loading it into memory in a Python session/script:

    import onnx
    model = onnx.load('path/to/model.onnx')
    print(model.graph.input)
    print(model.graph.output)
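
A minimal sketch of option 1, assuming the tritonclient.grpc client and the port-forwarded endpoint from your script:

import tritonclient.grpc as grpcclient

# Ask the server for the loaded model's metadata; the response lists each
# input/output tensor with its name, datatype, and shape.
client = grpcclient.InferenceServerClient(url="localhost:8033")
metadata = client.get_model_metadata("pipeline-poc-inference")
print(metadata)
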
MLHafizur commented 1 year ago

Thanks @tjohnson31415, the first option is working well. I got the model metadata:

platform: "onnxruntime_onnx"
inputs {
  name: "attention_mask"
  datatype: "INT64"
  shape: 32
  shape: 64
}
inputs {
  name: "input.1"
  datatype: "INT64"
  shape: 32
  shape: 64
}
outputs {
  name: "1643"
  datatype: "FP32"
  shape: 32
  shape: 16
}

But now I am struggling to figure out an input data type issue, even though I have tried many different approaches:

Traceback (most recent call last):
  File "predict.py", line 97, in <module>
    response = grpc_stub.ModelInfer(request)
  File "/home/hafizur/miniconda3/envs/onnx/lib/python3.8/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/hafizur/miniconda3/envs/onnx/lib/python3.8/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.INVALID_ARGUMENT
        details = "inference.GRPCInferenceService/ModelInfer: INVALID_ARGUMENT: unexpected explicit tensor data for input tensor 'attention_mask' for model 'pipeline-poc-inference__isvc-211152d1e7' of type 'INT32', expected datatype 'INT64'"
        debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B::1%5D:8033 {created_time:"2023-02-22T15:50:59.6900669-05:00", grpc_status:3, grpc_message:"inference.GRPCInferenceService/ModelInfer: INVALID_ARGUMENT: unexpected explicit tensor data for input tensor \'attention_mask\' for model \'pipeline-poc-inference__isvc-211152d1e7\' of type \'INT32\', expected datatype \'INT64\'"}"

The input data look like:
[[  101  2424  2041 ...  1997  1037   102]
 [  101  1996  2343 ...  9228  2003   102]
 [  101  9710 22002 ...  1010  2256   102]
 ...
 [  101 26624  2139 ...  2055  4825   102]
 [  101  2508 22889 ...     0     0     0]
 [  101 10352 10958 ...  2053  2386   102]]
[[1 1 1 ... 1 1 1]
 [1 1 1 ... 1 1 1]
 [1 1 1 ... 1 1 1]
 ...
 [1 1 1 ... 1 1 1]
 [1 1 1 ... 0 0 0]
 [1 1 1 ... 1 1 1]]
b_input_ids data type: int64
b_input_ids shape: (32, 64)
b_input_mask data type: int64
b_input_mask shape: (32, 64)

Here is the final code I tried to run:

import os

import grpc
import joblib
import numpy as np
# KServe v2 inference gRPC stubs (here assumed to come from the tritonclient package)
from tritonclient.grpc import service_pb2, service_pb2_grpc


def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

input_path = "."

# load the needed input (as needed by ONNX)
dataloader = joblib.load(os.path.join(input_path, "eval_dataloader.pkl"))

# pick a sample batch as sample input
batches = [batch for batch in dataloader]
_batch = batches[0]
# inputs needed for the model
#b_input_ids = _batch[0].to('cpu').long().to(torch.int64)
b_input_ids = _batch[0].to('cpu')
b_input_ids = to_numpy(b_input_ids)
#b_input_mask = _batch[1].to('cpu').long().to(torch.int64)
b_input_mask = _batch[1].to('cpu')
b_input_mask = to_numpy(b_input_mask)
print(b_input_ids)
print(b_input_mask)

print("b_input_ids data type:", b_input_ids.dtype)
print("b_input_ids shape:", b_input_ids.shape)
print("b_input_mask data type:", b_input_mask.dtype)
print("b_input_mask shape:", b_input_mask.shape)

# Send request to the server
grpc_channel = grpc.insecure_channel("localhost:8033")
grpc_stub = service_pb2_grpc.GRPCInferenceServiceStub(grpc_channel)

model_name = "pipeline-poc-inference"
model_version = ""

request = service_pb2.ModelMetadataRequest(name=model_name,
                                               version=model_version)
response = grpc_stub.ModelMetadata(request)
print("model metadata:\n{}".format(response))

#b_input_mask = b_input_mask.astype('int64')

#b_input_mask = b_input_mask.astype(np.int64)

b_input_mask = np.array(b_input_mask, dtype=np.int64)

# Infer
request = service_pb2.ModelInferRequest()
request.model_name = model_name
request.model_version = model_version
request.id = "my request id"

input0 = service_pb2.ModelInferRequest().InferInputTensor()
input0.name = "attention_mask"
# input0.datatype = "INT64"
# input0.shape.extend([32, 64])
# #input0.contents.int_contents[:] = b_input_mask
# #input0.contents.int_contents[:] = b_input_mask.tolist()
# #input0.contents.int_contents[:] = list(map(int, b_input_mask.tolist()))
# input0.contents.int_contents[:] = list(map(int, b_input_mask.ravel().tolist()))

input1 = service_pb2.ModelInferRequest().InferInputTensor()
input1.name = "input.1"
# input1.datatype = "INT64"
# input1.shape.extend([32, 64])
# #input0.contents.int_contents[: : ] = b_input_ids
# #input1.contents.int_contents[:] = b_input_ids.tolist()
# #input1.contents.int_contents[:] = list(map(int, b_input_ids.tolist()))
# input1.contents.int_contents[:] = list(map(int, b_input_ids.ravel().tolist()))

input0.datatype = "INT64"
input0.shape.extend([b_input_mask.shape[0], b_input_mask.shape[1]])
input0.contents.int_contents.extend(b_input_mask.ravel().tolist())

input1.datatype = "INT64"
input1.shape.extend([b_input_ids.shape[0], b_input_ids.shape[1]])
input1.contents.int_contents.extend(b_input_ids.ravel().tolist())

request.inputs.extend([input0, input1])

output0 = service_pb2.ModelInferRequest().InferRequestedOutputTensor()
output0.name = "1643"
request.outputs.extend([output0])

response = grpc_stub.ModelInfer(request)

print("response:\n{}".format(response))

What could be the issue?

tjohnson31415 commented 1 year ago

I think the next issue is that you are using input0.contents.int_contents.extend() instead of input0.contents.int64_contents.extend() (note int vs int64). int_contents is for 32-bit integers (ref).
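
A minimal, self-contained sketch of the fix (the stub import path and the stand-in array here are just for illustration):

import numpy as np
from tritonclient.grpc import service_pb2  # or wherever your v2 gRPC stubs come from

# INT64 tensor data belongs in int64_contents; int_contents holds 32-bit integers.
b_input_mask = np.ones((32, 64), dtype=np.int64)  # stand-in for the real attention mask

tensor = service_pb2.ModelInferRequest().InferInputTensor()
tensor.name = "attention_mask"
tensor.datatype = "INT64"
tensor.shape.extend(b_input_mask.shape)
tensor.contents.int64_contents.extend(b_input_mask.ravel().tolist())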

MLHafizur commented 1 year ago

Hi @tjohnson31415, I noticed that and fixed it. Thank you very much for your support. But now I am struggling with an error from the server. I am port-forwarding:

(base) hafizur@TOR-RAHMANHAFIZ:~$ kubectl port-forward service/modelmesh-serving 8033 -n modelmesh-serving-dev
Forwarding from 127.0.0.1:8033 -> 8033
Forwarding from [::1]:8033 -> 8033
Handling connection for 8033
Traceback (most recent call last):
  File "predict.py", line 47, in <module>
    response = grpc_stub.ModelMetadata(request)
  File "/home/hafizur/miniconda3/envs/onnx/lib/python3.8/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/hafizur/miniconda3/envs/onnx/lib/python3.8/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.INTERNAL
        details = "Nowhere available to load"
        debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B::1%5D:8033 {grpc_message:"Nowhere available to load", grpc_status:13, created_time:"2023-02-22T18:40:56.3422454-05:00"}"
>

The logs from the mm container in the Triton runtime pod are:

"instant":{"epochSecond":1677108337,"nanoOfSecond":213884861},"thread":"ll-conn-retry-thread-1","level":"ERROR","loggerName":"com.ibm.watson.litelinks.client.ServiceInstance","message":"Failed to open new connection to 10.244.54.20:8080;v=20230111-f9487: com.ibm.watson.litelinks.TTimeoutException: opening new channel failed: /10.244.54.20:8080 (TIMED_OUT)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":272,"threadPriority":5}
{"instant":{"epochSecond":1677108343,"nanoOfSecond":569292479},"thread":"ll-conn-retry-thread-2","level":"ERROR","loggerName":"com.ibm.watson.litelinks.client.ServiceInstance","message":"Failed to open new connection to 10.244.41.8:8080;v=20230111-f9487: com.ibm.watson.litelinks.WTTransportException: opening new channel failed: /10.244.41.8:8080 (UNKNOWN)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":273,"threadPriority":5}
{"instant":{"epochSecond":1677108351,"nanoOfSecond":202538341},"thread":"invoke-ho-pipeline-poc-inference__isvc-211152d1e7","level":"WARN","loggerName":"com.ibm.watson.modelmesh.SidecarModelMesh","message":"Triggered \"cleanup\" unload for model pipeline-poc-inference__isvc-211152d1e7 after unexpected NOT_FOUND received from inference request","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":56,"threadPriority":5}
{"instant":{"epochSecond":1677108351,"nanoOfSecond":202657349},"thread":"invoke-ho-pipeline-poc-inference__isvc-211152d1e7","level":"ERROR","loggerName":"com.ibm.watson.modelmesh.SidecarModelMesh","message":"Error invoking inference.GRPCInferenceService/ModelMetadata method on model pipeline-poc-inference__isvc-211152d1e7: UNAVAILABLE: Request for unknown model: 'pipeline-poc-inference__isvc-211152d1e7' is not found","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":56,"threadPriority":5}
{"instant":{"epochSecond":1677108351,"nanoOfSecond":204890492},"thread":"invoke-ho-pipeline-poc-inference__isvc-211152d1e7","level":"WARN","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"ModelRuntime in instance c4dc88-c79rn returned unexpected NOT_FOUND for model pipeline-poc-inference__isvc-211152d1e7; purging from local cache and registration","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":56,"threadPriority":5}
{"instant":{"epochSecond":1677108364,"nanoOfSecond":941712037},"thread":"janitor-task","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Janitor registry pruning task took 2ms for 0/11 entries","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":40,"threadPriority":5}
{"instant":{"epochSecond":1677108374,"nanoOfSecond":42354088},"thread":"ll-conn-retry-thread-2","level":"ERROR","loggerName":"com.ibm.watson.litelinks.client.ServiceInstance","message":"Failed to open new connection to 10.244.41.8:8080;v=20230111-f9487: com.ibm.watson.litelinks.WTTransportException: opening new channel failed: /10.244.41.8:8080 (UNKNOWN)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":273,"threadPriority":5}
{"instant":{"epochSecond":1677108374,"nanoOfSecond":837546364},"thread":"ll-conn-retry-thread-1","level":"ERROR","loggerName":"com.ibm.watson.litelinks.client.ServiceInstance","message":"Failed to open new connection to 10.244.54.20:8080;v=20230111-f9487: com.ibm.watson.litelinks.TTimeoutException: opening new channel failed: /10.244.54.20:8080 (TIMED_OUT)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":272,"threadPriority":5}
{"instant":{"epochSecond":1677108375,"nanoOfSecond":706869303},"thread":"mm-task-thread-1","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Published new instance record: InstanceRecord [lruTime=never, count=0, capacity=113542, used=0 (0%), loc=10.214.4.65, zone=<none>, labels=[mt:keras, mt:keras:2, mt:onnx, mt:onnx:1, mt:pytorch, mt:pytorch:1, mt:tensorflow, mt:tensorflow:1, mt:tensorflow:2, mt:tensorrt, mt:tensorrt:7, pv:grpc-v2, pv:v2, rt:triton-2.x], startTime=1676346243353 (9 days ago), vers=0, loadThreads=2, loadInProg=0, reqsPerMin=0], UBW=1146, TUW=0, TCO=0","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":39,"threadPriority":5}
{"instant":{"epochSecond":1677108405,"nanoOfSecond":715101251},"thread":"ll-conn-retry-thread-2","level":"ERROR","loggerName":"com.ibm.watson.litelinks.client.ServiceInstance","message":"Failed to open new connection to 10.244.41.8:8080;v=20230111-f9487: com.ibm.watson.litelinks.WTTransportException: opening new channel failed: /10.244.41.8:8080 (UNKNOWN)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":273,"threadPriority":5}

I don't see any useful logs in the triton and triton-adapter containers.

tjohnson31415 commented 1 year ago

This issue looks a bit tougher 🤔

The "Failed to open new connection" errors to port 8080 indicate a communication issue between the ModelMesh containers (port 8080 is what the mm containers use to communicate with each other). The unexpected NOT_FOUND errors indicate that mm thought the model was loaded, but the runtime reported that it wasn't. My hunch from these errors is that there is a networking issue and/or containers are restarting. Are there any restarts reported on the pods? If there are restarts, my guess would be that it is a memory issue, but you can check the kubectl describe output to see what exit code is reported.
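
If it helps, here is a hypothetical way to do that check with the official kubernetes Python client instead of kubectl describe (the namespace below is taken from your port-forward command):

from kubernetes import client, config

# Print restart counts and the exit code of the last terminated container
# for each pod in the serving namespace.
config.load_kube_config()
v1 = client.CoreV1Api()
for pod in v1.list_namespaced_pod("modelmesh-serving-dev").items:
    for cs in pod.status.container_statuses or []:
        last = cs.last_state.terminated if cs.last_state else None
        print(pod.metadata.name, cs.name, "restarts:", cs.restart_count,
              "last exit code:", last.exit_code if last else None)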

MLHafizur commented 1 year ago

The pod restarted the first time I tried; later attempts did not hit any restarts, just the error.

modelmesh-serving-triton-2.x-54fbc4dc88-c79rn                     4/4     Running   1 (95m ago)   8d
modelmesh-serving-triton-2.x-54fbc4dc88-t2mlg                     4/4     Running   0             31h
tjohnson31415 commented 1 year ago

Hmm. It seems that you had the model loaded and could use the ModelMetadata call, but then, when trying an inference call, the Triton container crashed and restarted? It is still worth looking into that restart to confirm and see the reason. If it was OOMKilled, then you should increase the memory allocation in the ServingRuntime.

Even with the runtime container restarting, ModelMesh should be able to recover and load the model in another pod, but I think the connection errors between mm pods are preventing that from happening (and I'm not sure why that is happening).

I would try:

  1. Delete the InferenceService
  2. Restart all runtime pods (kubectl rollout restart deployment modelmesh-serving-triton-2.x will do)
  3. Create the InferenceService
  4. Confirm the model loaded with a ModelMetadata call
  5. Try the inference again
  6. See if the same behavior of a restart followed by "Nowhere available to load" errors occurs
MLHafizur commented 1 year ago

Hi @tjohnson31415, it's interesting: I was able to get the model metadata yesterday, but with the same code and setup I am now getting the "Nowhere available to load" error:

import os

import grpc
import joblib
# KServe v2 inference gRPC stubs (assumed to come from the tritonclient package, as above)
from tritonclient.grpc import service_pb2, service_pb2_grpc


def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

input_path = "."

# load the needed input (as needed by ONNX)
dataloader = joblib.load(os.path.join(input_path, "eval_dataloader.pkl"))

# pick a sample batch as sample input
batches = [batch for batch in dataloader]
_batch = batches[0]
# inputs needed for the model
#b_input_ids = _batch[0].to('cpu').long().to(torch.int64)
b_input_ids = _batch[0].to('cpu')
b_input_ids = to_numpy(b_input_ids)
#b_input_mask = _batch[1].to('cpu').long().to(torch.int64)
b_input_mask = _batch[1].to('cpu')
b_input_mask = to_numpy(b_input_mask)

# Send request to the server
grpc_channel = grpc.insecure_channel("localhost:8033")
grpc_stub = service_pb2_grpc.GRPCInferenceServiceStub(grpc_channel)

model_name = "pipeline-poc-inference"
model_version = ""

request = service_pb2.ModelMetadataRequest(name=model_name,
                                               version=model_version)
response = grpc_stub.ModelMetadata(request)
print("model metadata:\n{}".format(response))

error:

Traceback (most recent call last):
  File "test.py", line 42, in <module>
    response = grpc_stub.ModelMetadata(request)
  File "/home/hafizur/miniconda3/envs/onnx/lib/python3.8/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/hafizur/miniconda3/envs/onnx/lib/python3.8/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.INTERNAL
        details = "Nowhere available to load"
        debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B::1%5D:8033 {grpc_message:"Nowhere available to load", grpc_status:13, created_time:"2023-02-23T16:30:31.2467194-05:00"}"
njhill commented 1 year ago

@MLHafizur did you check whether the Triton container crashed/restarted again? This is probably what the "Nowhere available to load" indicates.

MLHafizur commented 1 year ago

@njhill The pods are up and running, with no restarts or crashes. I also increased the resources and restarted the etcd pod; nothing helped. We are blocked here. Any help would be appreciated.

tjohnson31415 commented 1 year ago

Hmm, it is interesting that the model worked before and cannot load at all now. Could you try to load one of the sample models (or another model that you know has worked for you) to see if the load/inference failures are particular to this model?

Another idea to try would be to create a copy of the InferenceService with a new name to see if that can load. This would check if there are internal references to the model that are not being cleaned up when it is deleted (which shouldn't happen, but 🤷).

If no models can load, it might be time to try a full re-install and see if this situation is reproducible.

tjohnson31415 commented 1 year ago

Hello @MLHafizur. Do you have any updates on this issue? Are you still experiencing the "Nowhere available to load" errors?

MLHafizur commented 1 year ago

Hi @tjohnson31415, unfortunately the steps above did not work. It is also not possible for us to re-install ModelMesh, so we are still getting the same error.