nv-morpheus / Morpheus

[BUG]: MLflow won't load phishing-bert-onnx model #1726

Open nvawood opened 1 month ago

Version

24.03

Which installation method(s) does this occur on?

Kubernetes

Describe the bug.

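I deployed Morpheus from NGC using the Helm charts. Deploying the phishing-bert-onnx model through MLflow fails: the model registers successfully, but Triton cannot load it from the model repository.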

Minimum reproducible example

export API_KEY="<NGC KEY>"
export NAMESPACE="morpheus"
export RELEASE="testing"

helm fetch https://helm.ngc.nvidia.com/nvidia/morpheus/charts/morpheus-ai-engine-24.03.tgz --username='$oauthtoken' --password="${API_KEY}" --untar
helm fetch https://helm.ngc.nvidia.com/nvidia/morpheus/charts/morpheus-mlflow-24.03.tgz --username='$oauthtoken' --password="${API_KEY}" --untar
helm fetch https://helm.ngc.nvidia.com/nvidia/morpheus/charts/morpheus-sdk-client-24.03.tgz --username='$oauthtoken' --password="${API_KEY}" --untar

helm install --set ngc.apiKey="${API_KEY}" --namespace "${NAMESPACE}" "${RELEASE}-engine" morpheus-ai-engine
helm install --set ngc.apiKey="${API_KEY}" --namespace "${NAMESPACE}" "${RELEASE}-helper" morpheus-sdk-client

(once the pods are in the Running state)

kubectl -n "${NAMESPACE}" exec "sdk-cli-${RELEASE}-helper" -- cp -RL /workspace/models /common

helm install --set ngc.apiKey="${API_KEY}" --namespace "${NAMESPACE}" "${RELEASE}-mlflow" morpheus-mlflow

kubectl -n "${NAMESPACE}" exec -it deploy/mlflow -- bash

python publish_model_to_mlflow.py \
      --model_name phishing-bert-onnx \
      --model_directory /common/models/triton-model-repo/phishing-bert-onnx \
      --flavor triton

mlflow deployments create -t triton \
      --flavor triton \
      --name phishing-bert-onnx \
      -m models:/phishing-bert-onnx/1 \
      -C "version=1"
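
Before running `mlflow deployments create`, it may help to confirm that the `cp -RL` step above produced the directory layout Triton expects. The following is a minimal sketch, not part of Morpheus: `summarize_model_dir` is a hypothetical helper, and it assumes the usual Triton repository layout of a `config.pbtxt` next to numeric version subdirectories.

```python
from pathlib import Path


def summarize_model_dir(model_dir):
    """Return a small summary of a Triton-style model directory.

    Hypothetical debugging helper; it only inspects the filesystem.
    """
    root = Path(model_dir)
    versions = {}
    if root.is_dir():
        for child in sorted(root.iterdir()):
            # Version directories are assumed to have integer names, e.g. "1".
            if child.is_dir() and child.name.isdigit():
                versions[child.name] = sorted(p.name for p in child.iterdir())
    return {
        "exists": root.is_dir(),
        "has_config": (root / "config.pbtxt").is_file(),
        "versions": versions,
    }


if __name__ == "__main__":
    # Path taken from the publish step above.
    print(summarize_model_dir("/common/models/triton-model-repo/phishing-bert-onnx"))
```

Running this inside the mlflow pod against the `--model_directory` path shows whether a config file and at least one version directory are present before the model is published.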

Relevant log output

Triton Logs

E0604 16:12:24.504123 1 model_repository_manager.cc:1335] Poll failed for model directory 'phishing-bert-onnx': Invalid model name: Could not determine backend for model 'phishing-bert-onnx' with no backend in model configuration. Expected model name of the form 'model.<backend_name>'.
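
The Triton message says that no backend could be determined from the model configuration. One way to check whether a given `config.pbtxt` declares one is sketched below. This is a heuristic under stated assumptions, not Triton's own logic: `declared_backend` is a hypothetical helper that scans the text for a `backend` or `platform` field with a regular expression instead of parsing the protobuf text format.

```python
import re

# Matches lines such as:  backend: "onnxruntime"  or  platform: "onnxruntime_onnx"
_FIELD = re.compile(r'^\s*(backend|platform)\s*:\s*"([^"]*)"', re.MULTILINE)


def declared_backend(config_text):
    """Return (field, value) for the first non-empty backend/platform entry, or None."""
    for match in _FIELD.finditer(config_text):
        if match.group(2):
            return match.group(1), match.group(2)
    return None
```

If this returns `None` for the `config.pbtxt` in the deployed model directory, or if the file is absent altogether, that would be consistent with the error above.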

Deployment Creation Logs

Successfully registered model 'phishing-bert-onnx'.
Created version '1' of model 'phishing-bert-onnx'.
/mlflow/artifacts/0/4281c565f9ef489880c9940e35992f54/artifacts
Saved mlflow-meta.json to /common/triton-model-repo/phishing-bert-onnx
Traceback (most recent call last):
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/mlflow_triton/deployments.py", line 115, in create_deployment
    self.triton_client.load_model(name)
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/tritonclient/http/_client.py", line 669, in load_model
    _raise_if_error(response)
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/tritonclient/http/_utils.py", line 69, in _raise_if_error
    raise error
tritonclient.utils.InferenceServerException: [500] failed to load 'phishing-bert-onnx', failed to poll from model repository

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/mlflow/bin/mlflow", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/mlflow/deployments/cli.py", line 151, in create_deployment
    deployment = client.create_deployment(name, model_uri, flavor, config=config_dict)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/mlflow/lib/python3.11/site-packages/mlflow_triton/deployments.py", line 117, in create_deployment
    raise MlflowException(str(ex))
mlflow.exceptions.MlflowException: [500] failed to load 'phishing-bert-onnx', failed to poll from model repository
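
The deployment log shows metadata being written to `/common/triton-model-repo/phishing-bert-onnx`, while the model was published from `/common/models/triton-model-repo/phishing-bert-onnx`. To see whether any files failed to reach the deployed directory, the two trees can be compared. This is a minimal sketch; `relative_files` and `missing_in_deployed` are hypothetical helpers, and the paths are the ones that appear in this report.

```python
from pathlib import Path


def relative_files(root):
    """Return the set of file paths under root, relative to root."""
    base = Path(root)
    if not base.is_dir():
        return set()
    return {p.relative_to(base).as_posix() for p in base.rglob("*") if p.is_file()}


def missing_in_deployed(source_dir, deployed_dir):
    """List files present in source_dir but absent from deployed_dir."""
    return sorted(relative_files(source_dir) - relative_files(deployed_dir))


if __name__ == "__main__":
    print(missing_in_deployed(
        "/common/models/triton-model-repo/phishing-bert-onnx",  # publish source
        "/common/triton-model-repo/phishing-bert-onnx",         # deployed copy
    ))
```

Any path printed here exists in the published source but not in the directory Triton polls.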

Full env printout

No response

Other/Misc.

No response

Code of Conduct