opendatahub-io / modelmesh-serving

Controller for ModelMesh
Apache License 2.0
3 stars 32 forks source link

[RHOAIENG-1051] - update opendatahub folder to kustomize #280

Closed spolti closed 6 months ago

spolti commented 8 months ago

chore: opendatahub folder have

How to test:

PR checklist

Checklist items below are applicable for development targeted to both fast and stable branches/tags

Checklist items below are applicable for development targeted to both fast and stable branches/tags

Jooho commented 7 months ago

@spolti I followed the description and hit errors:

./deploy.sh: line 4: /tmp/modelmesh-serving/opendatahub/quickstart/scripts/utils.sh: No such file or directory
make: *** No rule to make target 'deploy-mm-for-odh'.  Stop.
Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
No resources found
Error from server (NotFound): namespaces "model-serving" not found
Now using project "model-serving" on server "https://api.jooho.n1ai.p3.openshiftapps.com:443".

You can add applications to this project with the 'new-app' command. For example, try:

    oc new-app rails-postgresql-example

to build a new example application in Ruby. Or use kubectl to deploy a simple Kubernetes application:

    kubectl create deployment hello-node --image=registry.k8s.io/e2e-test-images/agnhost:2.43 -- /agnhost serve-hostname

error: the path "/tmp/modelmesh-serving/opendatahub/quickstart/basic/common_manifests/openvino-serving-runtime.yaml" does not exist
error: the path "/tmp/modelmesh-serving/opendatahub/quickstart/basic/common_manifests/openvino-inference-service.yaml" does not exist
./deploy.sh: line 28: wait_for_pods_ready: command not found
./deploy.sh: line 30: success: command not found
Jooho commented 7 months ago

@spolti Basic folder is only one to test or are there any other folders?

spolti commented 7 months ago

@Jooho it should be working now, not sure why it happened, but was able to run previously locally.

Jooho commented 7 months ago

now, it removed the modelmesh-serving folder itself.

  [RHOAIENG-1051](https://issues.redhat.com//browse/RHOAIENG-1051)  /tmp/modelmesh-serving/opendatahub/quickstart/basic                                                                   15:57:06  jooho 
❯ ./deploy.sh 
./opendatahub/scripts/install_odh.sh 
.. Downloading binaries
.. Creating a bin folder
yq already installed.
Installing kustomize.
curl: (22) The requested URL returned error: 404
tar: /tmp/kustomize.tar.gz: Cannot open: No such file or directory
tar: Error is not recoverable: exiting now
mv: cannot stat '/tmp/kustomize': No such file or directory
rm: cannot remove '/tmp/kustomize.tar.gz': No such file or directory
installed kustomize version v5.3.0
Delete the exising /tmp/modelmesh-e2e folder
Creating a /tmp/modelmesh-e2e folder
Already on project "opendatahub" on server "https://api.jooho.n1ai.p3.openshiftapps.com:443".
.. Archiving odh-manifests
cp: cannot stat '/tmp/modelmesh-serving/opendatahub/scripts/../../config': No such file or directory
cp: cannot stat '/tmp/modelmesh-serving/opendatahub/scripts/manifests': No such file or directory
Latest Manifest will be used, fast tag
.. Deploying ModelMesh with kustomize
./opendatahub/scripts/install_odh.sh: line 158: /tmp/modelmesh-serving/config/overlays/odh/params.env: No such file or directory
params.env:
cat: /tmp/modelmesh-serving/config/overlays/odh/params.env: No such file or directory

installing namespaced rbac
2024/03/19 15:57:09 unable to make loader at '.'; not a valid directory: abs path error on '.' : getwd: no such file or directory
error: no objects passed to apply
cat: /tmp/modelmesh-serving/config/namespace-runtimes/kustomization.yaml: No such file or directory
2024/03/19 15:57:09 unable to make loader at '.'; not a valid directory: abs path error on '.' : getwd: no such file or directory
^Cmake: *** [Makefile:297: deploy-mm-for-odh] Interrupt

cd: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
cd: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
cd: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
cd: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
cd: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
cd: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory

please double check rm command.

Jooho commented 6 months ago

/retest

@spolti With this pr, what should I test? 3 quickstarts ?

anything else?

spolti commented 6 months ago

only these 3 should be fine.

Jooho commented 6 months ago

basic/hpa worked but pvc failed.

modelmesh-controller failed

{"level":"info","ts":"2024-04-02T20:12:53Z","logger":"setup","msg":"MMesh Configuration","serviceName":"modelmesh-serving","port":8033,"mmeshEndpoint":""}
{"level":"error","ts":"2024-04-02T20:12:53Z","logger":"setup","msg":"In cluster scope mode but controller does not have cluster scope permissions, exiting","stacktrace":"main.main\n\t/go/src/github.com/opendatahub-io/modelmesh-serving/main.go:210\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:250"}
VedantMahabaleshwarkar commented 6 months ago

Is this quickstart still deploying model-mesh? The current issue with these set of quickstarts, in my opinion, is that the deploy.sh script looks to install model-mesh, which is not what I have found I needed whenever I looked to these quickstarts. Most of the time I find myself looking at these quickstarts, I already have an ODH installation, and what I'm looking for is a 1 line command to deploy a sample inferenceservice. IMO, these quickstarts should operate under the assumption that ODH is already installed, and only focus on deploying the models and performing inference. Or, at least there should be an option for this in addition to what already exists.

WDYT @Jooho @israel-hdez @spolti @danielezonca

spolti commented 6 months ago

FVT is working fine, just one test is failing, seems to be a issue with it:

Scaling of runtime deployments with HPA Autoscaler when there are no predictors Scale all runtimes down after a created test predictor is deleted
/Users/fspolti/data/dev/sources/modelmesh-serving/fvt/hpa/hpa_test.go:149
  2024-04-05T17:45:14-03:00 INFO    Delete all predictors ...
  2024-04-05T17:45:17-03:00 INFO    Watcher got event with object   {"name": "modelmesh-serving-mlserver-1.x", "replicas": 0, "available": 0, "updated": 0}
  2024-04-05T17:45:17-03:00 INFO    deployStatusesReady: map[modelmesh-serving-mlserver-1.x:true modelmesh-serving-ovms-1.x:false modelmesh-serving-torchserve-0.x:false modelmesh-serving-triton-2.x:false]
  2024-04-05T17:45:17-03:00 INFO    Watcher got event with object   {"name": "modelmesh-serving-ovms-1.x", "replicas": 0, "available": 0, "updated": 0}
  2024-04-05T17:45:17-03:00 INFO    deployStatusesReady: map[modelmesh-serving-mlserver-1.x:true modelmesh-serving-ovms-1.x:true modelmesh-serving-torchserve-0.x:false modelmesh-serving-triton-2.x:false]
  2024-04-05T17:45:17-03:00 INFO    Watcher got event with object   {"name": "modelmesh-serving-torchserve-0.x", "replicas": 0, "available": 0, "updated": 0}
  2024-04-05T17:45:17-03:00 INFO    deployStatusesReady: map[modelmesh-serving-mlserver-1.x:true modelmesh-serving-ovms-1.x:true modelmesh-serving-torchserve-0.x:true modelmesh-serving-triton-2.x:false]
  2024-04-05T17:45:17-03:00 INFO    Watcher got event with object   {"name": "modelmesh-serving-triton-2.x", "replicas": 0, "available": 0, "updated": 0}
  2024-04-05T17:45:17-03:00 INFO    deployStatusesReady: map[modelmesh-serving-mlserver-1.x:true modelmesh-serving-ovms-1.x:true modelmesh-serving-torchserve-0.x:true modelmesh-serving-triton-2.x:true]
  2024-04-05T17:45:17-03:00 INFO    All deployments are ready: map[modelmesh-serving-mlserver-1.x:true modelmesh-serving-ovms-1.x:true modelmesh-serving-torchserve-0.x:true modelmesh-serving-triton-2.x:true]
  2024-04-05T17:45:27-03:00 INFO    Timed out after 10s without events
  STEP: Creating a test predictor for one Runtime @ 04/05/24 17:45:27.361
  STEP: Creating predictor mlserver-sklearn-mnist-svm-gpqgm @ 04/05/24 17:45:27.361
  STEP: Waiting for predictor mlserver-sklearn-mnist-svm-gpqgm to be 'Loaded' @ 04/05/24 17:45:27.665
  2024-04-05T17:45:27-03:00 INFO    Watcher got event with object   {"name": "mlserver-sklearn-mnist-svm-gpqgm", "status.available": false, "status.activeModelState": "Pending", "status.targetModelState": "", "status.transitionStatus": "UpToDate", "status.lastFailureInfo": null}
  2024-04-05T17:45:27-03:00 INFO    Watcher got event with object   {"name": "mlserver-sklearn-mnist-svm-gpqgm", "status.available": false, "status.activeModelState": "Pending", "status.targetModelState": "", "status.transitionStatus": "UpToDate", "status.lastFailureInfo": {"message":"Waiting for runtime Pod to become available","modelId":"mlserver-sklearn-mnist-svm-gpqgm__ksp-b20a0c5aca","reason":"RuntimeUnhealthy"}}

  [FAILED] in [It] - /Users/fspolti/data/dev/sources/modelmesh-serving/fvt/helpers.go:355 @ 04/05/24 17:47:27.664
  2024-04-05T17:47:27-03:00 INFO    Running command {"args": "kubectl get predictors -n model-serving"}
  =====================================================================================================================================
  NAME                               TYPE      AVAILABLE   ACTIVEMODEL   TARGETMODEL   TRANSITION   AGE
  mlserver-sklearn-mnist-svm-gpqgm   sklearn   false       Pending                     UpToDate     2m3s
israel-hdez commented 6 months ago

Is this quickstart still deploying model-mesh? The current issue with these set of quickstarts, in my opinion, is that the deploy.sh script looks to install model-mesh, which is not what I have found I needed whenever I looked to these quickstarts. Most of the time I find myself looking at these quickstarts, I already have an ODH installation, and what I'm looking for is a 1 line command to deploy a sample inferenceservice. IMO, these quickstarts should operate under the assumption that ODH is already installed, and only focus on deploying the models and performing inference. Or, at least there should be an option for this in addition to what already exists.

WDYT @Jooho @israel-hdez @spolti @danielezonca

Note that I haven't reviewed the code and I think I don't have full context. So, just trying to answer...

In most projects I have played with, quickstarts assume a clean environment and install everything to quickly give you a working setup. Also, usually quickstarts are for trying the project (i.e. non-production, demos). Because of this, a quickstart don't let you customize the setup (that's left to the official installer).

I would agree that deploying a sample model should be left to a different script than the quickstart setup, although it could be part of the same doc page (probably arranged like a tutorial).

That said, I also didn't use too much the quickstarts because what I personally remember is that, rather than providing you with a setup that you can play with, I think it prepared the env more like a demo that is also suited for running FVTs/CI... and I had to spend time "cleaning" the env. This is different from your case: you just want to deploy a sample model on an existent setup, while I wanted to quickly have a base setup (without additional stuff) to try my own models. ... but don't trust me about this (I may be remembering incorrectly why I didn't use the quickstarts that much).

openshift-ci[bot] commented 6 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jooho, spolti

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/opendatahub-io/modelmesh-serving/blob/main/OWNERS)~~ [Jooho,spolti] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment