nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Load Kubeflow Pipelines in RHODS Pipelines #156

Closed Zongshun96 closed 1 year ago

Zongshun96 commented 1 year ago

Description

It seems the Kubeflow Pipelines SDK cannot be used with RHODS Pipelines at the moment: the Kubeflow pipeline endpoint is not exposed.

Forwarding the ds-pipeline-pipelines-definition service in my OpenShift project (namespace) didn't solve the problem, as my code (kfp_tekton with my bearer token, adapted from here) fails with certificate verify failed. It is also not safe to simply forward the service.

Trevor Royer commented that the problem could be that "the container likely has a cert built into it that is self signed so your cert verification fails."

Proposed Solution

Trevor suggested adding a route for the oauth port in the service, e.g., oc create route reencrypt --service=dsp-def-service --port=oauth. He mentioned that "you can setup a route and the cluster will create a new cert that is already trusted for you."

[screenshot]
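
For context, a reencrypt route like the one suggested above would look roughly like the manifest below. This is a sketch only: dsp-def-service is the example name from the command, while in our project the actual service is ds-pipeline-pipelines-definition.

    # Rough YAML equivalent of `oc create route reencrypt --service=... --port=oauth`;
    # names are placeholders based on the discussion above.
    apiVersion: route.openshift.io/v1
    kind: Route
    metadata:
      name: ds-pipeline-pipelines-definition
      namespace: <your-project>
    spec:
      to:
        kind: Service
        name: ds-pipeline-pipelines-definition
      port:
        targetPort: oauth
      tls:
        termination: reencrypt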

He also mentioned some workarounds. While I think we need a permanent fix for this problem, I am listing them here for the record.

Reproducibility

kfp_tekton~=1.5.0
https://github.com/adrien-legros/rhods-mnist/blob/main/docs/lab-instructions.md#add-a-pipeline-step
https://github.com/rh-datascience-and-edge-practice/kubeflow-examples/blob/main/pipelines/1_test_connection.py

Notes

Zongshun96 commented 1 year ago

Now we need to figure out how to correctly configure the TLS params for the routes.

Following this tutorial (https://www.redhat.com/sysadmin/cert-manager-operator-openshift), we made sure that cert-manager is running, and I was able to create an Issuer in my namespace.

Then we tried to follow the tutorial to manually generate a certificate and add the resulting secret to my route. There were two issues here. First, I could not create a certificate using the ClusterIssuer; the certificate stays in the issuing stage forever. Second, although I could create a certificate with the Issuer in my namespace, after adding the TLS information from the certificate's secret to my route, I still saw the same certificate verify failed error.
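
For reference, the Issuer and Certificate used in these attempts looked roughly like the sketch below (resource names, the DNS name, and the secret name are illustrative placeholders, not the exact values from the screenshots):

    # Sketch of a namespace-scoped self-signed Issuer plus a Certificate issued
    # from it; cert-manager writes the cert and key into spec.secretName.
    apiVersion: cert-manager.io/v1
    kind: Issuer
    metadata:
      name: selfsigned-issuer
      namespace: <your-project>
    spec:
      selfSigned: {}
    ---
    apiVersion: cert-manager.io/v1
    kind: Certificate
    metadata:
      name: ds-pipeline-cert
      namespace: <your-project>
    spec:
      secretName: ds-pipeline-cert-tls
      dnsNames:
        - <hostname of the ds-pipeline route>
      issuerRef:
        name: selfsigned-issuer
        kind: Issuer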

Try to create ClusterIssuer

[screenshots]

Using Issuer in my namespace

[screenshots]

Next Attempt

Can we apply cert-manager-openshift-routes in the test cluster? It should generate the certificate and populate my routes automatically.
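
If that add-on were installed, the idea (per its usage docs) is that we would only need to annotate the route and it would fill in the TLS fields for us. A rough fragment to merge into the route's metadata is sketched below; the annotation keys are my assumption and should be checked against the project's README.

    # Sketch only: with cert-manager-openshift-routes installed, annotating the
    # route (same Route as in the earlier sketch) should be enough for the
    # add-on to populate spec.tls. Confirm the exact annotation keys against
    # https://github.com/cert-manager/openshift-routes#usage.
    metadata:
      annotations:
        cert-manager.io/issuer-name: selfsigned-issuer
        cert-manager.io/issuer-kind: Issuer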

Zongshun96 commented 1 year ago

It turns out that even without cert-manager-openshift-routes we can still manually set up a certificate for the route and connect to the KFP endpoint. While I think having this done automatically in the test cluster would be nice, I have listed the manual steps below. Thank you for all the help from Trevor and Dylan!

Steps

  1. Generate a Certificate with the ClusterIssuer (named selfsigned in the NERC test cluster). cert-manager will create the corresponding secret in the same namespace/project. [screenshots]

  2. Copy & paste the cert and private key into your route, and configure the route with spec.tls.termination: reencrypt (see the sketch after this list). [screenshot]

  3. Add the certificate to your kfp client in Python (kfp_tekton==1.5.0). Note: with Python 3.10 there are issues with certain versions of pyyaml and urllib3. Try python3 -m pip install kfp_tekton==1.5.0 pyyaml==5.3.1 urllib3==1.26.15 requests-toolbelt==0.10.1 kubernetes

    import kfp_tekton

    # kubeflow_endpoint and bearer_token are defined elsewhere in the script
    client = kfp_tekton.TektonClient(
        host=kubeflow_endpoint,
        existing_token=bearer_token,
        ssl_ca_cert='/home/ubuntu/Praxi-Pipeline/ca.crt',  # CA cert from step 2
    )
  4. Now your kfp client should be able to connect to the KFP endpoint. [screenshot] https://github.com/rh-datascience-and-edge-practice/kubeflow-examples/blob/main/pipelines/1_test_connection.py
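
As referenced in step 2, the TLS block of the route ends up looking roughly like this; the PEM blocks are the tls.crt and tls.key values from the secret created in step 1 and are shown here only as placeholders.

    # Sketch of the route's TLS section after step 2; paste the PEM data from
    # the certificate's secret (tls.crt / tls.key) into these fields.
    spec:
      tls:
        termination: reencrypt
        certificate: |
          -----BEGIN CERTIFICATE-----
          <tls.crt from the certificate secret>
          -----END CERTIFICATE-----
        key: |
          -----BEGIN PRIVATE KEY-----
          <tls.key from the certificate secret>
          -----END PRIVATE KEY-----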

dystewart commented 1 year ago

In the meantime we should document this as a solution. I also agree it would be nice to have an automated way of doing this. Nice work @Zongshun96!

Zongshun96 commented 1 year ago

Problem Description

I am facing a new error when deploying a kfp pipeline with intermediate data. The PVC is attached to the pod as a volume, but the container cannot mount that volume. It seems the cephfs/rbd plugin pod is not working correctly(?). https://github.com/rook/rook/issues/4896#issuecomment-610186299

Reproducibility 1

Deploying the pipeline below. https://github.com/rh-datascience-and-edge-practice/kubeflow-examples/blob/0b3b0f837b1b7ea988e0c9242ca016dfec9f2bd6/pipelines/11_iris_training_pipeline.py

Error Logs

[screenshots]

Reproducibility 2

Deploying a single busybox container pod with a PVC also shows the same error.

interm-pvc.yaml

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: interm-pvc
  namespace: praxi 
  # labels:
  #   app: snapshot
spec:
  storageClassName: ocs-external-storagecluster-ceph-rbd
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

fake-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: snapshot-fake-deployment
  namespace: praxi 
spec:
  replicas: 1
  selector:
    matchLabels:
      app: snapshot
  template:
    metadata:
      labels:
        app: snapshot
    spec:
      containers:
        - name: snapshot-fake
          image: busybox:latest
          imagePullPolicy: "IfNotPresent"
          # busybox does not ship bash, so use sh here
          command: [ "/bin/sh", "-c", "--" ]
          args: [ "while true; do sleep 30; done;" ]
          volumeMounts:
            - mountPath: /fake-snapshot
              name: snapshot-vol1
      volumes:
        - name: snapshot-vol1
          persistentVolumeClaim:
            claimName: interm-pvc

Error Logs

[screenshots]
Zongshun96 commented 1 year ago

The storage class issue was fixed by enforcing node affinity to avoid the newly introduced GPU nodes. It seems some permissions there have not been set up yet. https://github.com/OCP-on-NERC/operations/issues/170

For now my fix is to apply node affinity to my components. The following is an example of applying add_affinity to the generate_loadmod_op component.

    import kubernetes.client

    # Create affinity objects that keep the components off the GPU nodes
    # (wrk-10 and wrk-11).
    terms = kubernetes.client.models.V1NodeSelectorTerm(
        match_expressions=[
            {'key': 'kubernetes.io/hostname',
             'operator': 'NotIn',
             'values': ["wrk-10", "wrk-11"]}
        ]
    )
    node_selector = kubernetes.client.models.V1NodeSelector(node_selector_terms=[terms])
    node_affinity = kubernetes.client.models.V1NodeAffinity(
        required_during_scheduling_ignored_during_execution=node_selector
    )
    affinity = kubernetes.client.models.V1Affinity(node_affinity=node_affinity)

    # generate_loadmod_op and use_image_pull_policy come from the pipeline code
    model = generate_loadmod_op().apply(use_image_pull_policy()).add_affinity(affinity)

A working pipeline is shown here. https://github.com/ai4cloudops/Praxi-Pipeline/blob/7fac19b79ac56f41b098d5adb380a510038f3ddf/Praxi-Pipeline-xgb.py

Some useful pointers to recall

kfp_tekton~=1.5.0
https://github.com/adrien-legros/rhods-mnist/blob/main/docs/lab-instructions.md#add-a-pipeline-step
https://github.com/rh-datascience-and-edge-practice/kubeflow-examples/tree/0b3b0f837b1b7ea988e0c9242ca016dfec9f2bd6/pipelines
https://github.com/cert-manager/openshift-routes#usage

Thank you!

Zongshun96 commented 1 year ago

It seems the AWS access key in the mlpipeline-minio-artifact secret is not updated automatically to reflect changes (a new AWS access key and secret) in the data connection. This causes pods to fail with the following error.

ubuntu@test-retrieving-logs:~/Praxi-Pipeline$ oc logs submitted-pipeline-4c585-generate-changesets-pod -c step-copy-artifacts
tar: Removing leading `/' from member names
/tekton/home/tep-results/args
upload failed: ./args.tgz to s3://rhods-data-connection/artifacts/submitted-pipeline-4c585/generate-changesets/args.tgz An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
tar: Removing leading `/' from member names
/tekton/home/tep-results/cs
upload failed: ./cs.tgz to s3://rhods-data-connection/artifacts/submitted-pipeline-4c585/generate-changesets/cs.tgz An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.

The solution is to manually update the mlpipeline-minio-artifact secret with the new AWS access key and secret. This can be done through the OpenShift console.
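
For the record, the manual update amounts to something like the manifest below. The accesskey / secretkey key names are an assumption based on the usual Kubeflow Pipelines minio artifact secret; check the data keys of the existing mlpipeline-minio-artifact secret before applying anything like this.

    # Sketch only: overwrite the stale credentials in the existing secret.
    # Verify the actual data keys in mlpipeline-minio-artifact first.
    apiVersion: v1
    kind: Secret
    metadata:
      name: mlpipeline-minio-artifact
      namespace: <your-project>
    type: Opaque
    stringData:
      accesskey: <new AWS access key id>
      secretkey: <new AWS secret access key>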