nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Load Kubeflow Pipelines in RHODS Pipelines #156

Closed Zongshun96 closed 1 year ago

Zongshun96 commented 1 year ago

Description

It seems the Kubeflow Pipelines SDK cannot be used with RHODS Pipelines at the moment: the Kubeflow pipeline endpoint is not exposed.

Forwarding the ds-pipeline-pipelines-definition service in my OpenShift project (namespace) didn't solve the problem, as my code (kfp_tekton with my bearer token, adapted from here) fails with certificate verify failed. It is also not safe to simply forward the service.

Trevor Royer commented that the problem could be that "the container likely has a cert built into it that is self signed so your cert verification fails."

Proposed Solution

Trevor suggested adding a route for the oauth port in the service, e.g., oc create route reencrypt --service=dsp-def-service --port=oauth. He mentioned that "you can setup a route and the cluster will create a new cert that is already trusted for you."

[screenshot]
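
For context, a reencrypt route like the one suggested above would look roughly like the manifest below. This is a sketch only: dsp-def-service is the example name from the command, while in our project the actual service is ds-pipeline-pipelines-definition.

    # Rough YAML equivalent of `oc create route reencrypt --service=... --port=oauth`;
    # names are placeholders based on the discussion above.
    apiVersion: route.openshift.io/v1
    kind: Route
    metadata:
      name: ds-pipeline-pipelines-definition
      namespace: <your-project>
    spec:
      to:
        kind: Service
        name: ds-pipeline-pipelines-definition
      port:
        targetPort: oauth
      tls:
        termination: reencrypt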

He also mentioned some workarounds. While I think we need a permanent fix for this problem, I am listing them here for the record.

Reproducibility

kfp_tekton~=1.5.0
https://github.com/adrien-legros/rhods-mnist/blob/main/docs/lab-instructions.md#add-a-pipeline-step
https://github.com/rh-datascience-and-edge-practice/kubeflow-examples/blob/main/pipelines/1_test_connection.py

Notes

Zongshun96 commented 1 year ago

Now we need to figure out how to correctly configure the TLS params for the routes.

Following this tutorial (https://www.redhat.com/sysadmin/cert-manager-operator-openshift), we made sure that cert-manager is running, and I was able to create an Issuer in my namespace.

Then we tried to follow the tutorial to manually generate a certificate and add the resulting secret to my route. There were two issues here. First, I could not create a certificate using the ClusterIssuer; the certificate stays in the issuing stage forever. Second, although I could create a certificate with the Issuer in my namespace, after adding the TLS information from the certificate's secret to my route, I still saw the same certificate verify failed error.
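
For reference, the Issuer and Certificate used in these attempts looked roughly like the sketch below (resource names, the DNS name, and the secret name are illustrative placeholders, not the exact values from the screenshots):

    # Sketch of a namespace-scoped self-signed Issuer plus a Certificate issued
    # from it; cert-manager writes the cert and key into spec.secretName.
    apiVersion: cert-manager.io/v1
    kind: Issuer
    metadata:
      name: selfsigned-issuer
      namespace: <your-project>
    spec:
      selfSigned: {}
    ---
    apiVersion: cert-manager.io/v1
    kind: Certificate
    metadata:
      name: ds-pipeline-cert
      namespace: <your-project>
    spec:
      secretName: ds-pipeline-cert-tls
      dnsNames:
        - <hostname of the ds-pipeline route>
      issuerRef:
        name: selfsigned-issuer
        kind: Issuer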

Try to create ClusterIssuer

[screenshots]

Using Issuer in my namespace

[screenshots]

Next Attempt

Can we apply cert-manager-openshift-routes in the test cluster? It should generate the certificate and populate my routes automatically.
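
If that add-on were installed, the idea (per its usage docs) is that we would only need to annotate the route and it would fill in the TLS fields for us. A rough fragment to merge into the route's metadata is sketched below; the annotation keys are my assumption and should be checked against the project's README.

    # Sketch only: with cert-manager-openshift-routes installed, annotating the
    # route (same Route as in the earlier sketch) should be enough for the
    # add-on to populate spec.tls. Confirm the exact annotation keys against
    # https://github.com/cert-manager/openshift-routes#usage.
    metadata:
      annotations:
        cert-manager.io/issuer-name: selfsigned-issuer
        cert-manager.io/issuer-kind: Issuer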

Zongshun96 commented 1 year ago

It turns out that even without cert-manager-openshift-routes we can still manually set up a certificate for the route and connect to the KFP endpoint. While I think having this done automatically in the test cluster would be nice, I have listed the manual steps below. Thank you for all the help from Trevor and Dylan!

Steps

  1. Generate a Certificate with the ClusterIssuer (named selfsigned in the NERC test cluster). cert-manager will create the corresponding secret in the same namespace/project. [screenshots]

  2. Copy & paste the cert and private key into your route, and configure the route with spec.tls.termination: reencrypt (see the sketch after this list). [screenshot]

  3. Add the certificate to your kfp client in Python (kfp_tekton==1.5.0). Note: with Python 3.10 there are issues with certain versions of pyyaml and urllib3. Try python3 -m pip install kfp_tekton==1.5.0 pyyaml==5.3.1 urllib3==1.26.15 requests-toolbelt==0.10.1 kubernetes

    import kfp_tekton

    # kubeflow_endpoint and bearer_token are defined elsewhere in the script
    client = kfp_tekton.TektonClient(
        host=kubeflow_endpoint,
        existing_token=bearer_token,
        ssl_ca_cert='/home/ubuntu/Praxi-Pipeline/ca.crt',  # CA cert from step 2
    )
  4. Now your kfp client should be able to connect to the KFP endpoint. [screenshot] https://github.com/rh-datascience-and-edge-practice/kubeflow-examples/blob/main/pipelines/1_test_connection.py
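
As referenced in step 2, the TLS block of the route ends up looking roughly like this; the PEM blocks are the tls.crt and tls.key values from the secret created in step 1 and are shown here only as placeholders.

    # Sketch of the route's TLS section after step 2; paste the PEM data from
    # the certificate's secret (tls.crt / tls.key) into these fields.
    spec:
      tls:
        termination: reencrypt
        certificate: |
          -----BEGIN CERTIFICATE-----
          <tls.crt from the certificate secret>
          -----END CERTIFICATE-----
        key: |
          -----BEGIN PRIVATE KEY-----
          <tls.key from the certificate secret>
          -----END PRIVATE KEY-----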

dystewart commented 1 year ago

In the meantime we should document this as a solution. I also agree it would be nice to have an automated way of doing this. Nice work @Zongshun96!

Zongshun96 commented 1 year ago

Problem Description

I am facing a new error when deploying a kfp pipeline with intermediate data. The PVC is attached to the pod as a volume, but the container cannot mount that volume. It seems the cephfs/rbd plugin pod is not working correctly(?). https://github.com/rook/rook/issues/4896#issuecomment-610186299

Reproducibility 1

Deploying the pipeline below. https://github.com/rh-datascience-and-edge-practice/kubeflow-examples/blob/0b3b0f837b1b7ea988e0c9242ca016dfec9f2bd6/pipelines/11_iris_training_pipeline.py

Error Logs

[screenshots]

Reproducibility 2

Deploying a single busybox container pod with a PVC also shows the same error.

interm-pvc.yaml

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: interm-pvc
  namespace: praxi 
  # labels:
  #   app: snapshot
spec:
  storageClassName: ocs-external-storagecluster-ceph-rbd
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

fake-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: snapshot-fake-deployment
  namespace: praxi 
spec:
  replicas: 1
  selector:
    matchLabels:
      app: snapshot
  template:
    metadata:
      labels:
        app: snapshot
    spec:
      containers:
        - name: snapshot-fake
          image: busybox:latest
          imagePullPolicy: "IfNotPresent"
          # busybox does not ship bash, so use sh here
          command: [ "/bin/sh", "-c", "--" ]
          args: [ "while true; do sleep 30; done;" ]
          volumeMounts:
            - mountPath: /fake-snapshot
              name: snapshot-vol1
      volumes:
        - name: snapshot-vol1
          persistentVolumeClaim:
            claimName: interm-pvc

Error Logs

[screenshots]
Zongshun96 commented 1 year ago

The storage class issue was fixed by enforcing node affinity to avoid the newly introduced GPU nodes. It seems some permissions there have not been set up yet. https://github.com/OCP-on-NERC/operations/issues/170

For now my fix is to apply node affinity to my components. The following is an example of applying add_affinity to the generate_loadmod_op component.

    import kubernetes.client

    # Create affinity objects that keep the components off the GPU nodes
    # (wrk-10 and wrk-11).
    terms = kubernetes.client.models.V1NodeSelectorTerm(
        match_expressions=[
            {'key': 'kubernetes.io/hostname',
             'operator': 'NotIn',
             'values': ["wrk-10", "wrk-11"]}
        ]
    )
    node_selector = kubernetes.client.models.V1NodeSelector(node_selector_terms=[terms])
    node_affinity = kubernetes.client.models.V1NodeAffinity(
        required_during_scheduling_ignored_during_execution=node_selector
    )
    affinity = kubernetes.client.models.V1Affinity(node_affinity=node_affinity)

    # generate_loadmod_op and use_image_pull_policy come from the pipeline code
    model = generate_loadmod_op().apply(use_image_pull_policy()).add_affinity(affinity)

A working pipeline is shown here. https://github.com/ai4cloudops/Praxi-Pipeline/blob/7fac19b79ac56f41b098d5adb380a510038f3ddf/Praxi-Pipeline-xgb.py

Some useful pointers to recall

kfp_tekton~=1.5.0
https://github.com/adrien-legros/rhods-mnist/blob/main/docs/lab-instructions.md#add-a-pipeline-step
https://github.com/rh-datascience-and-edge-practice/kubeflow-examples/tree/0b3b0f837b1b7ea988e0c9242ca016dfec9f2bd6/pipelines
https://github.com/cert-manager/openshift-routes#usage

Thank you!

Zongshun96 commented 1 year ago

It seems the AWS access key in the mlpipeline-minio-artifact secret is not updated automatically to reflect changes (a new AWS access key and secret) in the data connection. This causes pods to fail with the following error.

ubuntu@test-retrieving-logs:~/Praxi-Pipeline$ oc logs submitted-pipeline-4c585-generate-changesets-pod -c step-copy-artifacts
tar: Removing leading `/' from member names
/tekton/home/tep-results/args
upload failed: ./args.tgz to s3://rhods-data-connection/artifacts/submitted-pipeline-4c585/generate-changesets/args.tgz An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
tar: Removing leading `/' from member names
/tekton/home/tep-results/cs
upload failed: ./cs.tgz to s3://rhods-data-connection/artifacts/submitted-pipeline-4c585/generate-changesets/cs.tgz An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.

The solution is to manually update the mlpipeline-minio-artifact secret with the new AWS access key and secret. This can be done through the OpenShift console.
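
For the record, the manual update amounts to something like the manifest below. The accesskey / secretkey key names are an assumption based on the usual Kubeflow Pipelines minio artifact secret; check the data keys of the existing mlpipeline-minio-artifact secret before applying anything like this.

    # Sketch only: overwrite the stale credentials in the existing secret.
    # Verify the actual data keys in mlpipeline-minio-artifact first.
    apiVersion: v1
    kind: Secret
    metadata:
      name: mlpipeline-minio-artifact
      namespace: <your-project>
    type: Opaque
    stringData:
      accesskey: <new AWS access key id>
      secretkey: <new AWS secret access key>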