Now we need to figure out how to correctly configure the TLS parameters for the routes.
Following this tutorial (https://www.redhat.com/sysadmin/cert-manager-operator-openshift), we made sure that cert-manager is running, and I was able to create an `Issuer` in my namespace.
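A minimal self-signed `Issuer` sketch (the name and namespace here are placeholder assumptions, not the exact manifest used):

```yaml
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: selfsigned-issuer
  namespace: praxi
spec:
  selfSigned: {}   # issues self-signed certificates within this namespace
```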
Then we tried to follow this tutorial to manually generate a certificate and add the secret to my route. There were two issues here. First, I cannot create a certificate using the `ClusterIssuer`: the certificate stays in the issuing stage forever. Second, although I can create a certificate with the `Issuer` in my namespace, after adding the TLS information to my route based on the secret generated by the certificate, I still saw the same `certificate verify failed` error.
Can we apply cert-manager-openshift-routes in the test cluster? It should generate the certificate and populate my routes automatically.
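For context, per the project's usage docs, the idea is that routes annotated with a cert-manager issuer get their TLS fields populated automatically, roughly like this sketch (the route name, host, and issuer values are assumptions for illustration):

```yaml
# Sketch: a Route that cert-manager-openshift-routes would reconcile.
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: ds-pipeline-route
  namespace: praxi
  annotations:
    cert-manager.io/issuer-name: selfsigned   # which (Cluster)Issuer to use
    cert-manager.io/issuer-kind: ClusterIssuer
spec:
  host: ds-pipeline.example.com
  to:
    kind: Service
    name: ds-pipeline-pipelines-definition
  tls:
    termination: reencrypt
```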
It turns out that without cert-manager-openshift-routes we can still manually set up a certificate for routes and connect to the KFP endpoint. While I think having this done automatically in the test cluster would be nice, I have the steps to do so manually below. Thank you for all the help from Trevor and Dylan!
1. Generate a `Certificate` with the `ClusterIssuer` (named `selfsigned` in the NERC test cluster). It will create the corresponding secret in the same namespace/project. A manifest sketch is shown after these steps.
2. Copy & paste the cert and private key into your route, and configure the route with `spec.tls.termination: reencrypt` (also sketched below).
3. Add the certificate to your kfp client in Python (`kfp_tekton==1.5.0`).
Note: with Python 3.10 there are issues with the versions of pyyaml and urllib3. Please try `python3 -m pip install kfp_tekton==1.5.0 pyyaml==5.3.1 urllib3==1.26.15 requests-toolbelt==0.10.1 kubernetes`.
```python
import kfp_tekton

# kubeflow_endpoint is the route URL and bearer_token an OpenShift token;
# ssl_ca_cert points at the CA certificate saved locally.
client = kfp_tekton.TektonClient(
    host=kubeflow_endpoint,
    existing_token=bearer_token,
    ssl_ca_cert='/home/ubuntu/Praxi-Pipeline/ca.crt',
)
```
Now, your kfp client should be able to connect to the kfp endpoint. https://github.com/rh-datascience-and-edge-practice/kubeflow-examples/blob/main/pipelines/1_test_connection.py
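For step 1, a `Certificate` manifest along the following lines should work; the resource name, secret name, and DNS name here are illustrative assumptions, not the exact values used:

```yaml
# Sketch of a Certificate issued by the cluster-wide self-signed issuer.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: kfp-route-cert
  namespace: praxi
spec:
  secretName: kfp-route-cert-tls   # cert-manager writes tls.crt/tls.key here
  dnsNames:
    - ds-pipeline.example.com      # the route's hostname
  issuerRef:
    name: selfsigned
    kind: ClusterIssuer
```

For step 2, the cert and key from the generated secret go into the route's TLS stanza, roughly:

```yaml
# Sketch of the route's TLS section (PEM blocks copied from the secret).
spec:
  tls:
    termination: reencrypt
    certificate: |
      -----BEGIN CERTIFICATE-----
      ...
      -----END CERTIFICATE-----
    key: |
      -----BEGIN PRIVATE KEY-----
      ...
      -----END PRIVATE KEY-----
```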
In the meantime we should document this as a solution. I also agree it would be nice to have an automated way of doing this. Nice work @Zongshun96!
I am facing a new error when deploying a KFP pipeline with intermediate data. The PVC is mounted to a volume, but the container cannot mount that volume. It seems the cephfs/rbd plugin pod is not working correctly (?). https://github.com/rook/rook/issues/4896#issuecomment-610186299
This happens when deploying the pipeline below: https://github.com/rh-datascience-and-edge-practice/kubeflow-examples/blob/0b3b0f837b1b7ea988e0c9242ca016dfec9f2bd6/pipelines/11_iris_training_pipeline.py
Deploying a single busybox container pod with a PVC also shows the same error.
interm-pvc.yaml
```yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: interm-pvc
  namespace: praxi
  # labels:
  #   app: snapshot
spec:
  storageClassName: ocs-external-storagecluster-ceph-rbd
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```
fake-deployment.yaml
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: snapshot-fake-deployment
  namespace: praxi
spec:
  replicas: 1
  selector:
    matchLabels:
      app: snapshot
  template:
    metadata:
      labels:
        app: snapshot
    spec:
      containers:
        - name: snapshot-fake
          image: busybox:latest
          imagePullPolicy: "IfNotPresent"
          # busybox ships /bin/sh, not /bin/bash
          command: [ "/bin/sh", "-c", "--" ]
          args: [ "while true; do sleep 30; done;" ]
          volumeMounts:
            - mountPath: /fake-snapshot
              name: snapshot-vol1
      volumes:
        - name: snapshot-vol1
          persistentVolumeClaim:
            claimName: interm-pvc
```
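To surface the mount failure, describing the stuck pod shows the FailedMount/FailedAttachVolume events (the pod name below is hypothetical):

```sh
oc get pods -n praxi
oc describe pod snapshot-fake-deployment-<hash> -n praxi   # check the Events section
```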
The storage class issue was fixed by enforcing node affinity to avoid the newly introduced GPU nodes. It seems some permissions haven't been set up yet. https://github.com/OCP-on-NERC/operations/issues/170
For now my fix is to apply the node affinity to my components. The following is an example of using `add_affinity` on the `generate_loadmod_op` component.
```python
import kubernetes.client

# Create affinity objects: require scheduling away from the GPU nodes.
terms = kubernetes.client.models.V1NodeSelectorTerm(
    match_expressions=[
        {'key': 'kubernetes.io/hostname',
         'operator': 'NotIn',
         'values': ["wrk-10", "wrk-11"]}
    ]
)
node_selector = kubernetes.client.models.V1NodeSelector(node_selector_terms=[terms])
node_affinity = kubernetes.client.models.V1NodeAffinity(
    required_during_scheduling_ignored_during_execution=node_selector
)
affinity = kubernetes.client.models.V1Affinity(node_affinity=node_affinity)

# generate_loadmod_op and use_image_pull_policy come from the pipeline definition.
model = generate_loadmod_op().apply(use_image_pull_policy()).add_affinity(affinity)
```
A working pipeline is shown here. https://github.com/ai4cloudops/Praxi-Pipeline/blob/7fac19b79ac56f41b098d5adb380a510038f3ddf/Praxi-Pipeline-xgb.py
Some useful pointers to recall:
- `kfp_tekton~=1.5.0`
- https://github.com/adrien-legros/rhods-mnist/blob/main/docs/lab-instructions.md#add-a-pipeline-step
- https://github.com/rh-datascience-and-edge-practice/kubeflow-examples/tree/0b3b0f837b1b7ea988e0c9242ca016dfec9f2bd6/pipelines
- https://github.com/cert-manager/openshift-routes#usage
Thank you!
It seems the AWS access key in the `mlpipeline-minio-artifact` secret won't update automatically to reflect changes (a new AWS access key and secret) in the data connection. This problem causes pods to fail with the following error.
```
ubuntu@test-retrieving-logs:~/Praxi-Pipeline$ oc logs submitted-pipeline-4c585-generate-changesets-pod -c step-copy-artifacts
tar: Removing leading `/' from member names
/tekton/home/tep-results/args
upload failed: ./args.tgz to s3://rhods-data-connection/artifacts/submitted-pipeline-4c585/generate-changesets/args.tgz An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
tar: Removing leading `/' from member names
/tekton/home/tep-results/cs
upload failed: ./cs.tgz to s3://rhods-data-connection/artifacts/submitted-pipeline-4c585/generate-changesets/cs.tgz An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
```
The solution is to manually update the `mlpipeline-minio-artifact` secret with the new AWS access key and secret. This can be done through the OpenShift console.
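The update can also be scripted. The sketch below assumes the secret stores credentials under `accesskey`/`secretkey` keys; verify the actual key names with `oc get secret mlpipeline-minio-artifact -o yaml` first:

```yaml
# Sketch: replace the artifact-store credentials (placeholder values).
apiVersion: v1
kind: Secret
metadata:
  name: mlpipeline-minio-artifact
  namespace: praxi
type: Opaque
stringData:
  accesskey: NEW_AWS_ACCESS_KEY_ID
  secretkey: NEW_AWS_SECRET_ACCESS_KEY
```

Apply it with `oc apply -f` and restart any running pipeline pods that cached the old credentials.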
Description
It seems the Kubeflow Pipelines SDK cannot be used with RHODS Pipelines at the moment: the Kubeflow pipeline endpoint is not exposed.
Forwarding the `ds-pipeline-pipelines-definition` service in my OpenShift project (namespace) didn't solve the problem, as my code (`kfp_tekton` with my bearer token, adapted from here) complained `certificate verify failed`. Also, it is not safe to simply forward the service. Trevor Royer commented the problem could be that "the container likely has a cert built into it that is self signed so your cert verification fails."
Proposed Solution
Trevor suggested adding a route for the `oauth` port of the service, e.g., `oc create route reencrypt --service=dsp-def-service --port=oauth`. He mentioned "you can setup a route and the cluster will create a new cert that is already trusted for you."
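Assuming the route is created with the command above (by default the route name matches the service name), the route host and a bearer token for the client can be fetched like so; both are standard `oc` commands, but the route name here is an assumption:

```sh
oc get route dsp-def-service -o jsonpath='{.spec.host}'   # route host for the client
oc whoami --show-token                                    # bearer token for TektonClient
```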
He also mentioned some workarounds. While I think we need a permanent fix to this problem, I am listing them here for the record.
- "`kfp_tekton` may provide an option to allow you to connect without authenticating the cert"
Reproducibility
Notes
- Debugging can be done with `oc describe <pod>` and AWS CloudTrail.
- The endpoint differs between SDKs: with kfp you connect to `<route>/pipelines`, "and in dsp you will need to use just the route. No `/pipelines`."
- The service is `data-science-pipelines-defenition` "and the route for the API endpoint will be the one pointing towards that."
- Only the client connection piece uses `kfp_tekton`. "Any normal pipeline definition pieces use `kfp`."