kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0
3.62k stars 1.63k forks source link

feat[frontend]: implement artifact-repositories configmap support #11354

Closed droctothorpe closed 1 day ago

droctothorpe commented 2 weeks ago

Description of your changes:

Argo Workflows supports the ability to specify an artifact repository for archived logs uniquely for each namespace rather than just globally as documented here.

This PR adds support for this functionality to the KFP UI. If a workflow runs and...

  1. The corresponding pods are deleted
  2. The corresponding workflow is deleted
  3. There is an artifact-repositories configmap in the target namespace that provides a unique, namespaced keyFormat

...this PR ensures that logs are still accessible in the UI. This functionality is disabled by default to avoid unnecessary network calls (for end users who don't leverage namespace-specific artifact-repositories). It can be toggled on through configs.ts or an environment variable.

Co-authored with @quinnovator 🌟 .

Fixes: https://github.com/kubeflow/pipelines/issues/11339

Checklist:

droctothorpe commented 2 weeks ago

@HumairAK to recreate the edge case:

  1. Deploy KFP to a local K8s cluster.
  2. Set ARGO_ARCHIVE_LOGS=true in configs.ts.
  3. cd into the frontend directory.
  4. Run NAMESPACE=kubeflow npm run start:proxy-and-server to start the UI on your host machine.
  5. Create the following configmap in the kubeflow namespace:
    metadata:
    annotations:
    workflows.argoproj.io/default-artifact-repository: artifact-repositories
    name: artifact-repositories
    namespace: kubeflow
    data:
    artifact-repositories: |-
    archiveLogs: true
    s3:
      accessKeySecret:
        key: accesskey
        name: mlpipeline-minio-artifact
      bucket: mlpipeline
      endpoint: minio-service.kubeflow:9000
      insecure: true
      keyFormat: foo
      secretKeySecret:
        key: secretkey
        name: mlpipeline-minio-artifact
    kind: ConfigMap
  6. Run a simple pipeline.
  7. Once it completes, delete the AWF workflow custom resource associated with the run.
  8. Try to access the logs of the run in the UI running on your host machine. On master, this operation will fail. On this branch, it will succeed.
  9. The next few steps are optional. Port-forward the minio pod.
  10. Navigate to the minio UI.
  11. Login (username: minio, password: minio123).
  12. Note how the logs are stored in mlpipeline/foo rather than the default keyFormat specified in configs.ts. Argo grants precedence to the keyFormat specified in the artifact-repositories configmap over the keyFormat specified in workflow-controller-configmap.
droctothorpe commented 2 weeks ago

I updated the instructions to reproduce to edge case to make manual validation a lot easier by running the UI on your host machine.

HumairAK commented 1 week ago

Hey @droctothorpe /@quinnovator, thanks for this. I'm trying this out now and encountering some hiccups. I suspect it might be my environment/setup, but I want to confirm with you first.

  1. When buildling the image and trying this out with a pod deployment of the UI, I'm not seeing this working, what I see:

image

the UI yaml is as posted below:

deployment.yaml ```yaml kind: Deployment apiVersion: apps/v1 metadata: annotations: deployment.kubernetes.io/revision: '3' kubectl.kubernetes.io/last-applied-configuration: | {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{},"labels":{"app":"ml-pipeline-ui","application-crd-id":"kubeflow-pipelines"},"name":"ml-pipeline-ui","namespace":"kubeflow"},"spec":{"selector":{"matchLabels":{"app":"ml-pipeline-ui","application-crd-id":"kubeflow-pipelines"}},"template":{"metadata":{"annotations":{"cluster-autoscaler.kubernetes.io/safe-to-evict":"true"},"labels":{"app":"ml-pipeline-ui","application-crd-id":"kubeflow-pipelines"}},"spec":{"containers":[{"env":[{"name":"DISABLE_GKE_METADATA","value":"true"},{"name":"VIEWER_TENSORBOARD_POD_TEMPLATE_SPEC_PATH","value":"/etc/config/viewer-pod-template.json"},{"name":"MINIO_NAMESPACE","valueFrom":{"fieldRef":{"fieldPath":"metadata.namespace"}}},{"name":"MINIO_ACCESS_KEY","valueFrom":{"secretKeyRef":{"key":"accesskey","name":"mlpipeline-minio-artifact"}}},{"name":"MINIO_SECRET_KEY","valueFrom":{"secretKeyRef":{"key":"secretkey","name":"mlpipeline-minio-artifact"}}},{"name":"ALLOW_CUSTOM_VISUALIZATIONS","value":"true"},{"name":"FRONTEND_SERVER_NAMESPACE","valueFrom":{"fieldRef":{"fieldPath":"metadata.namespace"}}}],"image":"gcr.io/ml-pipeline/frontend:2.3.0","imagePullPolicy":"IfNotPresent","livenessProbe":{"exec":{"command":["wget","-q","-S","-O","-","http://localhost:3000/apis/v1beta1/healthz"]},"initialDelaySeconds":3,"periodSeconds":5,"timeoutSeconds":2},"name":"ml-pipeline-ui","ports":[{"containerPort":3000}],"readinessProbe":{"exec":{"command":["wget","-q","-S","-O","-","http://localhost:3000/apis/v1beta1/healthz"]},"initialDelaySeconds":3,"periodSeconds":5,"timeoutSeconds":2},"resources":{"requests":{"cpu":"10m","memory":"70Mi"}},"volumeMounts":[{"mountPath":"/etc/config","name":"config-volume","readOnly":true}]}],"serviceAccountName":"ml-pipeline-ui","volumes":[{"configMap":{"name":"ml-pipeline-ui-configmap"},"name":"config-volume"}]}}}} resourceVersion: '627852' name: ml-pipeline-ui uid: 4d454d73-05f6-4b01-b744-8a4c9aec05ec creationTimestamp: '2024-11-12T22:37:37Z' generation: 8 namespace: kubeflow labels: app: ml-pipeline-ui application-crd-id: kubeflow-pipelines spec: replicas: 0 selector: matchLabels: app: ml-pipeline-ui application-crd-id: kubeflow-pipelines template: metadata: creationTimestamp: null labels: app: ml-pipeline-ui application-crd-id: kubeflow-pipelines annotations: cluster-autoscaler.kubernetes.io/safe-to-evict: 'true' spec: restartPolicy: Always serviceAccountName: ml-pipeline-ui schedulerName: default-scheduler terminationGracePeriodSeconds: 30 securityContext: {} containers: - resources: requests: cpu: 10m memory: 70Mi readinessProbe: exec: command: - wget - '-q' - '-S' - '-O' - '-' - 'http://localhost:3000/apis/v1beta1/healthz' initialDelaySeconds: 3 timeoutSeconds: 2 periodSeconds: 5 successThreshold: 1 failureThreshold: 3 terminationMessagePath: /dev/termination-log name: ml-pipeline-ui livenessProbe: exec: command: - wget - '-q' - '-S' - '-O' - '-' - 'http://localhost:3000/apis/v1beta1/healthz' initialDelaySeconds: 3 timeoutSeconds: 2 periodSeconds: 5 successThreshold: 1 failureThreshold: 3 env: - name: DISABLE_GKE_METADATA value: 'true' - name: VIEWER_TENSORBOARD_POD_TEMPLATE_SPEC_PATH value: /etc/config/viewer-pod-template.json - name: MINIO_NAMESPACE valueFrom: fieldRef: apiVersion: v1 fieldPath: metadata.namespace - name: MINIO_ACCESS_KEY valueFrom: secretKeyRef: name: mlpipeline-minio-artifact key: accesskey - name: MINIO_SECRET_KEY valueFrom: secretKeyRef: name: mlpipeline-minio-artifact key: secretkey - name: ALLOW_CUSTOM_VISUALIZATIONS value: 'true' - name: FRONTEND_SERVER_NAMESPACE valueFrom: fieldRef: apiVersion: v1 fieldPath: metadata.namespace - name: ARGO_ARTIFACT_REPOSITORIES_LOOKUP value: 'true' - name: ARGO_ARCHIVE_LOGS value: 'true' ports: - containerPort: 3000 protocol: TCP imagePullPolicy: IfNotPresent volumeMounts: - name: config-volume readOnly: true mountPath: /etc/config terminationMessagePolicy: File image: 'quay.io/hukhan/ds-pipelines-frontend:11354' serviceAccount: ml-pipeline-ui volumes: - name: config-volume configMap: name: ml-pipeline-ui-configmap defaultMode: 420 dnsPolicy: ClusterFirst strategy: type: RollingUpdate rollingUpdate: maxUnavailable: 25% maxSurge: 25% revisionHistoryLimit: 10 progressDeadlineSeconds: 600 status: observedGeneration: 8 conditions: - type: Progressing status: 'True' lastUpdateTime: '2024-11-13T04:36:07Z' lastTransitionTime: '2024-11-12T22:37:37Z' reason: NewReplicaSetAvailable message: ReplicaSet "ml-pipeline-ui-5c5479dc67" has successfully progressed. - type: Available status: 'True' lastUpdateTime: '2024-11-13T22:54:29Z' lastTransitionTime: '2024-11-13T22:54:29Z' reason: MinimumReplicasAvailable message: Deployment has minimum availability. ```

And the artifact repository configmap:

kind: ConfigMap
apiVersion: v1
metadata:
  name: artifact-repositories
  namespace: kubeflow
  # edit: forgot to add this
  annotations:
    workflows.argoproj.io/default-artifact-repository: artifact-repositories
data:
  artifact-repositories: |-
    archiveLogs: true
    s3:
      accessKeySecret:
        key: accesskey
        name: mlpipeline-minio-artifact
      bucket: mlpipeline
      endpoint: minio-service.kubeflow:9000
      insecure: true
      keyFormat: custom_location/{{workflow.name}}/{{workflow.creationTimestamp.Y}}/{{workflow.creationTimestamp.m}}/{{workflow.creationTimestamp.d}}/{{pod.name}}
      secretKeySecret:
        key: secretkey
        name: mlpipeline-minio-artifact

As you can see I have set ARGO_ARTIFACT_REPOSITORIES_LOOKUP and ARGO_ARCHIVE_LOGS to true. I can also confirm it's got the PR changes by viewing the UI pod container logs, the UI config has:

{
argo: {
archiveArtifactory: 'minio',
archiveBucketName: 'mlpipeline',
archiveLogs: true,
keyFormat: 'artifacts/{{workflow.name}}/{{workflow.creationTimestamp.Y}}/{{workflow.creationTimestamp.m}}/{{workflow.creationTimestamp.d}}/{{pod.name}}',
artifactRepositoriesLookup: true      

^^^^^ PR change
  1. When trying this locally, I see some strange behavior where one of the task nodes is fetching the logs from object store, but not the other task. In some scenarios both fail to fetch it. My suspicion is there is some weird timeout behavior that's faulty with connecting to object store on my side, but wondering if you encounter this behavior at all:

logs-not-pulled

The pipeline used:

pipeline.yaml ```python # PIPELINE DEFINITION # Name: tutorial-data-passing-2 # Inputs: # message: str [Default: 'message'] components: comp-preprocess: executorLabel: exec-preprocess inputDefinitions: parameters: message: parameterType: STRING outputDefinitions: artifacts: output_dataset_one: artifactType: schemaTitle: system.Dataset schemaVersion: 0.0.1 output_dataset_two_path: artifactType: schemaTitle: system.Dataset schemaVersion: 0.0.1 parameters: output_bool_parameter_path: parameterType: BOOLEAN output_dict_parameter_path: parameterType: STRUCT output_list_parameter_path: parameterType: LIST output_parameter_path: parameterType: STRING comp-train: executorLabel: exec-train inputDefinitions: artifacts: dataset_one_path: artifactType: schemaTitle: system.Dataset schemaVersion: 0.0.1 dataset_two: artifactType: schemaTitle: system.Dataset schemaVersion: 0.0.1 parameters: input_bool: parameterType: BOOLEAN input_dict: parameterType: STRUCT input_list: parameterType: LIST message: parameterType: STRING num_steps: defaultValue: 100.0 isOptional: true parameterType: NUMBER_INTEGER outputDefinitions: artifacts: model: artifactType: schemaTitle: system.Model schemaVersion: 0.0.1 deploymentSpec: executors: exec-preprocess: container: args: - --executor_input - '{{$}}' - --function_to_execute - preprocess command: - sh - -c - "\nif ! [ -x \"$(command -v pip)\" ]; then\n python3 -m ensurepip ||\ \ python3 -m ensurepip --user || apt-get install python3-pip\nfi\n\nPIP_DISABLE_PIP_VERSION_CHECK=1\ \ python3 -m pip install --quiet --no-warn-script-location 'kfp==2.3.0'\ \ '--no-deps' 'typing-extensions>=3.7.4,<5; python_version<\"3.9\"' && \"\ $0\" \"$@\"\n" - sh - -ec - 'program_path=$(mktemp -d) printf "%s" "$0" > "$program_path/ephemeral_component.py" _KFP_RUNTIME=true python3 -m kfp.dsl.executor_main --component_module_path "$program_path/ephemeral_component.py" "$@" ' - "\nimport kfp\nfrom kfp import dsl\nfrom kfp.dsl import *\nfrom typing import\ \ *\n\ndef preprocess(\n # An input parameter of type string.\n message:\ \ str,\n # Use Output[T] to get a metadata-rich handle to the output\ \ artifact\n # of type `Dataset`.\n output_dataset_one: Output[Dataset],\n\ \ # A locally accessible filepath for another output artifact of type\n\ \ # `Dataset`.\n output_dataset_two_path: OutputPath('Dataset'),\n\ \ # A locally accessible filepath for an output parameter of type string.\n\ \ output_parameter_path: OutputPath(str),\n # A locally accessible\ \ filepath for an output parameter of type bool.\n output_bool_parameter_path:\ \ OutputPath(bool),\n # A locally accessible filepath for an output parameter\ \ of type dict.\n output_dict_parameter_path: OutputPath(Dict[str, int]),\n\ \ # A locally accessible filepath for an output parameter of type list.\n\ \ output_list_parameter_path: OutputPath(List[str]),\n):\n \"\"\"\ Dummy preprocessing step.\"\"\"\n\n # Use Dataset.path to access a local\ \ file path for writing.\n # One can also use Dataset.uri to access the\ \ actual URI file path.\n with open(output_dataset_one.path, 'w') as\ \ f:\n f.write(message)\n\n # OutputPath is used to just pass\ \ the local file path of the output artifact\n # to the function.\n \ \ with open(output_dataset_two_path, 'w') as f:\n f.write(message)\n\ \n with open(output_parameter_path, 'w') as f:\n f.write(message)\n\ \n with open(output_bool_parameter_path, 'w') as f:\n f.write(\n\ \ str(True)) # use either `str()` or `json.dumps()` for bool\ \ values.\n\n import json\n with open(output_dict_parameter_path,\ \ 'w') as f:\n f.write(json.dumps({'A': 1, 'B': 2}))\n\n with\ \ open(output_list_parameter_path, 'w') as f:\n f.write(json.dumps(['a',\ \ 'b', 'c']))\n\n" image: quay.io/opendatahub/ds-pipelines-ci-executor-image:v1.0 exec-train: container: args: - --executor_input - '{{$}}' - --function_to_execute - train command: - sh - -c - "\nif ! [ -x \"$(command -v pip)\" ]; then\n python3 -m ensurepip ||\ \ python3 -m ensurepip --user || apt-get install python3-pip\nfi\n\nPIP_DISABLE_PIP_VERSION_CHECK=1\ \ python3 -m pip install --quiet --no-warn-script-location 'kfp==2.3.0'\ \ '--no-deps' 'typing-extensions>=3.7.4,<5; python_version<\"3.9\"' && \"\ $0\" \"$@\"\n" - sh - -ec - 'program_path=$(mktemp -d) printf "%s" "$0" > "$program_path/ephemeral_component.py" _KFP_RUNTIME=true python3 -m kfp.dsl.executor_main --component_module_path "$program_path/ephemeral_component.py" "$@" ' - "\nimport kfp\nfrom kfp import dsl\nfrom kfp.dsl import *\nfrom typing import\ \ *\n\ndef train(\n # Use InputPath to get a locally accessible path\ \ for the input artifact\n # of type `Dataset`.\n dataset_one_path:\ \ InputPath('Dataset'),\n # Use Input[T] to get a metadata-rich handle\ \ to the input artifact\n # of type `Dataset`.\n dataset_two: Input[Dataset],\n\ \ # An input parameter of type string.\n message: str,\n # Use\ \ Output[T] to get a metadata-rich handle to the output artifact\n #\ \ of type `Model`.\n model: Output[Model],\n # An input parameter\ \ of type bool.\n input_bool: bool,\n # An input parameter of type\ \ dict.\n input_dict: Dict[str, int],\n # An input parameter of type\ \ List[str].\n input_list: List[str],\n # An input parameter of type\ \ int with a default value.\n num_steps: int = 100,\n):\n \"\"\"Dummy\ \ Training step.\"\"\"\n with open(dataset_one_path, 'r') as input_file:\n\ \ dataset_one_contents = input_file.read()\n\n with open(dataset_two.path,\ \ 'r') as input_file:\n dataset_two_contents = input_file.read()\n\ \ print(\"test3\")\n line = (f'dataset_one_contents: {dataset_one_contents}\ \ || '\n f'dataset_two_contents: {dataset_two_contents} || '\n\ \ f'message: {message} || '\n f'input_bool: {input_bool},\ \ type {type(input_bool)} || '\n f'input_dict: {input_dict},\ \ type {type(input_dict)} || '\n f'input_list: {input_list},\ \ type {type(input_list)} \\n')\n\n with open(model.path, 'w') as output_file:\n\ \ for i in range(num_steps):\n output_file.write('Step\ \ {}\\n{}\\n=====\\n'.format(i, line))\n\n # model is an instance of\ \ Model artifact, which has a .metadata dictionary\n # to store arbitrary\ \ metadata for the output artifact.\n model.metadata['accuracy'] = 0.9\n\ \n" image: quay.io/opendatahub/ds-pipelines-ci-executor-image:v1.0 pipelineInfo: name: tutorial-data-passing-2 root: dag: tasks: preprocess: cachingOptions: enableCache: true componentRef: name: comp-preprocess inputs: parameters: message: componentInputParameter: message taskInfo: name: preprocess train: cachingOptions: enableCache: true componentRef: name: comp-train dependentTasks: - preprocess inputs: artifacts: dataset_one_path: taskOutputArtifact: outputArtifactKey: output_dataset_one producerTask: preprocess dataset_two: taskOutputArtifact: outputArtifactKey: output_dataset_two_path producerTask: preprocess parameters: input_bool: taskOutputParameter: outputParameterKey: output_bool_parameter_path producerTask: preprocess input_dict: taskOutputParameter: outputParameterKey: output_dict_parameter_path producerTask: preprocess input_list: taskOutputParameter: outputParameterKey: output_list_parameter_path producerTask: preprocess message: taskOutputParameter: outputParameterKey: output_parameter_path producerTask: preprocess taskInfo: name: train inputDefinitions: parameters: message: defaultValue: message isOptional: true parameterType: STRING schemaVersion: 2.1.0 sdkVersion: kfp-2.3.0 ```

So question for you:

  1. Did you have this working with a built image and running in a pod? If so, I suspect I must have something configured incorrectly then.
  2. Does the above pipeline work fine for you when you juggle between either task's pod logs (after deleting the underlying WF) ? (or any other pipeline with more than one task node with logs)
droctothorpe commented 1 week ago

Thank you so much for taking the time to validate this, @HumairAK, and for being so thorough with documenting the issue that you encountered.

I'm going to try a couple of things to see if I can repro your results:

  1. Run the UI on host (using NAMESPACE=kubeflow npm run start:proxy-and-server), validate that I can toggle between multiple tasks.
  2. Run the UI on host, use your pipeline.
  3. Same as above but also use your configmap.
  4. Same as above but build and deploy the UI image instead of running it on the host.

Results:

  1. Run the UI on host, validate that I can toggle between multiple tasks. Worked.
image image
  1. Run the UI on host, use your pipeline. Worked.
image image
  1. Same as above but also use your configmap.

I think I see the problem. I did a comparison of the two configmaps:

image

Your configmap is missing the workflows.argoproj.io/default-artifact-repository: artifact-repositories annotation. From the AWF documentation:

image

Can you run the same test but make sure to include that annotation in your configmap?

Thank you! 🙏

HumairAK commented 1 week ago

Ah yeah I actually did include the annotation, when trying to prune the configmap of runtime information for the github post, I accidentally left it out.

I see that in your configmap your path is set to keyFormat: foo which makes it seem like it will just overwrite the main.log for each task, but in your (1) and (2), the logs are different. So I'm wondering if you cleaned up the underlying Workflows for those tests? With the given configmap I would have expected the logs view to show the same logs.

droctothorpe commented 1 week ago

Confirming that the workflows were deleted.

image

I see that in your configmap your path is set to keyFormat: foo which makes it seem like it will just overwrite the main.log for each task, but in your (1) and (2), the logs are different. With the given configmap I would have expected the logs view to show the same logs.

I actually had the keyFormat field in my configmap set to foo/artifacts/{{workflow.name}}/{{workflow.creationTimestamp.Y}}/{{workflow.creationTimestamp.m}}/{{workflow.creationTimestamp.d}}/{{pod.name}} from an earlier test where I wanted to confirm that the string replacement was happening properly. That's why the logs were different.

I changed keyFormat to a static string ("foo") to see what happens. Confirming that the logs are in fact identical for differing components in that scenario.

I'm still unable to repro the failure scenario. I'll try to build and deploy the image instead of running it on the host next.

HumairAK commented 6 days ago

AHA found the reason:

k8serr:Could not get configMap artifact-repositories in namespace kubeflow: configmaps "artifact-repositories" is forbidden: User "system:serviceaccount:kubeflow:ml-pipeline-ui" cannot get resource "configmaps" in API group "" in the namespace "kubeflow", additional info: [object Object]

we'll need to update the mlpipeline ui role for this

and we should probably surface this error so it's easier to find next time

HumairAK commented 2 days ago

/lgtm /approve

let me retry running the ci

google-oss-prow[bot] commented 2 days ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: HumairAK

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[frontend/OWNERS](https://github.com/kubeflow/pipelines/blob/master/frontend/OWNERS)~~ [HumairAK] - ~~[manifests/kustomize/OWNERS](https://github.com/kubeflow/pipelines/blob/master/manifests/kustomize/OWNERS)~~ [HumairAK] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
HumairAK commented 1 day ago

@droctothorpe are you guys willing to provide some supplementing docs for this functionality? I'm thinking https://www.kubeflow.org/docs/components/pipelines/operator-guides/ here makes the most sense

droctothorpe commented 1 day ago

Thanks for the review and merge, @HumairAK! Happy to add some supplemental docs. Would you prefer for it to be in a standalone page or appended to https://www.kubeflow.org/docs/components/pipelines/operator-guides/configure-object-store/?

HumairAK commented 1 day ago

I think we can create a standalone page on logging. Especially since we're planning future work on that front, there will probably be more to add later. WDYT?