kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

UI doesn't correctly return logs #4425

Closed hsezhiyan closed 4 years ago

hsezhiyan commented 4 years ago

What steps did you take:

Made a simple KFP run using an MNIST PyTorch training/evaluation script.

What happened:

When making a KFP run, the UI occasionally fails to retrieve logs and shows the error Failed to retrieve pod logs. Inspecting the network tab in the Chrome developer tools shows that the underlying API call fails with: Could not get main container logs: Error: Unable to retrieve workflow status: [object Object].

What did you expect to happen:

All the steps in the pipeline UI should correctly return logs.

Environment:

How did you deploy Kubeflow Pipelines (KFP)? Deployed KFP on Amazon EKS infrastructure.

KFP version: v1.0.0

KFP SDK version: v1.0.0

/kind bug

eterna2 commented 4 years ago

I will look at it. But the error message means that when the UI backend tries to parse the workflow CR, the status field is malformed.

Based on my experience, this usually means that Argo failed in such a way that the status event was not emitted or captured.
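
One quick way to check whether the status field is intact is to fetch the workflow CR directly, e.g. with the Kubernetes Python client (the workflow name and namespace below are just placeholders):

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
api = client.CustomObjectsApi()

# Fetch the Argo Workflow CR and inspect its status field.
wf = api.get_namespaced_custom_object(
    group="argoproj.io",
    version="v1alpha1",
    namespace="kubeflow",
    plural="workflows",
    name="mnist-pytorch-example-zn7gp",
)
status = wf.get("status", {})
print(status.get("phase"))
print(list(status.get("nodes", {}).keys()))  # the node IDs the UI tries to look up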

But let me see if I can reproduce it.

hsezhiyan commented 4 years ago

Thank you @eedorenko, this is a closely related issue: https://github.com/kubeflow/pipelines/issues/2705#issuecomment-681967800. Let me know if you need help reproducing the bug.

For more context, I'm attaching an image of our pipeline. One thing we noticed is that every step except the first is duplicated. In each pair of duplicated steps, the second returns correct logs but the first does not. So in the image, Train model io works but train-model-io doesn't.

Furthermore, the steps that aren't logging correctly run in pods that have type: Retry, whereas the correctly functioning pods have type: Pod.

Failing step's node status (note the type: Retry at the bottom):

mnist-pytorch-example-zn7gp-399244972:
      boundaryID: mnist-pytorch-example-zn7gp
      children:
      - mnist-pytorch-example-zn7gp-1204333607
      displayName: evaluate-io
      finishedAt: "2020-08-27T19:27:42Z"
      id: mnist-pytorch-example-zn7gp-399244972
      inputs:
        parameters:
        - name: batch_size
          value: "64"
        - name: input_data_path
          value: mnist_pytorch_example/data
        - name: out_path
          value: /opt/zillow/shared/mnist_cnn.pth
        - name: test_accuracy_threshold
          value: "0.9"
        - name: test_batch_size
          value: "64"
        - name: train-model-io-create-shared-vol-name
          value: mnist-pytorch-example-zn7gp-shared-vol
        - name: train_accuracy_threshold
          value: "0.9"
      name: mnist-pytorch-example-zn7gp.evaluate-io
      outputs:
        artifacts:
        - name: mlpipeline-ui-metadata
          optional: true
          path: /tmp/outputs/mlpipeline_ui_metadata/data
          s3:
            accessKeySecret:
              key: accesskey
              name: mlpipeline-minio-artifact
            bucket: mlpipeline
            endpoint: minio-service.kubeflow:9000
            insecure: true
            key: artifacts/mnist-pytorch-example-zn7gp/mnist-pytorch-example-zn7gp-1204333607/mlpipeline-ui-metadata.tgz
            secretKeySecret:
              key: secretkey
              name: mlpipeline-minio-artifact
        - name: mlpipeline-metrics
          optional: true
          path: /tmp/outputs/mlpipeline_metrics/data
          s3:
            accessKeySecret:
              key: accesskey
              name: mlpipeline-minio-artifact
            bucket: mlpipeline
            endpoint: minio-service.kubeflow:9000
            insecure: true
            key: artifacts/mnist-pytorch-example-zn7gp/mnist-pytorch-example-zn7gp-1204333607/mlpipeline-metrics.tgz
            secretKeySecret:
              key: secretkey
              name: mlpipeline-minio-artifact
      phase: Succeeded
      startedAt: "2020-08-27T19:26:20Z"
      type: Retry

Working step's node status (note the type: Pod at the bottom):

mnist-pytorch-example-zn7gp-1204333607:
      boundaryID: mnist-pytorch-example-zn7gp
      displayName: evaluate-io(0)
      finishedAt: "2020-08-27T19:27:41Z"
      id: mnist-pytorch-example-zn7gp-1204333607
      name: mnist-pytorch-example-zn7gp.evaluate-io(0)
      outputs:
        artifacts:
        - name: mlpipeline-ui-metadata
          optional: true
          path: /tmp/outputs/mlpipeline_ui_metadata/data
          s3:
            accessKeySecret:
              key: accesskey
              name: mlpipeline-minio-artifact
            bucket: mlpipeline
            endpoint: minio-service.kubeflow:9000
            insecure: true
            key: artifacts/mnist-pytorch-example-zn7gp/mnist-pytorch-example-zn7gp-1204333607/mlpipeline-ui-metadata.tgz
            secretKeySecret:
              key: secretkey
              name: mlpipeline-minio-artifact
        - name: mlpipeline-metrics
          optional: true
          path: /tmp/outputs/mlpipeline_metrics/data
          s3:
            accessKeySecret:
              key: accesskey
              name: mlpipeline-minio-artifact
            bucket: mlpipeline
            endpoint: minio-service.kubeflow:9000
            insecure: true
            key: artifacts/mnist-pytorch-example-zn7gp/mnist-pytorch-example-zn7gp-1204333607/mlpipeline-metrics.tgz
            secretKeySecret:
              key: secretkey
              name: mlpipeline-minio-artifact
      phase: Succeeded
      startedAt: "2020-08-27T19:26:20Z"
      templateName: evaluate-io
      type: Pod
[Screenshot: pipeline DAG showing the duplicated steps]

eterna2 commented 4 years ago

@hsezhiyan

Is it possible for you to provide me with a dummy pipeline.yaml so I can reproduce it on my machine?

Did you use the set_retry method?

op.set_retry(2)
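
Something like this minimal sketch should be enough to set a retry limit on an op (the component name, image, and command are placeholders):

import kfp
from kfp import dsl

@dsl.pipeline(name="retry-demo", description="Minimal pipeline with a retry limit on one op")
def retry_demo_pipeline():
    op = dsl.ContainerOp(
        name="echo",
        image="alpine:3.12",
        command=["sh", "-c", "echo hello"],
    )
    # Setting a retry limit adds an Argo retryStrategy, which produces a Retry node
    # in the workflow status in addition to the actual Pod node.
    op.set_retry(2)

if __name__ == "__main__":
    kfp.compiler.Compiler().compile(retry_demo_pipeline, "retry_demo.yaml")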

I was unable to reproduce it on my machine. But looking through the code, this particular error is raised when we are unable to retrieve the Argo workflow CR properly.

Either it is a permission issue (unlikely, since the other pods work), or we infer the workflow name incorrectly (we just drop the last segment from the pod name), or the wrong pod name is passed to the backend.
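
By "drop the last segment" I mean roughly this (the pod name is taken from your node status above):

# Sketch of the inference described above: drop the last dash-separated
# segment of the pod name to get the workflow name.
pod_name = "mnist-pytorch-example-zn7gp-399244972"
workflow_name = pod_name.rsplit("-", 1)[0]  # -> "mnist-pytorch-example-zn7gp"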

Is it possible for you to share your pipeline yaml, and if possible the actual API request that fails (the query strings, i.e. the podname), and also the full workflow yaml?

hsezhiyan commented 4 years ago

Hello @eterna2 ,

Of course! Thank you for getting back to me. I didn't use the set_retry method.

Here is the pipeline yaml file, and here is the workflow yaml file.

Here are examples of failed API calls (retrieved using the network inspector in Chrome):

logs?podname=mnist-pytorch-example-f9prl-3877604630&podnamespace=hariharans
logs?podname=mnist-pytorch-example-f9prl-1107911560&podnamespace=hariharans
logs?podname=mnist-pytorch-example-f9prl-1507110387&podnamespace=hariharans
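
For reference, such a call could be replayed outside the browser roughly like this (the base URL and the path prefix in front of logs depend on how ml-pipeline-ui is exposed, so treat both as placeholders):

import requests

BASE_URL = "http://localhost:8080"  # e.g. a kubectl port-forward to the ml-pipeline-ui service

resp = requests.get(
    BASE_URL + "/k8s/pod/logs",  # path prefix assumed; copy the exact URL from the network tab
    params={
        "podname": "mnist-pytorch-example-f9prl-3877604630",
        "podnamespace": "hariharans",
    },
)
print(resp.status_code)
print(resp.text)  # e.g. "Could not get main container logs: ..."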

eterna2 commented 4 years ago

Ok, I figured out the reasons for the errors, although I'm not too sure how we can solve this.

The reason for the error is that Argo does not archive "Retry" pods at all. And it seems the pod either was never created or was garbage-collected the moment it errored out (before any logs could be archived to MinIO/S3).

I might need to write a small pod event monitor app to track whether the pod is created at all and whether it was garbage-collected too quickly (so I can tell whether it is an Argo issue or a KFP issue).
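
Roughly, such a monitor could look like this (the namespace and pod-name prefix are placeholders):

from kubernetes import client, config, watch

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

w = watch.Watch()
# Stream all events in the namespace and keep only those about the workflow's pods.
for event in w.stream(v1.list_namespaced_event, namespace="hariharans"):
    obj = event["object"]
    ref = obj.involved_object
    if ref.kind == "Pod" and ref.name.startswith("mnist-pytorch-example"):
        print(obj.last_timestamp, ref.name, obj.reason, obj.message)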

eterna2 commented 4 years ago

Ok, I did more investigation. The "Retry" pod was never created, as I did not capture any pod events for it at all.

I think we might need to raise a ticket upstream with Argo. I have no idea why this happens (creating a virtual pod?).

eterna2 commented 4 years ago

@Bobgy

I did more tests. This is a UI bug. When you set a retry limit for your ops, Argo will always create a Retry node, which is not an actual physical pod, so this node does not have any logs or pod info.

And this node will always exist, regardless of whether there is an actual retry or not.

So there are two solutions to this issue: either we do not render Retry nodes at all, or we give Retry nodes a separate color scheme with no metadata (i.e. no artifacts, no logs, no pod info, etc.), as this information is already reflected in the actual Pod node.

Bobgy commented 4 years ago

Thank you for the detailed investigation! Which solution do you think is easier to implement?

I still don't fully understand the topology here.

eterna2 commented 4 years ago

I have looked at the DAG graph rendering code. There is an existing function to filter out virtual nodes; I just need to add Retry as another virtual node type.
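
The actual change is in the UI's TypeScript graph code, but the idea is roughly this (sketched in Python for illustration; the set of virtual node types here is an assumption, not the real frontend list):

# Illustrative sketch only: collapse virtual nodes (including Retry) when
# building the DAG edges from a workflow's status.nodes map.
VIRTUAL_NODE_TYPES = {"StepGroup", "DAG", "TaskGroup", "Retry"}  # assumed set

def visible_edges(status_nodes):
    """Yield (parent_id, child_id) edges with virtual nodes skipped over."""
    def real_children(node_id):
        for child_id in status_nodes[node_id].get("children", []):
            if status_nodes[child_id].get("type") in VIRTUAL_NODE_TYPES:
                # Skip the virtual node and descend into its children instead.
                yield from real_children(child_id)
            else:
                yield child_id

    for node_id, node in status_nodes.items():
        if node.get("type") in VIRTUAL_NODE_TYPES:
            continue
        for child_id in real_children(node_id):
            yield (node_id, child_id)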

I will be making a PR to reflect this.

This is the original DAG with the Retry node (the second node): [image]

My PR will remove this virtual node: [image]