I will look at it. But the error msg means that when the UI backend tries to parse the CR of the workflow, the status field is malformed.
Based on my experience, it usually means that Argo failed in such a way that the status event was not emitted or captured.
But let me see if I can reproduce it.
Thank you @eedorenko, this is a closely related issue: https://github.com/kubeflow/pipelines/issues/2705#issuecomment-681967800. Let me know if you need help reproducing the bug.
For more context, I'm attaching an image of our pipeline. One thing we noticed is that every step except the first is duplicated. In each pair of duplicated steps, the second has correct logs but the first doesn't. So in the image, `Train model io` works but `train-model-io` doesn't.
Furthermore, the steps that aren't logging correctly run in pods that have `type: Retry`, whereas correctly functioning pods have `type: Pod`.
Failing pod's spec (note the `type: Retry` at the bottom):
```yaml
mnist-pytorch-example-zn7gp-399244972:
  boundaryID: mnist-pytorch-example-zn7gp
  children:
  - mnist-pytorch-example-zn7gp-1204333607
  displayName: evaluate-io
  finishedAt: "2020-08-27T19:27:42Z"
  id: mnist-pytorch-example-zn7gp-399244972
  inputs:
    parameters:
    - name: batch_size
      value: "64"
    - name: input_data_path
      value: mnist_pytorch_example/data
    - name: out_path
      value: /opt/zillow/shared/mnist_cnn.pth
    - name: test_accuracy_threshold
      value: "0.9"
    - name: test_batch_size
      value: "64"
    - name: train-model-io-create-shared-vol-name
      value: mnist-pytorch-example-zn7gp-shared-vol
    - name: train_accuracy_threshold
      value: "0.9"
  name: mnist-pytorch-example-zn7gp.evaluate-io
  outputs:
    artifacts:
    - name: mlpipeline-ui-metadata
      optional: true
      path: /tmp/outputs/mlpipeline_ui_metadata/data
      s3:
        accessKeySecret:
          key: accesskey
          name: mlpipeline-minio-artifact
        bucket: mlpipeline
        endpoint: minio-service.kubeflow:9000
        insecure: true
        key: artifacts/mnist-pytorch-example-zn7gp/mnist-pytorch-example-zn7gp-1204333607/mlpipeline-ui-metadata.tgz
        secretKeySecret:
          key: secretkey
          name: mlpipeline-minio-artifact
    - name: mlpipeline-metrics
      optional: true
      path: /tmp/outputs/mlpipeline_metrics/data
      s3:
        accessKeySecret:
          key: accesskey
          name: mlpipeline-minio-artifact
        bucket: mlpipeline
        endpoint: minio-service.kubeflow:9000
        insecure: true
        key: artifacts/mnist-pytorch-example-zn7gp/mnist-pytorch-example-zn7gp-1204333607/mlpipeline-metrics.tgz
        secretKeySecret:
          key: secretkey
          name: mlpipeline-minio-artifact
  phase: Succeeded
  startedAt: "2020-08-27T19:26:20Z"
  type: Retry
```
Working pod's spec (note the `type: Pod` at the bottom):
```yaml
mnist-pytorch-example-zn7gp-1204333607:
  boundaryID: mnist-pytorch-example-zn7gp
  displayName: evaluate-io(0)
  finishedAt: "2020-08-27T19:27:41Z"
  id: mnist-pytorch-example-zn7gp-1204333607
  name: mnist-pytorch-example-zn7gp.evaluate-io(0)
  outputs:
    artifacts:
    - name: mlpipeline-ui-metadata
      optional: true
      path: /tmp/outputs/mlpipeline_ui_metadata/data
      s3:
        accessKeySecret:
          key: accesskey
          name: mlpipeline-minio-artifact
        bucket: mlpipeline
        endpoint: minio-service.kubeflow:9000
        insecure: true
        key: artifacts/mnist-pytorch-example-zn7gp/mnist-pytorch-example-zn7gp-1204333607/mlpipeline-ui-metadata.tgz
        secretKeySecret:
          key: secretkey
          name: mlpipeline-minio-artifact
    - name: mlpipeline-metrics
      optional: true
      path: /tmp/outputs/mlpipeline_metrics/data
      s3:
        accessKeySecret:
          key: accesskey
          name: mlpipeline-minio-artifact
        bucket: mlpipeline
        endpoint: minio-service.kubeflow:9000
        insecure: true
        key: artifacts/mnist-pytorch-example-zn7gp/mnist-pytorch-example-zn7gp-1204333607/mlpipeline-metrics.tgz
        secretKeySecret:
          key: secretkey
          name: mlpipeline-minio-artifact
  phase: Succeeded
  startedAt: "2020-08-27T19:26:20Z"
  templateName: evaluate-io
  type: Pod
```
@hsezhiyan
Is it possible for you to provide me with a dummy pipeline.yaml so I can reproduce it on my machine?
Did you use the `set_retry` method?

```python
op.set_retry(2)
```
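For context, here's a minimal sketch of where `set_retry` would sit in a KFP v1 pipeline; the pipeline name, component, and image below are placeholders, not taken from the issue:

```python
import kfp.dsl as dsl

@dsl.pipeline(name="retry-example")
def retry_pipeline():
    # Giving an op a retry limit makes the compiler emit an Argo
    # retryStrategy for that step, which is what produces the extra
    # "Retry" node discussed later in this thread.
    op = dsl.ContainerOp(
        name="train",
        image="python:3.7",
        command=["echo", "hello world"],
    )
    op.set_retry(2)  # retry the step up to 2 times on failure
```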
I was unable to reproduce it on my machine. But I looked through the code; this particular error is raised when we are unable to retrieve the argo workflow CR properly.
Either it is a permission issue (unlikely if the other pods work), or we inferred the workflow name wrongly (we just drop the last segment from the pod name), or the wrong pod name is passed to the backend.
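To illustrate the inference mentioned above, here is a minimal sketch of that name-dropping step (the helper name is hypothetical; it just mirrors the described behaviour):

```python
def workflow_name_from_pod_name(pod_name: str) -> str:
    # Argo pod names look like "<workflow-name>-<node-suffix>", so the
    # backend infers the workflow name by dropping the last segment.
    return pod_name.rsplit("-", 1)[0]

# e.g. "mnist-pytorch-example-zn7gp-399244972" -> "mnist-pytorch-example-zn7gp"
```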
Is it possible for you to share your pipeline yaml, and if possible the actual API request which fails (the query strings, i.e. the podname), and also the full workflow yaml?
Hello @eterna2,
Of course! Thank you for getting back to me. I didn't use the `set_retry` method.
Here is the pipeline yaml file. Here is the workflow yaml file.
Here are examples of failed API calls (retrieved using the inspect functionality in Chrome):
```
logs?podname=mnist-pytorch-example-f9prl-3877604630&podnamespace=hariharans
logs?podname=mnist-pytorch-example-f9prl-1107911560&podnamespace=hariharans
logs?podname=mnist-pytorch-example-f9prl-1507110387&podnamespace=hariharans
```
Ok, I figured out the reason for the errors, although I'm not too sure how we can solve this.
The error occurs because argo does not archive "Retry" pods at all. And it seems either the pod is never created, or it was garbage-collected the moment it errored out (before any logs could be archived to minio/s3).
I might need to write a small pod event monitor app and track whether the pod is created or not, and whether it was garbage-collected too quickly (so I can tell whether it is an argo issue or a kfp issue).
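A minimal sketch of such a monitor, assuming the official kubernetes Python client; the namespace is taken from the failing requests above, and this is just one way to watch pod events, not necessarily the tool actually used:

```python
from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()

# Stream events in the namespace and print anything involving a Pod, so
# we can see whether the "Retry" pod is ever created or garbage-collected.
w = watch.Watch()
for event in w.stream(v1.list_namespaced_event, namespace="hariharans"):
    obj = event["object"]
    if obj.involved_object.kind == "Pod":
        print(obj.last_timestamp, obj.involved_object.name,
              obj.reason, obj.message)
```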
Ok, I did more investigation. The "Retry" pod was never created, as I did not capture any pod events from it at all.
I think we might need to raise a ticket upstream to argo. I have no idea why this happens (is it creating a virtual pod?).
@Bobgy
I did more tests. This is a UI bug. When you set a retry limit for your ops, argo will always create a `Retry` node, which is not an actual physical pod, hence this node does not have any logs or pod info.
And this node will always exist regardless of whether there is an actual retry or not.
So there are 2 solutions to this issue. We do not render `Retry` nodes at all. Or we give `Retry` nodes a separate color scheme with no metadata at all (i.e. no artifacts, no logs, no pod info, etc), as this info will be reflected in the actual `Pod` node.
Thank you for the detailed investigation! Which solution do you think is easier to implement?
I still don't fully understand the topology here.
I have looked at the DAG graph rendering code. There is an existing function to filter out virtual nodes. I just need to add `Retry` as another virtual node.
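The actual UI code is TypeScript, but the idea reads the same as a Python sketch; the exact set of existing virtual node types below is an assumption for illustration:

```python
# Node types that exist only in the workflow topology and have no
# backing pod. "Retry" is the addition proposed in this issue; the
# other entry is assumed, not taken from the real filter list.
VIRTUAL_NODE_TYPES = {"StepGroup", "Retry"}

def is_virtual_node(node: dict) -> bool:
    return node.get("type") in VIRTUAL_NODE_TYPES

def renderable_nodes(status_nodes: dict) -> dict:
    # Drop virtual nodes before building the DAG for rendering, so a
    # "Retry" wrapper never shows up as a clickable node without logs.
    return {node_id: node for node_id, node in status_nodes.items()
            if not is_virtual_node(node)}
```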
I will be making a PR to reflect this.
This is the original DAG with the `Retry` node (2nd node).
My PR will remove this virtual node.
What steps did you take:
Made a simple KFP run using an MNIST PyTorch training/evaluation script.
What happened:
When making a KFP run, occasionally the UI does not properly retrieve logs and instead gives the error `Failed to retrieve pod logs.`
By inspecting the Chrome browser's network tab, it appears to be an API call issue with the following error: `Could not get main container logs: Error: Unable to retrieve workflow status: [object Object].`
What did you expect to happen:
All the steps in the pipeline UI should correctly return logs.
Environment:
How did you deploy Kubeflow Pipelines (KFP)? Deployed KFP on Amazon EKS infrastructure.
KFP version: v1.0.0
KFP SDK version: v1.0.0
/kind bug