kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

[feature] display main.log as artifact for each step #10036

Open Linchin opened 1 year ago

Linchin commented 1 year ago

Feature Area

/area frontend /area backend

What feature would you like to see?

A feature similar to the v1 behavior. In the details page of each step, the log appeared as one of the artifacts that users could view.

In v2, this main log artifact is no longer displayed. It would be great if we could add a similar section to show the logs, or use the log artifact as a source for the "logs" panel after the pod is deleted, instead of directly pulling from the (already deleted) pod.

What is the use case or pain point?

Because the logs of completed pods are automatically deleted by Kubernetes after 24 hours, users can no longer access the logs of earlier pipeline runs.

Is there a workaround currently?

No workaround.

/cc @kromanow94


Love this idea? Give it a 👍.

github-actions[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

revolutionisme commented 10 months ago

I read in this comment that this issue was not included in the previous release; just wanted to ask whether there is an update on this.

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 7 months ago

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

AnnKatrinBecker commented 6 months ago

/reopen

Is re-adding this feature to v2 planned for future releases? It would be very helpful.

google-oss-prow[bot] commented 6 months ago

@AnnKatrinBecker: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to [this](https://github.com/kubeflow/pipelines/issues/10036#issuecomment-2109519119):

> /reopen
>
> Is readding this feature to v2 planned for future releases? It would be very helpful.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

HumairAK commented 6 months ago

I'm a bit unfamiliar with the v1 version of this, did this come from argo's archive logging feature?

In v2 kfp aims to be agnostic to the backing engine, so if this is an argo specific feature, we may need to consider a separate feature/proposal on how to handle persistent logging in general for kfp.

TobiasGoerke commented 6 months ago

> I'm a bit unfamiliar with the v1 version of this, did this come from argo's archive logging feature?
>
> In v2 kfp aims to be agnostic to the backing engine, so if this is an argo specific feature, we may need to consider a separate feature/proposal on how to handle persistent logging in general for kfp.

Yes, Argo Workflows stores the logs separately and given only kfp metadata, there currently seems to be no way to obtain the logs' locations. You'd need to access the Argo Workflow object directly to obtain the info.

Including this feature directly in kfp would make sense and is, imo, much needed; otherwise users aren't able to view logs after the step's pod has been deleted.

sanchesoon commented 6 months ago

Hi @HumairAK, what do you think about this? Points:

Thoughts:

HumairAK commented 6 months ago

/reopen

google-oss-prow[bot] commented 6 months ago

@HumairAK: Reopened this issue.

In response to [this](https://github.com/kubeflow/pipelines/issues/10036#issuecomment-2133639992):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

HumairAK commented 6 months ago

I agree persisted logging is very important, but we should clarify if we need them as artifacts, or do users just want to see logs after pods have been removed?

For example, if we have persisted logs, the log viewer could instead fetch the logs from the persisted location instead of pods (I believe this is the logic currently present in ui, though it may not be utilized in v2). Could it be confusing to see 2 different locations where logs are being shown?

> A DS engineer would be interested in seeing logs as an artifact, as part of investigating a model's producers

Not disagreeing, but could you elaborate on specifics?

I like the idea of decoupling the implementation to existing solutions (or at least giving the user the option to, if we opt to natively implement).

> in the SDK we should be able to disable logs

I think we should also allow users to disable/enable not only in the SDK, but via api & ui as well.

sanchesoon commented 6 months ago

Why engineers may be interested in this:

The v1 version was conceptually correct, because it combined log interaction during the run (a live stream from the pod) with long-term log persistence.

Saving to object storage can be the default, but it should also be possible to use other log-streaming solutions such as ELK, Kafka, or Google Cloud Logging, so this can be optional.

A current workaround with the Argo Workflows backend is to use a keyFormat like /artifacts/{workflow_id}/{pod_id}/main.log, patching the frontend's split function and using the v1 handler.

tom-pavz commented 5 months ago

> A current workaround with the Argo Workflows backend is to use a keyFormat like /artifacts/{workflow_id}/{pod_id}/main.log, patching the frontend's split function and using the v1 handler.

@sanchesoon Can you give more details on this workaround? I am currently using KFP 2.1.0 with the normal argoworkflow backend. I really want a way to provide users the ability to view their pipeline tasks' logs simply by clicking through the KFP UI.

thesuperzapper commented 5 months ago

@kubeflow/pipelines @HumairAK @pdettori I would like to ask that we prioritize this issue; it is a significant regression for V2 pipelines.

The gist of the issue is that when Argo Workflows has the archiveLogs config set (which we do at the cluster level in our manifests), Argo Workflows automatically creates an output artifact called main-logs, which stores the full Pod logs of each step as an S3/MinIO artifact.

In Kubeflow Pipelines V1, we used to capture this output artifact, which allowed users to see the old logs when the Pod had disappeared, but in V2 this was strangely omitted.

Fixing it should be very straightforward, we just need to capture the artifact location of the main-logs output of the executor step, and store it as an "Output Artifact" in the ML Metadata of the step.

For reference, we can retrieve this artifact by reading the status field of the Workflow resource:

```yaml
status:
  nodes:
    pipeline-1-8g5bs-3160443754:
      displayName: executor
      outputs:
        artifacts:
        - name: main-logs
          s3:
            key: artifacts/team-1/pipeline-1-8g5bs/2024/06/13/pipeline-1-8g5bs-3160443754/main.log
```

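For illustration, extracting that key from a parsed Workflow status could be sketched as below. The interfaces are simplified stand-ins, not the actual Argo Workflows client types.

```typescript
// Simplified shapes for the Workflow status shown above; the real Argo
// Workflows types carry many more fields.
interface S3Artifact { name: string; s3?: { key: string } }
interface WorkflowNode { displayName: string; outputs?: { artifacts?: S3Artifact[] } }
interface WorkflowStatus { nodes: Record<string, WorkflowNode> }

// Return the S3 key of the "main-logs" output artifact for a node,
// or undefined if the node or artifact is missing.
function mainLogsKey(status: WorkflowStatus, nodeId: string): string | undefined {
  const artifact = status.nodes[nodeId]?.outputs?.artifacts?.find(a => a.name === 'main-logs');
  return artifact?.s3?.key;
}

// Example data mirroring the status snippet above.
const status: WorkflowStatus = {
  nodes: {
    'pipeline-1-8g5bs-3160443754': {
      displayName: 'executor',
      outputs: {
        artifacts: [
          {
            name: 'main-logs',
            s3: { key: 'artifacts/team-1/pipeline-1-8g5bs/2024/06/13/pipeline-1-8g5bs-3160443754/main.log' },
          },
        ],
      },
    },
  },
};

console.log(mainLogsKey(status, 'pipeline-1-8g5bs-3160443754'));
// → artifacts/team-1/pipeline-1-8g5bs/2024/06/13/pipeline-1-8g5bs-3160443754/main.log
```

Because the key is read from the Workflow status rather than reconstructed from a template, this works regardless of the configured keyFormat.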
droctothorpe commented 5 months ago

Completely agree with @thesuperzapper. Having a run fail and not being able to determine the reason why because the logs are not available is a show stopper.

moorthy156 commented 5 months ago

Yes, this is a show stopper; being able to retrieve the logs of completed DAGs is required.

sanchesoon commented 5 months ago

> > A current workaround with the Argo Workflows backend is to use a keyFormat like /artifacts/{workflow_id}/{pod_id}/main.log, patching the frontend's split function and using the v1 handler.
>
> @sanchesoon Can you give more details on this workaround? I am currently using KFP 2.1.0 with the normal argoworkflow backend. I really want a way to provide users the ability to view their pipeline tasks' logs simply by clicking through the KFP UI.

Hi, yes I can. In the Argo Workflows config you should use something like this:

```yaml
    archiveLogs: true
    s3:
        keyFormat: {{ `"pod/log/{{workflow.name}}/{{pod.name}}"` }}
```

And patch the frontend split function in frontend/server/workflow-helper.ts:

```typescript
// Derive the workflow name by stripping the v2 pod-name suffix.
function workflowNameFromPodName(podName: string) {
  return podName.replace(/-system-container-impl-.*/, '');
}
```

And don't forget to configure the Kubeflow frontend proxy (check with `kubectl -n kubeflow describe po ml-pipeline-ui`):

```
      ARGO_ARCHIVE_LOGS:                          true
      ARGO_ARCHIVE_ARTIFACTORY:                   s3
      ARGO_ARCHIVE_PREFIX:                        pod/log
      ARGO_ARCHIVE_BUCKETNAME
      AWS_ACCESS_KEY_ID
      AWS_SECRET_ACCESS_KEY
      AWS_REGION
      AWS_SSL
      AWS_S3_ENDPOINT
      AWS_PORT
```

This was tested in 2.2.0; it should probably work in 2.1.0 too.

droctothorpe commented 5 months ago

We're hoping to take this fix on this sprint.

droctothorpe commented 4 months ago

My team and I are making good progress. We have a functional POC; it just needs some polish. It involves sending the run execution date to the logging handler, since the current AWF configuration stores the log artifacts in date-based subfolders to reduce the likelihood of collisions (see https://github.com/kubeflow/pipelines/pull/5894). The existing code is more or less fine; it's just looking for the logs in the wrong location. The UX for this will be a bit different from v1: logs will not be listed under output artifacts. Instead, when the logs tab is loaded, if the k8s pod is not accessible, it will populate the react component from the artifact store instead.

thesuperzapper commented 4 months ago

@droctothorpe however you do it, make sure that you don't assume anything about the structure of the Argo Workflows s3.keyFormat as that is not guaranteed to be the same in all environments.

For example, some users have changed their key format to a different prefix or path structure which may or may not be the same as the current default in Kubeflow Pipelines.

That's why I said in https://github.com/kubeflow/pipelines/issues/10036#issuecomment-2164166923 that you have to get the true path from the workflow status.

droctothorpe commented 4 months ago

@thesuperzapper, if I understand the logic correctly, the handler does in fact try to get the log artifact path from the workflow status first (if the k8s pods have been garbage collected). It only tries to call the artifact store directly (in a way that currently makes incorrect assumptions about where the log is located) if the workflow has been garbage collected.

The logic tries to retrieve the logs from multiple sources in the following sequence:

  1. It tries to retrieve the logs from the pod (link).
  2. If that fails, it tries to retrieve the logs path by extracting it from the workflow CR (function invocation link, function definition link (with some nesting)).
  3. If that fails and archiveLogs is true (this is set in config.ts), then and only then does it try to retrieve the logs directly from the artifact store without pathing through the workflow CR. Currently, master makes incorrect assumptions about the path (here is where it is constructed) that are at odds with the keyFormat default in the workflow-controller-configmap object.
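That three-step fallback can be sketched roughly as below. The helper names (getPodLogs, getLogsFromWorkflowStatus, getLogsFromArtifactStore) are hypothetical stand-ins, stubbed here, not the actual handlers in workflow-helper.ts.

```typescript
// Stubs simulating the environment: the pod has been deleted, but the
// Workflow CR still exists. Hypothetical stand-ins, not the real handlers.
async function getPodLogs(podName: string, namespace: string): Promise<string> {
  throw new Error(`pod ${podName} in ${namespace} not found`);
}
async function getLogsFromWorkflowStatus(podName: string, namespace: string): Promise<string> {
  return 'logs via workflow status';
}
async function getLogsFromArtifactStore(podName: string): Promise<string> {
  return 'logs via artifact store (path guessed from keyFormat)';
}

// The three-step fallback described above.
async function getContainerLogs(podName: string, namespace: string, archiveLogs: boolean): Promise<string> {
  try {
    return await getPodLogs(podName, namespace); // 1. live pod via the K8s API
  } catch {
    try {
      return await getLogsFromWorkflowStatus(podName, namespace); // 2. artifact path from the Workflow CR
    } catch (err) {
      if (!archiveLogs) throw err; // 3. only guess the path when archiveLogs is enabled
      return getLogsFromArtifactStore(podName);
    }
  }
}

getContainerLogs('addition-pipeline-pb72t-3709849268', 'kubeflow', true)
  .then(logs => console.log(logs)); // → "logs via workflow status"
```

Note that in this scheme step 3 is the only place where the path is constructed by assumption rather than read from the Workflow CR.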

@sanchesoon's suggestion essentially reverts the log location, but https://github.com/kubeflow/pipelines/pull/5894 and the comments in the configmap convey that dates were deliberately added to the path to reduce the likelihood of collisions.

Our fix essentially corrects the assumption and is only applicable if the workflow has been garbage collected.

If a user both modifies the keyFormat and deletes the workflow CR, option 3 will not work. Why would a user want to modify the keyFormat though? Is that a rare enough edge case that solving for it would constitute bike-shedding?

If not, there may be a way to have the handler call the K8s API to look up the workflow-controller-configmap, extract the keyFormat, and translate it somehow in a dynamic way that the function could parse, but that would (a) add latency, and (b) probably be a PITA, especially figuring out how to map the Go templating syntax to specific parameters in the function in a way that properly accounts for the sundry variants in the template tags.

Another option is to store the path in MLMD. That would dramatically broaden the scope of the PR though, and may receive pushback because it entails storing extremely Argo-specific data in the persistent data store.

Another option is adding something similar to keyFormat to config.ts so that it can be controlled dynamically using env vars, which can be injected directly from the chart.
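The last option (a keyFormat-like template driven by env vars) could look roughly like the sketch below. The renderKeyFormat name and the supported tag subset are assumptions for illustration, not an existing KFP API.

```typescript
// Hypothetical sketch: interpolate a documented subset of Argo template
// tags from values the log handler already has at request time.
function renderKeyFormat(template: string, params: Record<string, string>): string {
  // Replace each {{ tag }} with its value; leave unknown tags untouched.
  return template.replace(/\{\{\s*([\w.]+)\s*\}\}/g, (match: string, tag: string) => params[tag] ?? match);
}

const key = renderKeyFormat(
  'artifacts/{{workflow.name}}/{{workflow.creationTimestamp.Y}}/{{pod.name}}/main.log',
  {
    'workflow.name': 'pipeline-1-8g5bs',
    'workflow.creationTimestamp.Y': '2024',
    'pod.name': 'pipeline-1-8g5bs-3160443754',
  },
);
console.log(key); // → artifacts/pipeline-1-8g5bs/2024/pipeline-1-8g5bs-3160443754/main.log
```

The template string itself would come from an env var, so an admin who changes the keyFormat in the workflow-controller-configmap would set the same template on the UI deployment.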


kromanow94 commented 4 months ago

@droctothorpe

If a user both modifies the keyFormat and deletes the workflow CR, option 3 will not work.

If I understand correctly, this is only the case until the log artifact location is saved to the pipeline run artifacts, right? I take that this is the assumption, because it wouldn't be good to require storing the Argo WF CR objects in etcd indefinitely. And if this assumption is right, we can use a very simple K8s mechanism to prevent users from manually deleting the WF CR objects before the log artifact location has been stored in the Pipeline Run Artifacts: finalizers.

If we were to use finalizers, we can just make the assumption that we control when exactly the Argo WF CR is released to be GCed and simplify the scenario a bit.

Why would a user want to modify the keyFormat though? Is that a rare enough edge case that solving for it would constitute bike-shedding?

There can be some requirement or limitation where you'd have to reuse the same S3 Bucket for multiple purposes (like multiple KF instances, or a general bucket for logs from multiple sources) because of how a company/organization manages the infrastructure. In this example you're given an S3 Bucket with a specific prefix where you can save the log artifacts, let's say /artifacts/kubeflow/{{workflow.namespace}}/pod/log/{{workflow.name}}/{{pod.name}}.

Another example would be if each KF Profile is serving a different team and you're hoping to achieve full Profile isolation, where Team A doesn't have any form of access to the logs of Team B. This is similar to what @thesuperzapper pointed out in https://github.com/kubeflow/pipelines/issues/10036#issuecomment-2164166923: artifacts/team-1/pipeline-1-8g5bs/2024/06/13/pipeline-1-8g5bs-3160443754/main.log.

Also, in an enterprise environment you're rarely given full freedom over your infrastructure, and it would be a real issue if something as simple as the keyFormat were a blocker to adopting KF.

droctothorpe commented 4 months ago

@kromanow94 Thanks for the context!

I take that this is the assumption because it wouldn't be too good to have a requirement of storing the Argo WF CR Objects on the etcd indefinitely.

Correct. Workflows get GCed after a specified length of time, as defined by the TTLStrategy field. Even after workflows are deleted, the corresponding logs persist in the artifact store.

Any objections to letting end users specify the typescript equivalent of an f-string through an env var via config.ts? We could document some predetermined variable names that map to what's available in the corresponding handler (e.g. namespace, pod name, date, etc.), and add a comment to the keyFormat field stating that if it's modified, the corresponding field in config.ts should be updated.

The alternative is making the logs explicit output artifacts in MLMD. I would be curious to hear from Google about why that was disabled in v2 before reverting it. Output artifacts are first class citizens with dedicated nodes in the v2 graphs; we probably don't want every component to have a corresponding log node in the DAG, so we'd have to handle that in the graph rendering logic as well.

kromanow94 commented 4 months ago

@droctothorpe

Any objections to letting end users specify the typescript equivalent of an f-string through an env var via config.ts?

More a doubt than an objection - maybe going for the env var via config.ts is just not required at all? If we were to use the finalizer pattern, we assume that it's not required to reverse engineer where the logs might be stored. I think the information about the log URI location should be an integral part of the KFP Step and once that step is finished, we should only reference the KFP Step to get the information where the log artifact is stored.

But maybe there is something obvious I don't see that conflicts with this approach. In the end, what I think makes sense is to somehow save the address of the logs and store it together with the Step Execution. If it can be stored in MLMD, great. If not, is there any other place where we can store it? Maybe this is the part I don't yet fully understand: if the log artifact URI is not saved in MLMD, do we have any other place to store that information, or are we currently forced to guess where the logs might be through config.ts?

droctothorpe commented 4 months ago

if the log artifact URI is not saved in the MLMD, do we have any other place where we can store that information, or currently we're forced to guess where the logs might be through the config.ts?

Currently, the log artifact URI is not saved in MLMD. The UI is just guessing, and guessing incorrectly, heh. If we store it as an output artifact in MLMD, that removes any guesswork but substantially broadens the scope of the changes and will probably require design review from Google, which is doable ofc; it just delays things.

kromanow94 commented 4 months ago

@droctothorpe jfc...

droctothorpe commented 4 months ago

Another argument in favor of storing the log as an output artifact: what if an admin elects to update the keyFormat? The logs for historical runs would no longer be accessible, and there would be no way for the UI to support both the old and new patterns at once.

kromanow94 commented 4 months ago

My personal preference, for maturity and stability, would be to go with MLMD and a Google review, or an alternative store if possible. I'd even investigate whether creating an additional table in the mlpipeline db makes sense, just to store the log artifact URIs for a given Step ID. Considering all the discussion, I don't feel competent enough to suggest whether the workaround with the env var in config.ts is suitable for now... If I were to make this decision myself, I wouldn't do it.

droctothorpe commented 4 months ago

I'm starting to lean in that direction as well, but what the heck do I know, I am just a ~plebe~ user.

Do y'all think it makes sense to submit a smaller, discrete PR that corrects the UI's incorrect guess for the time being, until we can implement a broader fix that adds the log URI to MLMD? We can aim to submit the PR tomorrow (at AWS Summit all day today) to give you a sense of the scope.

kromanow94 commented 4 months ago

If this can be added to KF 1.9, then maybe it makes sense to unblock this with the idea you've proposed with config.ts. If it's not going to be a part of the v1.9 release because of how short on time we are, maybe not so much.

thesuperzapper commented 4 months ago

I just want to highlight what I said in https://github.com/kubeflow/pipelines/issues/10036#issuecomment-2164166923 again.

This is very clearly an accidental regression, because KFP V1 stores the logs in an MLMD artifact (it just did not automatically forward the logs tab to the appropriate artifacts and required users to click the step and view it).

The only additional complexity (which is probably why this was overlooked) is that in KFP v2 each step now involves more than one container, each with its own logs.

Assuming we store the logs for all steps as artifacts (even the non-executor steps), we should show them separately in the logs tab, probably with some kind of sub-tab for each container.

droctothorpe commented 4 months ago

@chensun @zijianjoy @connor-mccarthy Do you have any objections to adding the driver and executor logs to the list of output artifacts for each component? I just want to make sure that removing that functionality was not a deliberate design choice before we re-implement it in v2. Thank you!

droctothorpe commented 4 months ago

I think the executor outputs are being extracted from the IR, and currently the IR doesn't include information about the container logs. We need to decide if we want to include the container logs as executor outputs, which is quite complicated because (a) the path is dynamically inferred by keyFormat (for AWF) and not available client-side at IR compilation time, and (b) is extremely AWF specific in a way that could undermine workflow orchestration agnosticism in the IR.

droctothorpe commented 4 months ago

POC for the interim solution: https://github.com/kubeflow/pipelines/pull/11010. This will make it possible for anyone to implement a solution in a way that accommodates their unique keyFormat syntax without having to modify the actual code.

I plan to join the community call tomorrow to hear from Google about the proposed target-state solution of storing logs as output artifacts.

droctothorpe commented 3 months ago

https://github.com/kubeflow/pipelines/pull/11010 was just merged 🎉. It might make sense to keep this issue open to continue discussion about whether or how to surface driver and executor logs as explicit outputs in the GUI.

kromanow94 commented 2 months ago

Hey, I noticed the PR is already included in 2.3.0. I'm trying to run some tests, but I get the following error and I'm not able to get the logs from the pod step. I'm having two issues:

The ml-pipeline-ui Pod is exiting the main process with the following error:

```
/server/node_modules/node-fetch/lib/index.js:1491
                        reject(new FetchError(`request to ${request.url} failed, reason: ${err.message}`, 'system', err));
                               ^
FetchError: request to http://metadata/computeMetadata/v1/instance/attributes/cluster-name failed, reason: getaddrinfo ENOTFOUND metadata
    at ClientRequest.<anonymous> (/server/node_modules/node-fetch/lib/index.js:1491:11)
    at ClientRequest.emit (node:events:517:28)
    at Socket.socketErrorListener (node:_http_client:501:9)
    at Socket.emit (node:events:517:28)
    at emitErrorNT (node:internal/streams/destroy:151:8)
    at emitErrorCloseNT (node:internal/streams/destroy:116:3)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  type: 'system',
  errno: 'ENOTFOUND',
  code: 'ENOTFOUND'
}
```

In the short period of time when the ml-pipeline-ui is responsive, before it exits with the above error, I'm getting this error:

```
Could not get main container logs: Error: Unable to retrieve workflow status: [object Object].
```

Do you know what might be causing the issue?

kromanow94 commented 2 months ago

FYI, I was testing with kubeflow/manifests 1.9.0 and only changed the container image tags for the pipeline components to 2.3.0.

droctothorpe commented 2 months ago

@kromanow94, someone in the (deprecated) Kubeflow slack workspace recently DMed me reporting what looks like a very similar error and a corresponding solution:

> Could not get main container logs: Error: Unable to retrieve workflow status: [object Object].

That error is raised here. It is raised when the workflow corresponding to the run has been garbage collected.

When that fails, it should move on to trying to retrieve the logs directly from the artifact store. That sequencing (K8s API lookup -> WF manifest lookup -> artifact store lookup) is defined here. Note that the third step only executes if archiveLogs is true.

kromanow94 commented 2 months ago

Thanks, I solved the issue with the ml-pipeline-ui with the following workaround:

```shell
$ kubectl -n kubeflow expose deployment ml-pipeline --name metadata --dry-run=client -oyaml | yq 'del(.metadata.labels)' -y | kubectl apply -f -
```

It seems that I forgot about the env ARGO_ARCHIVE_LOGS="true" as well.

Now, with those two issues resolved, there is another issue: it seems the frontend uses the pod name in place of the workflow name to get the logs.

The Argo Workflows and ml-pipeline-ui are set with the following key format:

```
artifacts/{{workflow.name}}/{{workflow.creationTimestamp.Y}}/{{workflow.creationTimestamp.m}}/{{workflow.creationTimestamp.d}}/{{pod.name}}
```

Now, I'm trying a scenario where I manually delete just one Workflow Pod and check whether I can still get the logs. Running it with curl gives the following error:

```shell
$ curl "https://kubeflow.example.com/pipeline/k8s/pod/logs?podname=addition-pipeline-pb72t-3709849268&runid=62911214-2e5c-46b4-ae26-d660809b8a5b&podnamespace=glo-k8s-admin&createdat=2024-09-10" -H "Authorization: Bearer $(kubectl -n glo-k8s-admin create token default-editor)"
Could not get main container logs: S3Error: 503
```

Looking at the logs from the ml-pipeline-ui, I see the following lines:

```
GET /pipeline/k8s/pod/logs?podname=addition-pipeline-pb72t-3709849268&runid=62911214-2e5c-46b4-ae26-d660809b8a5b&podnamespace=glo-k8s-admin&createdat=2024-09-10
Getting logs for pod, addition-pipeline-pb72t-3709849268, from mlpipeline/artifacts/addition-pipeline-pb72t-3709849268/2024/09/10/addition-pipeline-pb72t-3709849268/main.log.
```

But looking at the storage in minio, the actual path differs in the workflow-name segment:

```
mlpipeline/artifacts/addition-pipeline-pb72t/2024/09/10/addition-pipeline-pb72t-3709849268/main.log
```

I also ran the same test when both the Argo WF and its Pods were deleted, with the same result.

kromanow94 commented 2 months ago

I also ran another scenario with one of the default pipelines. A Run of [Tutorial] DSL - Control structures creates a set of Pods with the following names:

```
tutorial-control-flows-gbjpn-system-dag-driver-1941995042
tutorial-control-flows-gbjpn-system-dag-driver-2529298331
tutorial-control-flows-gbjpn-system-container-driver-3821465773
tutorial-control-flows-gbjpn-system-container-driver-467613593
tutorial-control-flows-gbjpn-system-container-impl-2018360503
tutorial-control-flows-gbjpn-system-dag-driver-3793006146
tutorial-control-flows-gbjpn-system-container-driver-1781539373
tutorial-control-flows-gbjpn-system-dag-driver-1653906037
tutorial-control-flows-gbjpn-system-dag-driver-477249553
tutorial-control-flows-gbjpn-system-dag-driver-3564565339
tutorial-control-flows-gbjpn-system-dag-driver-511401279
tutorial-control-flows-gbjpn-system-container-impl-1647980155
tutorial-control-flows-gbjpn-system-container-driver-133111245
tutorial-control-flows-gbjpn-system-container-impl-902191899
tutorial-control-flows-gbjpn-system-dag-driver-2964226956
tutorial-control-flows-gbjpn-system-dag-driver-820485803
tutorial-control-flows-gbjpn-system-container-driver-2311769198
tutorial-control-flows-gbjpn-system-container-impl-3092937256
```

With the pipeline above, I confirm the mechanism works: I was able to get the logs through the artifacts when the Pod was no longer available on the cluster.

For my other test I used a pipeline created with the following script:

```python
from kfp import dsl
import kfp

client = kfp.Client()

experiment_name = "my-experiment"
experiment_namespace = "glo-k8s-admin"

@dsl.component
def add(a: float, b: float) -> float:
    """Calculates sum of two arguments"""
    return a + b

@dsl.pipeline(
    name="Addition pipeline",
    description="An example pipeline that performs addition calculations.",
)
def add_pipeline(
    a: float = 1.0,
    b: float = 7.0,
):
    first_add_task = add(a=a, b=4.0)
    second_add_task = add(a=first_add_task.output, b=b)

try:
    print("getting experiment...")
    experiment = client.get_experiment(
        experiment_name=experiment_name, namespace=experiment_namespace
    )
    print("got experiment!")
except Exception:
    print("creating experiment...")
    experiment = client.create_experiment(
        name=experiment_name, namespace=experiment_namespace
    )
    print("created experiment!")

client.create_run_from_pipeline_func(
    add_pipeline,
    arguments={"a": 7.0, "b": 8.0},
    experiment_id=experiment.experiment_id,
    enable_caching=False,
)
```

This creates Pods with the following names (they differ from the previous comment's example because it's a new run):

```
addition-pipeline-8fbx5-1329381822
addition-pipeline-8fbx5-2476489665
addition-pipeline-8fbx5-2909346888
addition-pipeline-8fbx5-579177391
addition-pipeline-8fbx5-822281223
```

If I understand correctly, this is because the workflow name is derived by reformatting the pod name:

https://github.com/kubeflow/pipelines/blob/2.3.0/frontend/server/workflow-helper.ts#L169

https://github.com/kubeflow/pipelines/blob/2.3.0/frontend/server/workflow-helper.ts#L214

Is this because of a difference in how KFP v1 and KFP v2 work?
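A plausible explanation, assuming the helper uses the regex quoted earlier in this thread: pod names without the -system-container-impl- infix pass through unchanged, so the full pod name ends up used as the "workflow name" segment of the artifact path.

```typescript
// The regex-based derivation from the patch shown earlier: strip the
// v2 "-system-container-impl-…" suffix to recover the workflow name.
function workflowNameFromPodName(podName: string): string {
  return podName.replace(/-system-container-impl-.*/, '');
}

// v2-style pod name: suffix stripped, workflow name recovered.
console.log(workflowNameFromPodName('tutorial-control-flows-gbjpn-system-container-impl-2018360503'));
// → tutorial-control-flows-gbjpn

// Suffix-less pod name (as in the Addition pipeline run): no match, so
// the full pod name is returned and the artifact path gets the wrong
// "workflow name" segment.
console.log(workflowNameFromPodName('addition-pipeline-8fbx5-1329381822'));
// → addition-pipeline-8fbx5-1329381822
```

This matches the observed paths: the lookup used addition-pipeline-pb72t-3709849268 where minio had addition-pipeline-pb72t.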

droctothorpe commented 2 months ago

v2 pod names include suffixes like those in the first example you shared. Was the Addition pipeline you linked authored/submitted with KFP v1? If so, the logs should be stored as proper output artifacts that you can access in the Inputs/Outputs tab when you click on the node. That being said, it shouldn't be too difficult to add support for this alternative log retrieval mechanism in v1 if there is demand for it. It's a long-standing issue in v1, but v1 is in somewhat of a frozen state and there's an alternative solution in place (the Inputs/Outputs tab), so this PR didn't fix the logs tab in v1.

droctothorpe commented 2 months ago

Also, in case it's related, please make note of this issue: https://github.com/kubeflow/pipelines/issues/11019.

droctothorpe commented 2 months ago

That being said, I'm continuing to do some extra validation on my end to see if I can repro the error.

kromanow94 commented 2 months ago

Hey @droctothorpe , were you able to have a closer look here?

TBH, I'm not exactly sure whether the pipeline I created using the script from https://github.com/kubeflow/pipelines/issues/10036#issuecomment-2340206338 is KFP v1 or KFP v2. I deployed it using kfp==2.8.0, so I assumed it's KFP v2. Do you know if I should change anything in the script to explicitly configure it for KFP v2?

droctothorpe commented 2 months ago

When I ran that same pipeline, my pods had different names (i.e. they adhered to the v2 nomenclature):


Are you positive that you submitted the run with v2?

droctothorpe commented 2 months ago

Here are the logs in the UI for that pipeline even after the pods and workflow CR have been GCed:

droctothorpe commented 2 months ago

> Hey, I noticed the PR is already included in the 2.3.0. I'm trying to run some tests but I get the following error and I'm not able to get the logs from the Pod Step. [...]
>
> FetchError: request to http://metadata/computeMetadata/v1/instance/attributes/cluster-name failed, reason: getaddrinfo ENOTFOUND metadata [...]
>
> In the short period of time when the ml-pipeline-ui is responsive before it's exiting with the above mentioned error code, I'm getting this error:
>
> Could not get main container logs: Error: Unable to retrieve workflow status: [object Object].
>
> Do you know what might be causing the issue?

FYI, adding the following env var to the ml-pipeline-ui might be an easier fix for this error:

```yaml
- name: DISABLE_GKE_METADATA
  value: "true"
```

Shout out to @boarder7395 for sharing it.

boarder7395 commented 2 months ago

@kromanow94 I think you might be having the same issue I did.

The two changes required for everything to work are:

What I missed was the second item, which resulted in the issues you had above.

kromanow94 commented 3 weeks ago

Hey, I was able to get the main.log artifact after the Workflow and Pods were deleted, using the latest kubeflow/manifests v1.9.1, which uses Kubeflow Pipelines 2.3.0.

I only had to set these two env variables on the ml-pipeline-ui Deployment and everything works:

```yaml
- name: ARGO_ARCHIVE_LOGS
  value: "true"
- name: DISABLE_GKE_METADATA
  value: "true"
```

Though still a workaround in code, it works.

Are there any plans for a proper solution that registers the main.log as an artifact?