kubeflow / katib

Automated Machine Learning on Kubernetes
https://www.kubeflow.org/docs/components/katib
Apache License 2.0

[WIP] Add e2e test for `tune` api with LLM hyperparameter optimization #2420

Open helenxie-bit opened 2 months ago

helenxie-bit commented 2 months ago

What this PR does / why we need it:

This PR adds an e2e test for the `tune` API, specifically for the scenario of importing external models and datasets for LLM hyperparameter optimization.

Which issue(s) this PR fixes _(optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged)_: Fixes #

Checklist:

google-oss-prow[bot] commented 2 months ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please assign johnugeorge for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/kubeflow/katib/blob/master/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

helenxie-bit commented 2 months ago

/area gsoc

helenxie-bit commented 2 months ago

Ref: https://github.com/kubeflow/katib/issues/2339

helenxie-bit commented 2 months ago

The e2e test for the `tune` API has been consistently failing with a "Timeout Error," and I have been investigating the root cause. I set the `retain_trials` parameter to `True` and retrieved the logs from the pod in the Experiment. The logs revealed that both the `pytorch` container and the `metrics-logger-and-collector` container exited with Error 137.
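
For context, and assuming the 137 above is the container's exit code, the usual shell convention encodes termination by a signal as 128 plus the signal number, so 137 would correspond to signal 9 (SIGKILL). The following minimal Python sketch decodes such codes; `describe_exit_code` is a hypothetical helper for illustration, not part of Katib or Kubernetes.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain an exit code using the 128 + signal-number convention.

    Hypothetical helper for illustration only.
    """
    if code == 0:
        return "exited normally"
    if code > 128:
        signum = code - 128
        try:
            # Look up the symbolic name on the current platform.
            name = signal.Signals(signum).name
        except ValueError:
            # The signal number is not defined on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with application error {code}"


print(describe_exit_code(137))
```

A SIGKILL can come from the kernel OOM killer or from the container runtime force-killing a container after its grace period, so the code alone does not identify the cause.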

When I ran `kubectl describe pod $POD_NAME -n default`, I noticed the following events. One specific event, "SandboxChanged," stood out as potentially problematic:

Events:
  Type    Reason          Age                    From               Message
  ----    ------          ----                   ----               -------
  ...
  Normal  SandboxChanged  3m (x2 over 3m43s)     kubelet            Pod sandbox changed, it will be killed and re-created.
  ...

However, when I checked the pod logs using `kubectl logs $POD_NAME -n default --all-containers`, everything appeared normal, and the logs confirmed that "Training is complete."

I also examined the kubelet and container runtime logs. While the kubelet logs provided no additional insights, the container runtime logs displayed the following error, which I believe may be related to the issue:

Sep 29 19:59:04 fv-az1986-610 dockerd[3342]: time="2024-09-29T19:59:04.631799544Z" level=info msg="Container failed to exit within 30s of signal 15 - using the force" container=bfff1b5f24d7ebcdc51d0dabe807e391053c4a4065a404203e266c5341bbfbbe spanID=6c2dd21dc1394346 traceID=738eda3fc86a653490d5534deb664c93
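
The runtime message says the container did not exit within 30 seconds of signal 15 (SIGTERM), after which it was force-killed. One possible explanation, which these logs do not confirm, is that the container's main process never acts on SIGTERM; on Linux, a process running as PID 1 in a container ignores signals for which it has not installed a handler. The sketch below shows a SIGTERM handler in Python under that assumption; `handle_sigterm`, `stop_requested`, and `train` are hypothetical names, not Katib or training-operator code.

```python
import signal
import threading

# Set when SIGTERM arrives; the hypothetical training loop polls it.
stop_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record the shutdown request so the main loop can stop cleanly."""
    stop_requested.set()


# Must be called from the main thread of the main interpreter.
signal.signal(signal.SIGTERM, handle_sigterm)


def train(max_steps: int) -> int:
    """Hypothetical training loop that stops once SIGTERM has been received.

    Returns the number of steps actually completed.
    """
    steps_done = 0
    for _ in range(max_steps):
        if stop_requested.is_set():
            break
        steps_done += 1  # placeholder for one real training step
    return steps_done
```

If the process exits soon after the flag is set, the runtime has no reason to escalate to a forced kill. Whether this applies to the failing trial pod would still need to be checked against the actual container entrypoint.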

@andreyvelich @tenzen-y Do you have any thoughts on how to resolve this issue?