Open helenxie-bit opened 2 months ago
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign johnugeorge for approval. For more information see the Kubernetes Code Review Process.
/area gsoc
The e2e test for the `tune` API has been consistently failing due to a "Timeout Error," and I have been investigating the root cause. I set the `retain_trials` parameter to `True` and retrieved the logs from the pod in the Experiment. The logs revealed that both the `pytorch` container and the `metrics-logger-and-collector` container exited with an Error 137.
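For context on the exit code: by the usual shell and container-runtime convention, a status above 128 means the process was terminated by signal number `status - 128`, so 137 corresponds to signal 9 (SIGKILL). A minimal sketch of that decoding follows; the helper name `describe_exit_code` is hypothetical and not part of Katib or Kubernetes.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a human-readable description.

    Hypothetical helper for illustration only. It relies on the common
    convention that codes above 128 mean "killed by signal (code - 128)".
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    if code == 0:
        return "exited normally"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

On Linux this reports SIGKILL for 137, which by itself does not say who sent the signal (the container runtime, the kubelet, or the kernel OOM killer are all possible senders).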
When I ran `kubectl describe pod $POD_NAME -n default`, I noticed the following events. One specific event, "SandboxChanged," stood out as potentially problematic:
Events:
  Type    Reason          Age                 From     Message
  ----    ------          ----                ----     -------
  ...
  Normal  SandboxChanged  3m (x2 over 3m43s)  kubelet  Pod sandbox changed, it will be killed and re-created.
  ...
However, when I checked the pod logs using `kubectl logs $POD_NAME -n default --all-containers`, everything appeared normal, and the logs confirmed that "Training is complete."
I also examined the kubelet and container runtime logs. While the kubelet logs provided no additional insights, the container runtime logs displayed the following error, which I believe may be related to the issue:
Sep 29 19:59:04 fv-az1986-610 dockerd[3342]: time="2024-09-29T19:59:04.631799544Z" level=info msg="Container failed to exit within 30s of signal 15 - using the force" container=bfff1b5f24d7ebcdc51d0dabe807e391053c4a4065a404203e266c5341bbfbbe spanID=6c2dd21dc1394346 traceID=738eda3fc86a653490d5534deb664c93
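The dockerd message says the container was sent signal 15 (SIGTERM), did not exit within the 30s grace period, and was then force-killed, which would be consistent with the exit code 137 seen above. Whether the processes in this pod actually ignore SIGTERM is not established by these logs. As an illustration only, a long-running Python process can react to SIGTERM and stop before the grace period expires along these lines; `handle_sigterm`, `install_handler`, `run`, and the `step` callback are hypothetical names, not code from this PR.

```python
import signal
import threading

# Set when SIGTERM arrives so the main loop can stop cleanly.
stop_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record the termination request; the loop below does the actual stopping."""
    stop_requested.set()


def install_handler():
    """Register the handler. Must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def run(step, max_steps):
    """Call `step` until `max_steps` is reached or SIGTERM has been received.

    `step` is a hypothetical callback standing in for one unit of work.
    Returns the number of steps that were completed.
    """
    completed = 0
    while completed < max_steps and not stop_requested.is_set():
        step()
        completed += 1
    return completed
```

A process would call `install_handler()` once at startup and then `run(...)`; when SIGTERM arrives, the loop finishes the current step and returns instead of waiting to be killed.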
@andreyvelich @tenzen-y Do you have any thoughts on how to resolve this issue?
What this PR does / why we need it: This PR adds an e2e test for the `tune` API, specifically for the scenario of importing external models and datasets for LLM hyperparameter optimization.

Which issue(s) this PR fixes _(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)_: Fixes #

Checklist: