Closed (cadedaniel closed this 1 year ago)
cc @Yard1
This issue happens when no command was run on the cluster. At least for the linked test, it looks like the cluster was terminated seconds after it started running, leaving no time for the command runner to start the command. I think the timeout may simply be too low.
Came across the same thing: https://buildkite.com/ray-project/release-tests-branch/builds/1313#0185f02a-4915-412e-8dc8-21953b417b90 The command history is empty, and the cluster was killed 3 seconds after it started running.
One suspicion after discussing with @krfricke is that the command submission failed and then the cluster just got terminated because of that.
sdk_runner has been deprecated, and the anyscale log has been improved.
I'm seeing this in a few release tests, not sure how impactful it is: https://buildkite.com/ray-project/release-tests-branch/builds/1305#0185d81d-437e-49ff-9103-4bf92a4d11ea
cc @krfricke