ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Ray release glue.py] `Error fetching logs: Must specify scd_id to fetch command logs. Did you already kick off a command?` #31877

Closed by cadedaniel 1 year ago

cadedaniel commented 1 year ago

I'm seeing this in a few release tests and I'm not sure how impactful it is. Example: https://buildkite.com/ray-project/release-tests-branch/builds/1305#0185d81d-437e-49ff-9103-4bf92a4d11ea

[ERROR 2023-01-21 22:46:41,320] glue.py: 360  Error fetching logs: Must specify scd_id to fetch command logs. Did you already kick off a command?
--
  | ERROR:ray_release.logger:Error fetching logs: Must specify scd_id to fetch command logs. Did you already kick off a command?

cc @krfricke

krfricke commented 1 year ago

cc @Yard1

Yard1 commented 1 year ago

This issue happens when no command was run on the cluster. At least for the linked test, it looks like the cluster was terminated seconds after it started running, leaving the command runner no chance to start the command. I think the timeout may simply be too low.
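To illustrate the failure mode described above, here is a minimal sketch. It is not the actual `ray_release` code: `SketchCommandRunner`, `LogFetchError`, and `fetch_logs_or_report` are hypothetical names, and the command ID is faked. The only thing taken from the issue is the error text. The assumption is that the runner only learns a command ID once a command is submitted, so asking for logs before that point has nothing to look up.

```python
from typing import Optional


class LogFetchError(RuntimeError):
    """Raised when logs are requested before any command was started."""


class SketchCommandRunner:
    """Hypothetical stand-in for a release-test command runner."""

    def __init__(self) -> None:
        # Stays None until a command is actually submitted to the cluster.
        self.last_command_id: Optional[str] = None
        self._submitted = 0

    def run_command(self, command: str) -> None:
        # A real runner would submit the command and store the ID it gets
        # back. Here the ID is faked with a simple counter.
        self._submitted += 1
        self.last_command_id = f"scd_{self._submitted}"

    def get_last_logs(self) -> str:
        # Without a command ID there is nothing to fetch logs for.
        if self.last_command_id is None:
            raise LogFetchError(
                "Must specify scd_id to fetch command logs. "
                "Did you already kick off a command?"
            )
        return f"<logs for {self.last_command_id}>"


def fetch_logs_or_report(runner: SketchCommandRunner) -> str:
    # Turn the failure into a message instead of letting it propagate,
    # similar in spirit to the "Error fetching logs" line in the report.
    try:
        return runner.get_last_logs()
    except LogFetchError as exc:
        return f"Error fetching logs: {exc}"
```

If the cluster is torn down before `run_command` is ever called, `fetch_logs_or_report` returns the error message. Once a command has been submitted, it returns the placeholder log string instead.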

xwjiang2010 commented 1 year ago

Came across the same thing: https://buildkite.com/ray-project/release-tests-branch/builds/1313#0185f02a-4915-412e-8dc8-21953b417b90. Command history is empty, and the cluster was killed 3 seconds after it started running.

One suspicion, after discussing with @krfricke, is that the command submission failed and the cluster was then terminated because of that.

cadedaniel commented 1 year ago

Seen again here: https://buildkite.com/ray-project/release-tests-branch/builds/1392#01863e31-cbda-47f0-852d-887d37e41bb2

can-anyscale commented 1 year ago

sdk_runner has been deprecated, and the Anyscale log has been improved.