ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[jobs] Return status code after job is completed #38142

Open richardliaw opened 1 year ago

richardliaw commented 1 year ago

Looks like the return code of a job isn't recorded in Ray. Could we log it or show it in the dashboard? https://github.com/ray-project/ray/blob/5470671c5e5e14ed4afbb52ac4118accc1789cfd/dashboard/modules/job/job_manager.py#L449-L466

This would be better reflected on the Ray Dashboard to help users understand errors.

will leave it to @alanwguo to triage this.

architkulkarni commented 1 year ago

The logging part is done by https://github.com/ray-project/ray/pull/37273/

scottsun94 commented 1 year ago

The job exit code is logged and shown in the job driver logs, I guess? If so, users can already see it in the driver logs in the Ray dashboard.

sudhirn-anyscale commented 1 year ago

Can we tell the customer this is resolved? Also, I am referring to the original Slack thread here.

architkulkarni commented 1 year ago

For the purposes of that thread it's resolved (make the exit code appear somewhere in the logs). It's resolved in the Ray nightly and in Ray 2.7.

But we'll leave this issue open to track the enhancement which is to show it in the dashboard.

sudhirn-anyscale commented 1 year ago

Thanks @architkulkarni. Do you know which release the enhancement (showing the exit code in the Ray dashboard) is planned for?

architkulkarni commented 1 year ago

Not sure about the planning for the dashboard part, perhaps @alanwguo knows.

sudhirn-anyscale commented 1 year ago

@alanwguo - Following up on this

scottsun94 commented 1 year ago

image

Actually, we already allow users to view the job message if the job failed. The return code is logged both in the logs and in the message, according to https://github.com/ray-project/ray/pull/37273. I think the dashboard part is already there and we can close this. We don't have to show the status code separately. cc: @architkulkarni to confirm.

architkulkarni commented 1 year ago

I see, I think that's fine as a minimal way to get the exit code. A few thoughts:

@sudhirn-anyscale is the status quo enough for the users you're dealing with?

sudhirn-anyscale commented 1 year ago

@architkulkarni - Ideally the customer would like to make an SDK call on a job and see a return code in one of the status fields. It does not have to be displayed on the dashboard.

They would like to avoid searching the logs for an error code, because a search for the return code in the logs could match anything.

architkulkarni commented 1 year ago

@sudhirn-anyscale I see, that will be added by https://github.com/ray-project/ray/pull/39675, which will be in Ray 2.8. The exit code will appear in the JobInfo returned by the CLI command ray job info and the SDK method get_job_info.
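
For anyone landing here later, a minimal sketch of reading the exit code through the SDK rather than searching logs. It assumes the field on the returned job info object is named driver_exit_code (check the JobInfo definition for your Ray version); the helper name job_exit_code is made up for illustration, and the dashboard address in the usage comment is a placeholder.

```python
from typing import Any, Optional


def job_exit_code(client: Any, submission_id: str) -> Optional[int]:
    """Return the driver exit code for a submitted job, or None if unavailable.

    `client` is expected to behave like ray.job_submission.JobSubmissionClient,
    i.e. it exposes get_job_info(submission_id).
    """
    info = client.get_job_info(submission_id)
    # Assumption: the field is called `driver_exit_code`. getattr with a
    # default keeps this from raising on Ray versions that lack the field.
    return getattr(info, "driver_exit_code", None)


# Usage against a running cluster (address and ID are placeholders):
#   from ray.job_submission import JobSubmissionClient
#   client = JobSubmissionClient("http://127.0.0.1:8265")
#   code = job_exit_code(client, "<submission_id>")
#   if code is not None and code != 0:
#       print(f"Job driver exited with code {code}")
```

Returning None when the field is missing or not yet set lets callers tell "no exit code recorded" apart from a successful exit code of 0.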

sudhirn-anyscale commented 1 year ago

Thanks @architkulkarni . That answers what I was looking for.

scottsun94 commented 1 year ago

It looks a little weird that it's not labeled "Exit code: 42". If it's just a plain number, it might be confused for the last line of the user script's output (could be bad if they print out a list of numbers and intend to use the last number as their calculation result).

This is not how it looks now. I just want to show that we already have a way to display the message field of the job, and the exit code will be logged there.

Ideally it would appear in the GUI somewhere near Status: FAILED.

I think that the message button is good enough for now. We can add it if needed in the future. We can keep it open to track this.