ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
32.95k stars 5.58k forks source link

[Core] Fix ray::Status <--> gRPC status interplay. #14278

Open clarkzinzow opened 3 years ago

clarkzinzow commented 3 years ago

There isn’t a clean ray::Status <--> gRPC status conversion for all ray::Statuses, yet we're pretending that there is by recasting every server-side ray::Status as an IOError client-side. This is very confusing to Ray devs when they, say, return a Status::ObjectNotFound status from the server that is recast as a generic Status::IOError on the client. We should fix this conversion so application code can properly exchange and interpret application-level errors, while maintaining support for transport-level gRPC errors.

Options

Two options are immediately apparent:

  1. Include application-level errors such as ray::Status in our proto payloads. Transport-level errors would still be handled via the gRPC status, but application-level errors that don’t map to gRPC status codes would be defined in the reply proto, alongside the normal payload. How aggressively we should try to map a subset of ray::Statuses to gRPC statuses requires some thought, e.g. should an "object not found" error be an application-level Status::ObjectNotFound error, or should that be mapped to the NOT_FOUND gRPC status code at the transport level? The best practice consensus is to avoid defining "specific resource has X state" codes when a generic "resource has X state" code exists, i.e. that we should use the NOT_FOUND gRPC status code where possible.
  2. Embrace the richer gRPC error model, which would keep transport-level and application-level errors consolidated in the gRPC status. The application-level ray::Status-esque errors would go into the error details. We would still do a best-effort mapping of ray::Status codes to gRPC status codes, falling back to gRPC status code UNKNOWN when a good mapping doesn't exist. Support for the richer error model exists for our core language (C++), our current frontend (worker) languages (Python, Java, C++), and potential future core languages (Go, Rust), but no support yet for grpc-web or Node.js. I think that this support will suffice for our needs, especially given that the richer error model should be opt-in for each RPC.

I believe that option (2), the richer error model, is the best approach.

rkooo567 commented 3 years ago

Added to core bug not to lose track of it. Feel free to remove it if you think it is not appropriate @ericl

ericl commented 3 years ago

cc @raulchen for thoughts