prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0
15.89k stars 5.32k forks source link

Inconsistent http response between presto-native and presto-spark during error #19609

Open vermapratyush opened 1 year ago

vermapratyush commented 1 year ago

The interaction between presto-native cpp server and presto-spark during unhappy flows is not consistent.

In the API TaskResource#createOrUpdateTaskImpl (https://github.com/prestodb/presto/blob/e1bb41a2ffb183c822c0eb27b8ed2c8c04e088d5/presto-native-execution/presto_cpp/main/TaskResource.cpp#L237), when a VELOX exception is thrown response has a status code of 200 with the message body containing exception error message. When the API throws non-velox exception, a 5XX response is returned with the exception's error message (but no stacktrace).

In TaskResource::deleteTask (https://github.com/prestodb/presto/blob/e1bb41a2ffb183c822c0eb27b8ed2c8c04e088d5/presto-native-execution/presto_cpp/main/TaskResource.cpp#L330), Velox and non-velox exception, both are handled the same way and 500 response is sent along with error message (yet no stacktrace).

There are few problems with this approach:

  1. Clients needs to handle both the scenarios differently adding to complexity. This affects how retries are to be handled.
  2. The lack of stacktrace in the error message makes it difficult to debug issues.

Some proposal to the problem

  1. Make the HTTP APIs RESTful, this would mean leverage the HTTP response appropriately.
    1. Any internal server error should have 5XX status code, this includes velox and non-velox error
    2. Any user error is 4XX, example: VELOXUSER* are 4xx.
  2. 4XX and 5XX are allowed to have http response body, continue to pass appropriate error message to the client for error categorisation.
  3. ~Use boost::exception to throw exceptions. boost::exception provides a much structured format for exception and includes stacktrace.~ VELOX_CHECK and VELOX_USERCHECK already captures stack trace, extend the functionality and surface it as PRESTO* macros with mirroring funcitonality.

If we go ahead with proposal 1 and 2, would we need to modify how presto coordinator handles http status code?

vermapratyush commented 1 year ago

cc @mbasmanova

mbasmanova commented 1 year ago

CC: @spershin @amitkdutta @tanjialiang @xiaoxmeng

mbasmanova commented 1 year ago

Proposals 1 and 2 make sense to me. Let's make these changes unless there are objections.

Use boost::exception to throw exceptions. boost::exception provides a much structured format for exception and includes stacktrace.

This one I'm not sure about. In Velox, we use VELOX_CHECK and VELOX_USER_CHECK macros to throw exceptions. These generate stacktraces as long as that functionality is enabled via gflags.

// Used in common/base/VeloxException.cpp
DEFINE_bool(
    velox_exception_user_stacktrace_enabled,
    false,
    "Enable the stacktrace for user type of VeloxException");

DEFINE_bool(
    velox_exception_system_stacktrace_enabled,
    true,
    "Enable the stacktrace for system type of VeloxException");

DEFINE_int32(
    velox_exception_user_stacktrace_rate_limit_ms,
    0, // effectively turns off rate-limiting
    "Min time interval in milliseconds between stack traces captured in"
    " user type of VeloxException; off when set to 0 (the default)");

DEFINE_int32(
    velox_exception_system_stacktrace_rate_limit_ms,
    0, // effectively turns off rate-limiting
    "Min time interval in milliseconds between stack traces captured in"
    " system type of VeloxException; off when set to 0 (the default)");
mbasmanova commented 1 year ago

CC: @miaoever

mbasmanova commented 1 year ago

CC: @majetideepak

vermapratyush commented 1 year ago

I see, VELOX_CHECK and VELOX_USERCHECK indeed captures the stacktrace. In that case, we could extend and provide PRESTO macros which mirror their respective VELOX_ functionality.

vermapratyush commented 1 year ago

If we go ahead with proposal 1 and 2, would we need to modify how presto coordinator handles http status code?

amitkdutta commented 1 year ago

@vermapratyush Thanks for creating the issue. This is a great finding. Looking at https://github.com/prestodb/presto/blob/e1bb41a2ffb183c822c0eb27b8ed2c8c04e088d5/presto-native-execution/presto_cpp/main/TaskResource.cpp#L237-L244

It appears the worker is relying on coordinator to classify the error. It creates an empty task and with exception details.

I am thinking why we don't use same thing for non velox exception as well. Basically create an empty tasks and put message for following sections. Then co-ordinator needs no change.

        } catch (const std::exception& e) {
          http::sendErrorResponse(downstream, e.what());
          return;
        }
vermapratyush commented 1 year ago

@amitkdutta

I am thinking why we don't use same thing for non velox exception as well. Basically create an empty tasks and put message for following sections. Then co-ordinator needs no change.

The assumption here is that coordinator does not rely on http status code to make decisions and we are ok with changing http status code from 5xx to 2xx along with change to payload structure (only for current 5xx response).

Also, if we go this route, what do we do for other APIs like TaskResource::deleteTask. They will still be returning 5XX with a non-json response body, which would be different from TaskResource::createOrUpdateTask

amitkdutta commented 1 year ago

@vermapratyush Makes sense. I think our best option is to read the coordinator side code and decide based on that. Otherwise we end with random items :(. CC: @pranjalssh @rschlussel @arunthirupathi for design help here.

rschlussel commented 1 year ago

A quick look at https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/server/remotetask/HttpRemoteTask.java makes me think that we basically only have a failure response when there was some connection issue and execution failures will just show that info in the TaskStatus itself (maybe TaskState and ExecutionFailureInfo fields) https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/execution/TaskStatus.java. But I would defer to @ajaygeorge who knows much more about the worker <-> coordinator communication.

ajaygeorge commented 1 year ago

@vermapratyush Answering your offline question below .

How is the response body of a 5XX response processed in coordinator code?

Failure handling is different for the JSON path and thrift path. for json we have the SimpleHttpResponseHandler and for thrift we have a ThriftHttpResponseHandler assuming prestissimo is still on JSON, see how it is handled in SimpleHttpResponseHandler https://github.com/prestodb/presto/blob/3513d076c53e74d01871a163be304855f3194b8e/presto-main/src/main/java/com/facebook/presto/server/SimpleHttpResponseHandler.java#L53-L73