Is it more reasonable to return 'Service Unavailable' than Internal Server Error when the server is overloaded?

Felix-zhoux commented 11 months ago

In the alternator interface, when the current server is under heavy pressure, some requests may cause timeout exception errors. At this time, returning http code 503 Service Unavailable will make it easier for the client to perceive what is happening compared to http code 500 Internal Server Error. Compared with the service quality after the load is reduced, It can heal itself. In other words, service unavailability is temporary.

ref:https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Programming.Errors.html#Programming.Errors.MessagesAndCodes:~:text=Service%20Unavailable%20(HTTP%20503)

nyh commented 11 months ago

I agree that the "500 Internal Server Error" seems very vague (can mean a lot of different things), and perhaps even scary, but I also want us to consider the following:

The DynamoDB's documentation https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Programming.Errors.html#Programming.Errors.MessagesAndCodes suggests that both for Internal Server Error (500) and Service Unavailable (503), the client's behavior should be the same - to retry. However, the document makes an interesting comment only about Internal Server Error:

You might encounter internal server errors while working with items. These are expected during the lifetime of a table. Any failed requests can be retried immediately. When you receive a status code 500 on a write operation, the operation may have succeeded or failed. If the write operation is a TransactWriteItem request, then it is OK to retry the operation. If the write operation is a single-item write request such as PutItem, UpdateItem, or DeleteItem, then your application should read the state of the item before retrying the operation, and/or use Condition expressions to ensure that the item remains in a correct state after retrying regardless of whether the prior operation succeeded or failed. If idempotency is a requirement for the write operation, please use TransactWriteItem, which supports idempotent requests by automatically specifying a ClientRequestToken to disambiguate multiple attempts to perform the same action.

In other words, the "503 Service Unavailable" error suggests the service was completely unavailable - your request was completely ignored and never done, and you should retry it, whereas "Internal Server Error" says that there was an error in the middle of processing your request, and it is not known whether your request succeeded or not.

Our timeout errors are more like Internal Server Error in this sense - it is possible (though not very likely) that the user received a timeout on a write when it has actually already succeeded, or in some way partially succeeded. If this is the case, it's not clear that 503 Service Unavailable is the right message.

In the past there has been some effort in Scylla to do "load shedding" in interactive workloads (see https://github.com/scylladb/scylladb/pull/8680): When the system is overloaded with interactive (unlimited concurrency) requests, don't continue fail requests after working on them for 2 seconds, but rather fail them immediately, with a separate "OVERLOADED" error. Such a request was never started, and can reasonably return 503 Service Unavailable in Alternator.

Note that I'm not completely shooting down your idea - I'm not the foremost expert in load shedding or overload handling, and I'm open to hear your idea and reasoning why you think your idea of converting timeouts to 503 errors is a good one.

Felix-zhoux commented 11 months ago

In fact, we want to use timeout exception as an evaluation indicator of service overload. We want to do more than dynamodb. For example, refer to aws S3 which may return 503 slow down when the service is overloaded(https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html#ErrorCodeList). Currently, slowdown is not defined in seastar http message, but 503 is defined. I think there is a certain consensus in the industry that 503 is used as slow down.

For the following two scenarios, we hope that the business can be treated differently:

For example, if two of the three nodes are down, an error 500 'InternalServerError Cannot achieve consistency level for cl' will be reported when writing. This kind of retry is unnecessary.
For timeout exceptions caused by network jitters, we can consider retrying. If all are judged by http msg instead of http code, the program code will not be concise.

nyh commented 11 months ago

In fact, we want to use timeout exception as an evaluation indicator of service overload. We want to do more than dynamodb. For example, refer to aws S3 which may return 503 slow down when the service is overloaded(https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html#ErrorCodeList). Currently, slowdown is not defined in seastar http message, but 503 is defined. I think there is a certain consensus in the industry that 503 is used as slow down.

I don't disagree, but the specific page you linked to talks about AWS Lambdas, not DynamoDB, and just one line up there is a "500 Internal Service Error" of "LambdaTimeout", suggesting they are also not sure if it's good or bad that timeouts should cause 500.

Anyway, none of this is related to Seastar's HTTP server, which can return any error code and any error string along with it. It is the DynamoDB protocol, and specifically https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Programming.Errors.html, which says which errors we should use when, and what a client (such as Amazon's AWS SDK) does when receiving each type of error.

For the following two scenarios, we hope that the business can be treated differently:
1. For example, if two of the three nodes are down, an error `500 'InternalServerError Cannot achieve consistency level for cl' ` will be reported when writing. This kind of retry is unnecessary.

Why is it unnecessary? Usually the application would want assume the failure is temporary and retry - until the application decides to time out. If the two nodes will be back in 5 seconds, wouldn't you want the application to hang for 5 seconds and then retry, instead of fail and stop?

I think the thinking in the DynamoDB protocol is that there are two types of errors: One is a permanent error, such as a syntax error in the request or attempt to use a table that doesn't exist. If you retry this query a thousand times, or retry it next year, it will still fail - because the query is wrong. In such a case, https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Programming.Errors.html says: "OK to retry? No".

The other type of error is a transient error - overload, network problem, node down, software glitch, etc. If you retry this query in a second, everything might be back to normal. In this case the doc says "OK to retry? Yes".

2. For `timeout exceptions` caused by network jitters, we can consider retrying.
   If all are judged by `http msg` instead of `http code`, the program code will not be concise.

Are you writing your own library to connect to DynamoDB or ScyllaDB Alternator, and want it to handle different types of errors differently, or are you suggesting that Amazon's official "AWS SDK" does this?

scylladb / scylladb

Is it more reasonable to return 'Service Unavailable' than Internal Server Error when the server is overloaded? #16262