open-telemetry / opamp-spec

OpAMP Specification
Apache License 2.0
108 stars 34 forks source link

Should we use HTTP 503 response for throttling? #26

Closed tigrannajaryan closed 2 years ago

tigrannajaryan commented 2 years ago

The spec says that the server can respond with 503 to throttle the client.

However, reading the http response code may not be possible with some WebSocket clients, e.g. for browsers (it is disabled for security reasons).

Should we still support this throttling method or avoid it?

pmm-sumo commented 2 years ago

Also, this is a slightly different reason why being throttled, but should we also mention HTTP 429 Too Many Requests next to 503? (I believe the behavior would be the same as for 503 otherwise)

tigrannajaryan commented 2 years ago

I could never figure out what is the use case for 429 when 503 achieves exactly the same from client's perspective. 429 is also supposed to return exactly the same Retry-After header. The slight nuance seems to be that 503 means server is overloaded and unable to respond properly, while 429 says that it could respond but will not because the client is exceed some sort of quota.

However, I can't think of anything that the Agent could be doing differently in response to 503 and 429, so I don't know what we gain by adding both to the spec. Perhaps the benefit is in helping the troubleshooting from the Agent side (the Agent can log the response code)?

tigrannajaryan commented 2 years ago

I experimented a bit with 503 responses in Go and JS.

In Go Websocket client (WSGorilla) the response code is clearly accessible, so no problem here, it can be used to indicate throttling.

In JS the 503 response results in an "error" event. This event is indistinguishable from a connection error when the server is down. This seems to be OK, since the protocol defines that if the connection cannot be established exponential backoff should be used, which essentially almost is what we also want to happen for 503 responses (minus the ability to specify exact retry interval). So, for JS it also seems to be OK to use this approach.

We do not have another good way to convey the unavailability of the server, unless we mandate the server to accept the Websocket connection and then send a ServerErrorResponse message with UNAVAILABLE error. However sending the ServerErrorResponse message is more expensive than just returning 503, the Server may simply not be able to do that.

The only alternate I can think of is just get rid of 503 response altogether and say that when overloaded the server simply should reject connections. But this would reduce the troubleshootability for other language implementations where the response code is easily accessible.

Given the above I think we should keep 503 response for throttling, since it does not cause problem for JS, is useful for troubleshooting compared to rejecting connection, and is less expensive than the ServerErrorResponse message.

Note that we still do need ServerErrorResponse message response because the overloading situation may occur after the WebSocket connection exists for a while. In this case we don't want to just disconnect since it will result in immediate re-connection, we also can't send an HTTP response code because it is too late, so we need to convey this information via a WebSocket message.

Any thoughts?

tigrannajaryan commented 2 years ago

Coming back to 429. This response is typically tied to a particular client or group of clients. It may not be possible to meaningfully calculate the rate and send a 429 response before the Agent's instance uid is known. That happens after the HTTP response headers are sent, and by then it is too late to try to send 429 response back.

So, in practice it may be impossible to have per-Agent rate limiting and 429 responses tied to that. We may still be able to tie this to auth information which supposedly is available before the WebSocket is established. This auth is another open issue that we need to answer before we can be sure about this.

pmm-sumo commented 2 years ago

Yeah, I believe that 503 is more universal and handling it would be enough. I am just wondering if the specification should be prescriptive whether 429 could be used or not. I imagine that in certain environments there might be some 429 throttling capabilities available already and having agent supporting that would be helpful, especially since the agent would behave the same way as for 503

tigrannajaryan commented 2 years ago

So perhaps we can just say that either 429 or 503 SHOULD be used for throttling and from protocol perspective there is no difference (the Agent SHOULD be ready to handle both, the Server can choose to use one or the other or both as applicable).

pmm-sumo commented 2 years ago

So perhaps we can just say that either 429 or 503 SHOULD be used for throttling and from protocol perspective there is no difference (the Agent SHOULD be ready to handle both, the Server can choose to use one or the other or both as applicable).

Yes, that's exactly what I had on mind. I can prepare a quick PR with that change

For WebSockets, I asked @kkruk-sumo for consultation (we used to use that technology extensively in one of the previous projects), maybe he will have some suggestions here

tigrannajaryan commented 2 years ago

Yes, that's exactly what I had on mind. I can prepare a quick PR with that change

Please do.

tigrannajaryan commented 2 years ago

I think https://github.com/open-telemetry/opamp-spec/pull/37 resolves this. Closing, can reopen if needed.

kkruk-sumo commented 2 years ago

Unfortunately, JS needs a special treatment here and if it's needed to be supported, I would consider to:

  1. Accept connections and close them immediately with some arbitrary closing code, or,
  2. Accept connections (if there are not open yet), send a json indicating the error and retry-after parameter and close them, or,
  3. Close/reject connections and specify a minimum retry interval in the specification that JS clients needs to follow.
tigrannajaryan commented 2 years ago

@kkruk-sumo please clarify what you think will not work for JS with the spec as is. Returning 503 to JS results in a connection "error" event in the browser. The spec already defines what the client needs to do when to do when there is a connection error (exponential backoff). I think that should be sufficient for JS Agents. If you think otherwise please tell.

kkruk-sumo commented 2 years ago

@tigrannajaryan I believe that "The minimum recommended retry interval is 30 seconds." from the spec is enough for JS.