ninenines / gun

HTTP/1.1, HTTP/2, Websocket client (and more) for Erlang/OTP.
ISC License
891 stars 232 forks source link

Check max_concurrent_streams before trying to open a new http2 stream #246

Closed bobergj closed 3 years ago

bobergj commented 3 years ago

According to the http/2 spec, https://tools.ietf.org/html/rfc7540:

5.1.2. Stream Concurrency A peer can limit the number of concurrently active streams using the SETTINGS_MAX_CONCURRENT_STREAMS parameter ... Endpoints MUST NOT exceed the limit set by their peer.

Also:

6.5.2. Defined SETTINGS Parameters SETTINGS_MAX_CONCURRENT_STREAMS (0x3) It is recommended that this value be no smaller than 100, so as to not unnecessarily limit parallelism.

The SETTINGS_MAX_CONCURRENT_STREAMS parameter, just like the other settings, can change during the duration of a connection.

Now, Apple's HTTP/2 APNS service is slightly unusual in that it sets the SETTINGS_MAX_CONCURRENT_STREAMS to 1 initially. After an initial, successfully authenticated, request on the connection, this setting value is raised, typically to 1000.

If you perform two gun:post requests simultaneously on a newly opened connection to the Apple APNS service, the second one will fail with a stream_error. Worse, since gun does not check the stream count against the setting before opening a new stream, the second stream open attempt hits the remote peer before being rejected. Since HTTP spec recommendation is to have a higher initial value of SETTINGS_MAX_CONCURRENT_STREAMS, this issue won't typically be seen at low levels of load. But if you use gun in a server that is under high load, it's not unthinkable that 101 processes are waiting for a connection to open to perform a request. Then this could fail also for the http2 spec recommended initial SETTINGS_MAX_CONCURRENT_STREAMS value of 100.

I have opened a proof of concept PR https://github.com/ninenines/gun/pull/245 that checks the max_concurrent_streams before opening a new stream. The PR may solve part of the problem in that we don't hit the remote peer for streams over the limit. But also there's may a more profound issue here since the gun API design implies that you can perform requests "without worrying", while the gun process will ensure there is an open connection. But since gun internally doesn't queue up requests to wait for an available stream slot, as it does for a connection, you actually have to worry a bit.

Sample code (non including all dependencies, only for purposes of illustrating API usage):

test_push(DeviceId) ->
    {ok, ConnPid} = gun:open("api.development.push.apple.com", 443),
    {ok, _Protocol} = gun:await_up(ConnPid),
    RequestBody = "{ \"aps\" : { \"alert\" : \"Hello\" } }",
    spawn_link(fun() ->
        StreamRef = gun:post(ConnPid, 
            [<<"/3/device/">>, DeviceId], 
            [
                {<<"authorization">>, [<<"bearer ">>, auth_token()]},
                {<<"apns-topic">>, <<"test">>},
                {<<"apns-expiration">>, <<"0">>}
            ],
            RequestBody),
            case gun:await(ConnPid, StreamRef) of
                {response, fin, _Status, _Headers} ->
                    no_data;
                {response, nofin, _Status, _Headers} ->
                    {ok, Resp1} = gun:await_body(ConnPid, StreamRef),
                    io:format("~p~n", [Resp1])
            end
    end),
    spawn_link(fun() -> 
        StreamRef2 = gun:post(ConnPid, 
            [<<"/3/device/">>, DeviceId], 
            [
                {<<"authorization">>, [<<"bearer ">>, auth_token()]},
                {<<"apns-topic">>, <<"test">>},
                {<<"apns-expiration">>, <<"0">>}
            ],
            RequestBody),
            case gun:await(ConnPid, StreamRef2) of
                {response, fin, _Status2, _Headers2} ->
                    no_data;
                {response, nofin, _Status2, _Headers2} ->
                    Resp2 = gun:await_body(ConnPid, StreamRef2),
                    io:format("~p~n", [Resp2])
        end
    end).

Fails with:

Error in process <0.1080.0> with exit value:
{{case_clause,
     {error,
         {stream_error,
             {stream_error,refused_stream,'Stream reset by server.'}}}},
essen commented 3 years ago

The solution in the PR is wrong because it checks against options and not against the current settings for the connection. The real fix should be made in cow_http2_machine (or in both Gun and there, if necessary) because it has access to the current settings.

I don't think it's a problem not to queue up, because if we did there would be the tough question of what to do when disconnects occur, because not all requests can be safely replayed or safely executed right after previous requests have failed to complete. Therefore if we did queue up we would have to stream_error on disconnect. Therefore I think we should just send a stream_error to the user in all cases including when the user is trying to send too many requests. Only in this case the stream_error occurs as a result of a protocol constraint and the request is never sent to the server.

bobergj commented 3 years ago

The solution in the PR is wrong

Indeed, looking at cow_http2_machine again, the PR doesn't make any sense. Closed.

Therefore if we did queue up we would have to stream_error on disconnect

A stream_error on disconnect would be expected, and not an issue (at least not in my use case). I mean, for the connection to suddenly close is not on the normal, happy path.

when the user is trying to send too many requests

The problem is that there is not a way to know what is too many requests in advance, as the SETTINGS_MAX_CONCURRENT_STREAMS is variable. The http2 spec even allows for a peer to lower this setting during a connections lifetime. If assuming a value of 100 (the http2 spec recommendation), that wouldn't use all the available connection throughput on servers that allow for more concurrent streams.

Therefore I think we should just send a stream_error to the user in all cases including when the user is trying to send too many requests

Some kind of mechanism to request a stream, and wait for it to be available, would be very convenient. Repeating the example from above, imagine there is a sudden load spike, and 101 Erlang processes who are trying to send a request over a connection that has a peer setting SETTINGS_MAX_CONCURRENT_STREAMS of 100. Now, the 101'th process will get the stream_error. The only thing this process can do to progress successfully - rather than propagate the error - is to busy-wait and retry N times. If it could instead ask gun to give it a stream, as soon as available (with a timeout parameter, naturally), there would be back-pressure throughout the system without any polling.

essen commented 3 years ago

The problem with the max setting is that it can change at any time and therefore even if we tell the user process about it this wouldn't guarantee that a stream is available by the time the request message is sent. You could still get a stream_error even after Gun tells you there's a stream available if the server lowers the max in-between.

A mechanism to request a stream would have the same problems.

Anyway I think it's best to focus on practical applications. Right now your APNS case is the only occurence of such a problem, and therefore it's best to focus on that. In your case the solution could be to do a single request immediately after opening the connection before allowing the connection to be used for more. In effect you'd have an intermediate state where the connection can only be used once, before it becomes usable for concurrent requests.

The stream_error should probably not be sent by the server but by the client directly because it saves a round trip. Also the error would most likely be clearer. However it's a special case so it might not be worth doing, and also some people might still want to send more requests than permitted. So I'm not sure about that.

It could be useful to expose the current settings via gun:info or other. Maybe have a settings_changed event as well.

bobergj commented 3 years ago

@essen:

Anyway I think it's best to focus on practical applications

While I appreciate your suggestions for how to solve my specific apns problem, I opened this issue more from a http2 spec point of view.

Right now your APNS case is the only occurrence of such a problem, and therefore it's best to focus on that

Sorry, I confounded multiple different issues in this discussion:

  1. gun not limiting the number of streams to SETTINGS_MAX_CONCURRENT_STREAMS. Note that this is a MUST in the RFC, and some server implementations will even error the whole connection with a PROTOCOL_ERROR when the client doesn't respect the setting.
  2. gun API contract with regards to queuing vs raising an error on reaching the remote peer SETTINGS_MAX_CONCURRENT_STREAMS
  3. The lack of a gun public API that exposes the current SETTINGS_MAX_CONCURRENT_STREAMS setting value
  4. The peculiarity of the APNS server of having an initial SETTINGS_MAX_CONCURRENT_STREAMS of 1

I agree that 4) here is only really relevant to APNS. Virtually every other service tend to set the limit to the recommenced 100 or higher. It seems to me that the other issues, in particular 1), is relevant for virtually any heavy-duty use of this library.

You could still get a stream_error even after Gun tells you there's a stream available if the server lowers the max in-between. A mechanism to request a stream would have the same problems.

Actually there are some provisions to solve this particular problem in the HTTP/2 spec, under 5.1.2. Stream Concurrency. And these provisions are implemented in actual HTTP/2 servers. For example, the Go net/http2 server: https://go.googlesource.com/net/+/master/http2/server.go If the SETTINGS frame has been acked it treats streams over the limit as a PROTOCOL_ERROR, but if the SETTINGS frame ack has not yet been received, it gives back a REFUSED_STREAM, and the latter error is retriable by the client.

It could be useful to expose the current settings via gun:info or other

When you have a moment, please check what you think about https://github.com/ninenines/gun/pull/247 and https://github.com/ninenines/cowlib/pull/107 which adds the remote settings to gun:info.

essen commented 3 years ago

There's no doubt that it's something that will need to be better handled eventually, but there's a lot of things that need to be better handled (or just handled in any shape or form) and until someone runs into this it's not worth doing much about it. Not to mention the better insight we get from real world feedback.

You could still get a stream_error even after Gun tells you there's a stream available if the server lowers the max in-between. A mechanism to request a stream would have the same problems.

Actually there are some provisions to solve this particular problem in the HTTP/2 spec, under 5.1.2. Stream Concurrency. And these provisions are implemented in actual HTTP/2 servers. For example, the Go net/http2 server: https://go.googlesource.com/net/+/master/http2/server.go If the SETTINGS frame has been acked it treats streams over the limit as a PROTOCOL_ERROR, but if the SETTINGS frame ack has not yet been received, it gives back a REFUSED_STREAM, and the latter error is retriable by the client.

You missed my point. There's no problem in that area with Gun. The problem comes from the Erlang side of things. You may ask for a stream or the number of streams available or whatever, and then Gun receives a settings change via the socket before you actually create the stream on the wire.

If streams were reserved, waiting for them to be used before sending the ack would just lead to potential timeouts. If you checked the setting, there's not much Gun could do about you sending too many streams.

The solution is probably to postpone the relevant message(s) until a stream is freed, but it's much easier said than done. I don't think we can use the gen_statem interface for that. Also since the problem will also apply to HTTP/3 the solution is most likely not specific to HTTP/2.

essen commented 3 years ago

The #{notify_settings_changed => true} HTTP/2 option will make Gun send a {gun_notify, settings_changed, AllSettings} message when settings are changing. I believe this solves the more immediate problem, and it also solves a similar problem about Websocket over HTTP/2 where we have to wait for settings to be changed to initiate a Websocket stream.

I'll close this for now, and if it turns out that more is needed, like a guarantee that a reserved stream "succeeds", please open a new ticket. Thanks!

Note that the gun:info changes are still interesting to have, I'll get to them soon.