ninenines / gun

HTTP/1.1, HTTP/2, Websocket client (and more) for Erlang/OTP.
ISC License
891 stars 232 forks source link

stream_error with retry #291

Open dubrovine opened 2 years ago

dubrovine commented 2 years ago

Hello

Gun vsn - 2.0.0-rc.2

I faced some stream_error issue with the latest version of gun

connection pool options for gun

Opts2 = Opts#{ connect_timeout => 5000, retry => 200, retry_timeout => 500 % 500ms }

here is the part of the code

`request(PoolName, Method, Path, ReqHeaders, ReqBody)->
    request(PoolName, Method, Path, ReqHeaders, ReqBody, 0, undefined).

request(_, _, _, _, _, 20, ErrLast)->
    ErrLast;
request(PoolName, Method, Path, ReqHeaders, ReqBody, Counter, _)->

    ConnPid   = erlpool:pid(PoolName),
    StreamRef = gun:request(ConnPid, Method, Path, ReqHeaders, ReqBody, #{reply_to => self()}),

    case gun:await(ConnPid, StreamRef, ?TIMEOUT) of

        {response, fin, Status, Headers} ->
            {ok, Status, Headers, <<>>};

        {response, nofin, Status, Headers} ->

            case gun:await_body(ConnPid, StreamRef, ?TIMEOUT) of

                {ok, Body} ->
                    {ok, Status, Headers, Body};

                {error, Err} = ErrLast ->
                    timer:sleep(200),
                    request(PoolName, Method, Path, ReqHeaders, ReqBody, Counter+1, ErrLast)
            end;

        {error, Err} = ErrLast ->
            timer:sleep(200),
            request(PoolName, Method, Path, ReqHeaders, ReqBody, Counter+1, ErrLast)

    end.`

a lot of time I receive the error

` <0.16797.7>@: [HTTP2_AF] ERR_AWAIT 'apns_voip-prod' {error,{stream_error,closed}}

<0.16797.7>@: [HTTP2_AF] ERR_AWAIT 'apns_voip-prod' {error,{stream_error,closed}}

<0.16797.7>@: [HTTP2_AF] ERR_AWAIT 'apns_voip-prod' {error,{stream_error,closed}}

<0.16797.7>@: [HTTP2_AF] ERR_AWAIT 'apns_voip-prod' {error,{stream_error,closed}}

<0.16797.7>@: [HTTP2_AF] ERR_AWAIT 'apns_voip-prod' {error,{stream_error,closed}}

<0.16797.7>@: [HTTP2_AF] ERR_AWAIT 'apns_voip-prod' {error,{stream_error,closed}}

<0.16797.7>@: [HTTP2_AF] ERR_AWAIT 'apns_voip-prod' {error,{stream_error,closed}}

<0.16797.7>@: [HTTP2_AF] ERR_AWAIT 'apns_voip-prod' {error,{stream_error,closed}}

<0.16797.7>@: [HTTP2_AF] ERR_AWAIT 'apns_voip-prod' {error,{stream_error,closed}}

<0.16797.7>@: [HTTP2_AF] ERR_AWAIT 'apns_voip-prod' {error,{stream_error,closed}} `

Yes it is working as it should after retrying but I wanted to ask if it's good decision to make it work this way ?

I also tried to use settings like max_concurent_streams (200) but had no luck (Still reproducing such error messages)

Could you please point me to the right direction to (maybe) avoid this type of messages and re-try function (sometimes I need to execute function 15 times before it pass correctly)

thanks in advance

essen commented 2 years ago

Not enough info I'm afraid. I don't know where this message comes from. Which pool are you using? What server? Any way I can reproduce?

dubrovine commented 2 years ago

Pool is erlPool (10sockets for each PoolName) Server is cowboy 2.9.0 (If I understand you correctly)

`open(PoolName, PoolSize, Host, Port, Opts)->

    Opts2 = Opts#{
        connect_timeout => 5000,
        retry => 200,
        retry_timeout => 500 % 500ms
    },

    PoolRes = 
    erlpool:start_pool(PoolName, [
        {size, PoolSize},
        {group, http2_pools},
        {start_mfa, {gun, start_link, [self(), util:to_list(Host), Port, Opts2]}},
        {supervisor_period, 1},
        {supervisor_intensity, 1000},
        {supervisor_restart, permanent}
    ]),
    case PoolRes of
        ok -> 
            log:info("[HTTP2] START_POOL ~p ~p ~p", [self(), PoolName, PoolSize]),
            ok;
        {error,{already_started,_}} -> 
            log:error("[HTTP2] START_POOL already_started ~p ~p ~p", [self(), PoolName, PoolSize]),
            ok;
        PoolErr ->
            log:error("[HTTP2] START_POOL_ERR ~p ~p ~p ~p", [self(), PoolName, PoolSize, PoolErr]),
            {error, PoolErr}
    end.`

error message I receive directly from gun:await(ConnPid, StreamRef, ?TIMEOUT) unfortunately I can not reproduce the issue locally of on dev environment (only in production env)

we have like 30-50k requests to apns per minute

please tell me if you need more specific information

essen commented 2 years ago

Oh it's against APNS? If I recall they were doing something weird where they only allow 1 stream initially and then increase that to 100 after the first request succeeded. You might be running into that. Try to trace the HTTP/2 settings that Gun receives.

dubrovine commented 2 years ago

Okay thank you. I will close the ticket for now and get back to you when I will be able to trace settings on production env

dubrovine commented 2 years ago

Here is what I found

connected to active production node make some manual tests

OK TEST

`(channelapi@16.38.26.28)6> ConnPid = erlpool:pid('apns_privat24-2.0-voip-prod'). 
<0.2411.0>
(channelapi@16.38.26.28)7> 
(channelapi@16.38.26.28)7> StreamRef = gun:request(ConnPid, <<"POST">>, <<"/3/device/***">>, [{<<"apns-topic">>, <<"ua.hidden">>}, {<<"apns-push-type">>,<<"voip">>}], <<"{\"aps\" : {\"badge\" : 9,\"sound\" : \"bingbong.aiff\"},\"messageID\" : \"ABCDEFGHIJ\"}">>, #{reply_to => self()}).
#Ref<0.1992702093.3131834371.253809>
(channelapi@16.38.26.28)8> gun:info(ConnPid).
#{cookie_store => undefined,intermediaries => [],
  origin_host => "api.push.apple.com",origin_port => 443,
  origin_scheme => <<"https">>,owner => <0.2178.0>,
  protocol => http2,
  sock_ip => {16,38,26,28},
  sock_port => 52616,
  socket =>
      {sslsocket,{gen_tcp,#Port<0.453357>,tls_connection,
                          undefined},
                 [<0.21000.64>,<0.17325.65>]},
  transport => tls}
(channelapi@16.38.26.28)9> gun:stream_info(ConnPid , StreamRef).
{ok,#{ref => #Ref<0.1992702093.3131834371.253809>,
      reply_to => <0.27133.65>,state => running}}
(channelapi@16.38.26.28)10> gun:await(ConnPid, StreamRef, 5000).
{response,fin,200,
          [{<<"apns-id">>,
            <<"016B51EA-EF0D-E5C9-772E-E3FD48F6B1AE">>}]}`

ERROR TEST

`(channelapi@16.38.26.28)6> ConnPid1 = erlpool:pid('apns_privat24-2.0-voip-prod'). 
<0.2413.0>
(channelapi@16.38.26.28)7> 
(channelapi@16.38.26.28)7> StreamRef1 = gun:request(ConnPid1, <<"POST">>, <<"/3/device/***">>, [{<<"apns-topic">>, <<"ua.hidden">>}, {<<"apns-push-type">>,<<"voip">>}], <<"{\"aps\" : {\"badge\" : 9,\"sound\" : \"bingbong.aiff\"},\"messageID\" : \"ABCDEFGHIJ\"}">>, #{reply_to => self()}).
#Ref<0.1992702093.3132620803.58733>
(channelapi@16.38.26.28)8> gun:info(ConnPid1).
#{cookie_store => undefined,intermediaries => [],
  origin_host => "api.push.apple.com",origin_port => 443,
  origin_scheme => <<"https">>,owner => <0.2178.0>,
  protocol => http2,
  sock_ip => {16,38,26,28},
  sock_port => 50478,
  socket =>
      {sslsocket,{gen_tcp,#Port<0.491351>,tls_connection,
                          undefined},
                 [<0.5892.68>,<0.19792.67>]},
  transport => tls}
(channelapi@16.38.26.28)9> gun:stream_info(ConnPid1 , StreamRef1).
{ok,undefined}
(channelapi@16.38.26.28)10> gun:await(ConnPid1, StreamRef1, 5000).
{error,{stream_error,closed}}`

all the data seems like the same except connection pid and stream ref but gun:stream_info respond with undefined

essen commented 7 months ago

All I know about this is that it likely happens because the client was disconnected. Most likely it received an {error, closed} back after trying to do a socket operation. You would get a more verbose stream_error tuple otherwise. I do not know why that happens or what to do about it. Having a trace of what Gun is doing when this happens may help.

Vkutovoy92 commented 7 months ago

Hi, I have the same, with 1.3.0 I had {error,{stop,{goaway,215,no_error,<<>>},'Client is going away.'}} {error,{stop,{goaway,145,no_error,<<>>},'Client is going away.'}} {error,{stop,{goaway,143,no_error,<<>>},'Client is going away.'}}

And after update to 2.0.1 these problems are not exist but I have had new error {stream_error,{closed,{error,closed}}} And this I can catch in condition:

`

{response, nofin, HttpCode = 200,  _Headers} ->
  case gun:await_body(ConnPid, StreamRef, ?TIMEOUT) of
    {ok, Body}  ->
      FinishTime = time_util:militime() - StartTime,
      metrics:instrument(FinishTime, UrlForMetrics, true, HttpCode),
      DecodeBody = json:decode(Body),
      {prepare_response(DecodeBody), FinishTime};
    {error,{stream_error,{closed,{error,closed}}}} = Error ->
      log:error("{"?MODULE_STRING"}  - stream_error, error ~p", [Error]),
      FinishTime = time_util:militime() - StartTime,
      metrics:instrument(FinishTime, UrlForMetrics, false, stream_error),
      ?ErrorResp(FinishTime)
  end;

`

Vkutovoy92 commented 7 months ago

I have noticed that this error turns up on highload services and almost at the same time. 2 screens from 2 different services Снимок экрана 2023-11-20 в 17 13 14

Снимок экрана 2023-11-20 в 17 13 09

essen commented 7 months ago

Could be happening if the network has hiccups. You should see TCP connections getting closed at the same time.