ninenines / gun

HTTP/1.1, HTTP/2, Websocket client (and more) for Erlang/OTP.
ISC License

Degraded HTTP/2 response time after upgrading from 1.3.0 to 2.0.1 #321

Open · Vkutovoy92 opened this issue 7 months ago

Vkutovoy92 commented 7 months ago

Hello! We are seeing a huge increase in response time after updating from 1.3.0 to 2.0.1. My client config:

Opts = #{
    connect_timeout => 5000,
    retry => 100,
    retry_timeout => 1000 % 1s
},
Args = [
    {size, PoolSize},
    {start_mfa, {gun, start_link, [self(), Url, 443, Opts]}},
    {supervisor_period, 1},
    {supervisor_intensity, 1000},
    {supervisor_restart, permanent}
],
Result = erlpool:start_pool(PoolName, Args),

And the request:

StreamRef = gun:post(ConnPid, Path, [
    {<<"content-type">>, "application/json"},
    {<<"content-length">>, integer_to_binary(size(ReqBody))}
], ReqBody),

GunResp = gun:await(ConnPid, StreamRef, ?TIMEOUT),

Maybe we have to add some extra options that weren't required in 1.3?

essen commented 7 months ago

Hello, I'm not sure what you mean by "downgrade" here or what you're measuring exactly. I don't think there are any changes that require configuration, although if you're connecting over HTTP/2 some settings are best tweaked.

Vkutovoy92 commented 7 months ago

Response time from the services has increased. I changed only 1.3.0 to 2.0.1, without any settings, and got worse percentiles. No changes except the Gun version.

I can share graphs.

essen commented 7 months ago

I need a way to reproduce this, but having data would help me understand what this is about, yes.

Vkutovoy92 commented 7 months ago

There are 2 services on AWS using HTTP/2 with Gun. The response time of the 2nd service grew after updating the 1st service to Gun 2.0.1. [Screenshot 2023-11-17 13:10:46: response-time graph; the update to 2.0.1 happened at 20:00.]

Vkutovoy92 commented 7 months ago

So, in percentiles: BEFORE [Screenshot 2023-11-17 13:13:48], AFTER [Screenshot 2023-11-17 13:13:46]. No changes except updating the client to the new Gun.

essen commented 7 months ago

https://github.com/ninenines/gun/commit/4194682d4edaee3da34783c46a513698eb1e8d05 was meant to improve HTTP/2 performance when receiving larger bodies. But perhaps this had a negative impact in your case.

What is the size of the body you send, and what is the size of the body you receive (roughly)?

Another change is that send_timeout is now enabled by default.

There may be a few other things. If you can try different Gun commits, it could help identify when things started getting worse. I could provide a few interesting commits to upgrade to and see what happens.

Vkutovoy92 commented 7 months ago

Per-request sizes, as (received_bytes, sent_bytes) pairs:

(4004, 4687)  (586, 735)   (588, 737)   (149, 296)   (992, 1213)
(200, 299)    (592, 740)   (6036, 7050) (568, 716)   (1110, 1324)
(265, 366)    (330, 454)   (1759, 2071) (149, 296)   (825, 1003)
(149, 294)    (581, 722)   (187, 285)   (270, 368)   (184, 283)
(211, 301)    (287, 384)   (828, 1003)  (211, 301)   (998, 1213)
(4123, 4746)  (630, 750)   (669, 817)   (283, 384)   (149, 295)
(149, 296)    (149, 295)   (1072, 1286) (479, 587)   (263, 361)
(4815, 5518)  (149, 295)   (149, 295)   (265, 366)   (149, 295)
(1048, 1236)  (149, 295)   (2676, 2969) (2135, 2067) (1108, 1319)
(517, 661)    (149, 294)   (671, 838)   (149, 297)   (149, 297)
(265, 366)    (149, 295)   (479, 587)   (1581, 1789) (225, 324)

essen commented 7 months ago

Yeah, small. So it's possible the higher default is causing trouble for the server. Try setting the http2_opts initial_connection_window_size and initial_stream_window_size to their default values (65535). Or, if you don't want to upgrade to test, try changing the default in 1.3.0 to 8000000 and see if that makes things worse.

Vkutovoy92 commented 7 months ago

And another service, as (received_bytes, sent_bytes) pairs:

(558, 2193)  (867, 2983)  (343, 1959)  (872, 2997)  (868, 3152)
(379, 548)   (868, 3188)  (869, 3244)  (195, 718)   (417, 2354)
(493, 683)   (676, 2415)  (208, 270)   (498, 2838)  (871, 3292)
(522, 2471)  (634, 2351)  (204, 503)   (645, 1326)  (343, 1798)
(557, 2113)  (923, 3681)  (340, 1797)  (198, 709)   (177, 272)
(197, 722)   (618, 2760)  (535, 2502)  (494, 684)   (642, 2375)
(372, 914)   (496, 686)   (372, 546)   (375, 893)   (197, 783)
(198, 703)   (509, 2728)  (198, 718)   (721, 3105)  (721, 3033)
(557, 2135)  (536, 2729)  (535, 2744)

Vkutovoy92 commented 7 months ago

> Yeah, small. So it's possible the higher default is causing trouble for the server. Try setting the http2_opts initial_connection_window_size and initial_stream_window_size to their default values (65535). Or, if you don't want to upgrade to test, try changing the default in 1.3.0 to 8000000 and see if that makes things worse.

But I don't see initial_stream_window_size or initial_connection_window_size in the Gun options:

-type opts() :: #{
    connect_timeout => timeout(),
    http_opts       => http_opts(),
    http2_opts      => http2_opts(),
    protocols       => [http | http2],
    retry           => non_neg_integer(),
    retry_timeout   => pos_integer(),
    trace           => boolean(),
    transport       => tcp | tls | ssl,
    transport_opts  => [gen_tcp:connect_option()] | [ssl:connect_option()],
    ws_opts         => ws_opts()
}.

and http2_opts:

-type http2_opts() :: #{
    keepalive => timeout()
}.

essen commented 7 months ago

Right, it's not available in 1.3, sorry. I guess the only way to properly test is upgrading to 2.0 and setting the option there to 65535.

#{ http2_opts => #{ initial_connection_window_size => 65535, initial_stream_window_size => 65535 }}

Vkutovoy92 commented 7 months ago

> Right, it's not available in 1.3, sorry. I guess the only way to properly test is upgrading to 2.0 and setting the option there to 65535.
>
> #{ http2_opts => #{ initial_connection_window_size => 65535, initial_stream_window_size => 65535 }}

Thanks! I'll try it later! Do I have to add any additional options to

#{
    http2_opts => #{
        keepalive => 60 * 1000,
        initial_connection_window_size => 65535,
        initial_stream_window_size => 65535
    },
    connect_timeout => 5000,
    retry => 100,
    retry_timeout => 1000 % 1s
},

besides yours?

essen commented 7 months ago

Go with that for now and let's see.

Vkutovoy92 commented 7 months ago

> initial_connection_window_size => 65535, initial_stream_window_size => 65535

It really works! Much better with the options!

First, 1.3.0: [Screenshot 2023-11-17 19:34:47]

After, 2.0.1: [Screenshot 2023-11-17 19:34:54]

essen commented 7 months ago

Glad to hear it. I didn't expect that increasing the default value would have such a negative impact. I will need to do further experiments and perhaps change either the value or the related algorithm to better handle both cases (small bodies and large bodies). Might be worth looking at what browsers are doing too.

Vkutovoy92 commented 7 months ago

> Glad to hear it. I didn't expect that increasing the default value would have such a negative impact. I will need to do further experiments and perhaps change either the value or the related algorithm to better handle both cases (small bodies and large bodies). Might be worth looking at what browsers are doing too.

Another example

1.3.0: [Screenshot 2023-11-17 20:00:31]

2.0.1, default options: [Screenshot 2023-11-17 20:00:36]

2.0.1 + extra options: [Screenshot 2023-11-17 20:00:41]

So you can see that 2.0.1 without the extra options really degrades response time.

Vkutovoy92 commented 7 months ago

Maybe you can advise how to choose these values correctly? What's the rule?

Vkutovoy92 commented 7 months ago

I have a service that works correctly with the default Gun settings; I'll give you the body sizes on Monday. It's really interesting: no degradation at all.

essen commented 7 months ago

> Maybe you can advise how to choose these values correctly? What's the rule?

It's just a control for how much memory you are willing to accept using, with the downside that the lower the value (and the lower the memory usage), the lower the performance. Gun defaults those values to 8MB to favor performance when receiving large bodies. But in Gun, setting this value to 8MB has no effect on its own: no buffer gets allocated immediately. It just means there can be roughly 8MB in transit at once between your application and the server.

Clearly the server you are connected to doesn't seem to like that, though. Maybe the service that runs well is connected to a different server. Or perhaps it's an issue related to shared resources or bandwidth. Figuring out the real cause is likely to be difficult.
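
To make that accounting concrete, here is a toy sketch of a flow-control window as pure bookkeeping (hypothetical module and function names, not Gun's API): nothing is allocated up front, and the window only caps how many body bytes may be in transit before a WINDOW_UPDATE frees up budget.

-module(flow_window).
-export([new/0, receive_data/2, acknowledge/2]).

%% The 2.0.1 default window size (8MB), used purely as a budget.
-record(window, {size = 8000000, in_transit = 0}).

new() -> #window{}.

%% Count incoming DATA bytes against the window.
receive_data(#window{size = Size, in_transit = InTransit} = W, Bytes)
  when InTransit + Bytes =< Size ->
    {ok, W#window{in_transit = InTransit + Bytes}};
receive_data(#window{}, _Bytes) ->
    %% The peer exhausted the window and must wait for a WINDOW_UPDATE.
    {error, flow_control_exceeded}.

%% A WINDOW_UPDATE of Bytes frees budget for the peer to send more.
acknowledge(#window{in_transit = InTransit} = W, Bytes) ->
    W#window{in_transit = max(0, InTransit - Bytes)}.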

Vkutovoy92 commented 7 months ago

> Maybe you can advise how to choose these values correctly? What's the rule?
>
> It's just a control for how much memory you are willing to accept using, with the downside that the lower the value (and the lower the memory usage), the lower the performance. Gun defaults those values to 8MB to favor performance when receiving large bodies. But in Gun, setting this value to 8MB has no effect on its own: no buffer gets allocated immediately. It just means there can be roughly 8MB in transit at once between your application and the server.
>
> Clearly the server you are connected to doesn't seem to like that, though. Maybe the service that runs well is connected to a different server. Or perhaps it's an issue related to shared resources or bandwidth. Figuring out the real cause is likely to be difficult.

So is initial_connection_window_size the maximum amount of body data in transit across all streams at once?

And is initial_stream_window_size the maximum for a single stream?

essen commented 7 months ago

Best I redirect you to the spec; see https://datatracker.ietf.org/doc/html/rfc7540#section-6.9.1 for full details about flow control. The two options are the initial values Gun sets for the flow-control windows. Gun then has an algorithm that ensures there's always some space in the window; see cow_http2_machine:ensure_window for the implementation.
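
For intuition, a much-simplified sketch of that window-replenishment idea (illustrative only, not the actual cow_http2_machine code): whenever the remaining window drops below the configured size, announce a WINDOW_UPDATE for the difference so the peer always has room to send.

%% Simplified sketch of the idea behind cow_http2_machine:ensure_window
%% (not the real implementation): top the receive window back up to the
%% configured size whenever it has shrunk.
ensure_window(Window, ConfiguredSize) when Window < ConfiguredSize ->
    %% Increment to announce in a WINDOW_UPDATE frame.
    {update, ConfiguredSize - Window};
ensure_window(_Window, _ConfiguredSize) ->
    ok.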

Vkutovoy92 commented 7 months ago

Morning! Now I get a new error:

    {badmap,{'EXIT',{{badmatch,{error,{stream_error,{closed,{error,closed}}}}}

And the stacktrace points at the line with

    {ok, Body} = gun:await_body(ConnPid, StreamRef, ?TIMEOUT),

Maybe additional options are needed?

essen commented 7 months ago

That just indicates the server closed a connection. Please open a separate ticket with the stacktrace.
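
A defensive pattern for that failure (a sketch using the same ConnPid/StreamRef/?TIMEOUT as above; handle_response is a hypothetical callback) is to match the error tuple instead of asserting {ok, Body}:

%% Sketch: handle the error tuple instead of crashing on a badmatch.
case gun:await_body(ConnPid, StreamRef, ?TIMEOUT) of
    {ok, Body} ->
        handle_response(Body);  %% hypothetical handler
    {error, {stream_error, Reason}} ->
        %% The server closed the stream/connection mid-response.
        {error, Reason};
    {error, timeout} ->
        {error, timeout}
end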

dubrovine commented 7 months ago

> That just indicates the server closed a connection. Please open a separate ticket with the stacktrace.

It's already open: https://github.com/ninenines/gun/issues/291

Actually, that's why I switched back to Gun 1.3.0.

RoadRunnr commented 7 months ago

Have you tried using tcp_opts => [{nodelay, true}] in gun:opts()?

The way the HTTP/2 handler is implemented in Gun means that Gun will issue two separate gen_tcp:send calls for every HTTP/2 request: one for the HEADERS frame and one for the DATA frame. This leads to two TCP segments being sent, and that pattern triggers a bad interaction between the TCP Nagle algorithm and TCP delayed ACK.

The observable effect is typically a 40ms delay per request.
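
Put together with the window sizes from earlier in the thread, the client options might look like the following sketch (assuming a Gun 2.x tcp_opts option as described above; Url is the same hostname as in the original pool config):

%% Sketch: window sizes from earlier in the thread plus nodelay to
%% sidestep the Nagle/delayed-ACK interaction described above.
Opts = #{
    connect_timeout => 5000,
    retry => 100,
    retry_timeout => 1000, % 1s
    tcp_opts => [{nodelay, true}],
    http2_opts => #{
        keepalive => 60 * 1000,
        initial_connection_window_size => 65535,
        initial_stream_window_size => 65535
    }
},
{ok, ConnPid} = gun:open(Url, 443, Opts),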

Vkutovoy92 commented 7 months ago

> Have you tried using tcp_opts => [{nodelay, true}] in gun:opts()?
>
> The way the HTTP/2 handler is implemented in Gun means that Gun will issue two separate gen_tcp:send calls for every HTTP/2 request: one for the HEADERS frame and one for the DATA frame. This leads to two TCP segments being sent, and that pattern triggers a bad interaction between the TCP Nagle algorithm and TCP delayed ACK.
>
> The observable effect is typically a 40ms delay per request.

No, I haven't. I used initial_connection_window_size/initial_stream_window_size and it helped reduce the response time.