Large file uploads fail when using TLS and Cowboy 2.7+

Shemeikka commented 3 years ago

Hi,

We are experiencing failed file uploads when using TLS and Cowboy 2.7+ (2.7 - 2.9.0). Large file uploads (1GB+) will be shown as failed and stalled in Chrome and Firefox, and when using Curl, the response is curl: (56) Unexpected EOF.

Uploading smaller files about 100MB works with TLS and uploading large files works if TLS is not used. However, if we use TLS and Cowboy 2.6.3, then uploading large files works.

Test file can be generated using command fallocate -l 10G gentoo_root.img. Uploading the file can be done using command curl -F 'file=@gentoo_root.img' <URL>.

How we configure Cowboy in our Elixir project. Certificates are generated using mkcert.

# Configuring Cowboy in Elixir project's application.ex
{Plug.Cowboy,
scheme: server_options[:scheme],
plug: Knlgw.Endpoint,
options: [
    ip: server_host_ip,
    port: server_port,
    certfile: Keyword.get(server_config, :cert_file),
    keyfile: Keyword.get(server_config, :key_file),
    cacertfile: Keyword.get(server_config, :ca_cert_file),
    cipher_suite: :strong,
  ]
}

We use Plug.Parser for reading the multipart file upload. Those options i.e. length and others, don't have any effect on this problem.

# Use Plug.Parser.Multipart for parsing the request body if it's multipart
plug(Plug.Parsers,
  parsers: [
    {:multipart,
     length: 2_000_000_000, read_length: 5_000_000, read_timeout: 60_000, send_timeout: 10_000}
  ],
  pass: ["*/*"]
)

Versions:


hexpm/elixir:1.11.2-erlang-23.1.5-alpine-3.12.1

cowboy 2.9.0
plug_cowboy 2.5.0

Also tried with Erlang 23.3 versions and same experience.

essen commented 3 years ago

Would be good to trace the Cowboy connection process to see what it's doing. Or get a process_info(Pid) as a start.

Shemeikka commented 3 years ago

Hi,

Here's a process info for the cowboy supervisor

[
  registered_name: :cowboy_sup,
  current_function: {:gen_server, :loop, 7},
  initial_call: {:proc_lib, :init_p, 5},
  status: :waiting,
  message_queue_len: 0,
  links: [#PID<0.240.0>, #PID<0.242.0>],
  dictionary: [
    "$ancestors": [#PID<0.240.0>],
    "$initial_call": {:supervisor, :cowboy_sup, 1}
  ],
  trap_exit: true,
  error_handler: :error_handler,
  priority: :normal,
  group_leader: #PID<0.239.0>,
  total_heap_size: 986,
  heap_size: 610,
  stack_size: 12,
  reductions: 1530,
  garbage_collection: [
    max_heap_size: %{error_logger: true, kill: true, size: 0},
    min_bin_vheap_size: 46422,
    min_heap_size: 233,
    fullsweep_after: 65535,
    minor_gcs: 4
  ],
  suspending: []
]

Only change to the process info during the failing upload is slightly larger 'reductions'. Don't know how I could get process info for the actual connection process since the connection closes usually within 2-3 seconds.

Here's an link to a Gist which has output of Observer trace of Cowboy during the failed upload. I had to remove large amount of lines, mostly binary data, from the output.

essen commented 3 years ago

No not cowboy_sup, the connection process. If it gets stuck in a function it'll tell you which but you have to find the process first. But nevermind that, it sounds like the connection is being closed, rather than it being stuck?

If it's a crash then you're going to have a crash report. If it's "normal" then you won't. Your best bet then would be to investigate what happens in terminate, it'll give you the close reason.

Shemeikka commented 3 years ago

Should the tracing show if the connection process closes? The linked Gist only shows when the connection is spawned.

Any ideas how to catch cowboy connection process termination? I'm not receiving any crash reports or other messages to log when using standard plug_cowboy with added logging.

I have tried logging with plug by setting setting

config :plug_cowboy,
    log_exceptions_with_status_code: [1..599]

I also tried using PhoenixCowboyLogging which uses protocol_options: :onresponse but that didn't have any effect.

essen commented 3 years ago

OK I have missed the gist. You can trace for proc events exclusively. That should be enough. No need for calls or messages.

Shemeikka commented 3 years ago

Hi,

Here's a link to a Gist. This time it's more interesting than the previous.

Here's a process info of PID 0.3924.0 during the upload. Not sure if that's the connection process.

> Process.info(pid("0.3924.0"))
[
  current_function: {:gen_statem, :loop_receive, 3},
  initial_call: {:proc_lib, :init_p, 5},
  status: :waiting,
  message_queue_len: 0,
  links: [#PID<0.3923.0>, #Port<0.704>, #PID<0.132.0>],
  dictionary: [
    "$initial_call": {:ssl_gen_statem, :init, 1},
    ssl_pem_cache: :ssl_pem_cache,
    "$ancestors": [:tls_connection_sup, :tls_sup, :ssl_connection_sup, :ssl_sup,
     #PID<0.121.0>],
    ssl_manager: :ssl_manager
  ],
  trap_exit: true,
  error_handler: :error_handler,
  priority: :normal,
  group_leader: #PID<0.120.0>,
  total_heap_size: 15143,
  heap_size: 4185,
  stack_size: 18,
  reductions: 29117680,
  garbage_collection: [
    max_heap_size: %{error_logger: true, kill: true, size: 0},
    min_bin_vheap_size: 46422,
    min_heap_size: 233,
    fullsweep_after: 65535,
    minor_gcs: 1350
  ],
  suspending: []
]

I did some more testing and noticed that this upload failure happens only when using HTTP/2 with TLS. Uploading large files works when client forces the use of HTTP/1.1 with TLS.

essen commented 3 years ago

There might have been interesting events but they have been dropped. The exit reason is normal and that doesn't help.

OK then I suppose figuring out what cowboy_http2 is doing is the next step. So again a trace this time dbg:tpl(cowboy_http2, []). dbg:p(all, c). - trace to file, and see at what point it goes into the terminate functions. What happens immediately before is likely the reason why the upload fails.

Shemeikka commented 3 years ago

Hi,

I had to try several different settings due to the large amount of data but I finally got something interesting

08:12:30:156917 (<0.525.0>) in {code_server,call,1}
08:12:30:156987 (<0.525.0>) spawn <0.530.0> as erlang:apply(#Fun<gen_statem.0.120824088>,[])
08:12:30:156999 (<0.530.0>) spawned <0.525.0> {erlang,apply,[#Fun<gen_statem.0.120824088>,[]]}
08:12:30:157010 (<0.525.0>) out {gen_statem,call_clean,4}
08:12:30:157177 (<0.527.0>) in {cowboy_http2,terminate_linger_loop,3}
08:12:30:157200 (<0.527.0>) exit {shutdown,{connection_error,enhance_your_calm,
                            'Frame rate larger than configuration allows. Flood? (CVE-2019-9512, CVE-2019-9515, CVE-2019-9518)'}}
08:12:30:157210 (<0.527.0>) out_exited 0
08:12:30:157291 (<0.530.0>) in {erlang,apply,2}
08:12:30:157298 (<0.530.0>) out {gen,do_call,4}
08:12:30:157315 (<0.525.0>) in {gen_statem,call_clean,4}
08:12:30:157321 (<0.525.0>) out {gen_statem,call_clean,4}
08:12:30:157329 (<0.524.0>) in {gen_statem,loop_receive,3}
08:12:30:157363 (<0.524.0>) out {gen_statem,loop_receive,3}
08:12:30:157369 (<0.530.0>) in {gen,do_call,4}
08:12:30:157374 (<0.530.0>) exit normal
08:12:30:157377 (<0.530.0>) out_exited 0
08:12:30:157383 (<0.524.0>) in {gen_statem,loop_receive,3}
08:12:30:157386 (<0.524.0>) out {gen_statem,loop_receive,3}
08:12:30:157390 (<0.525.0>) in {gen_statem,call_clean,4}
08:12:30:157407 (<0.525.0>) spawn <0.531.0> as erlang:apply(#Fun<tls_connection.0.3952457>,[])
08:12:30:157410 (<0.531.0>) spawned <0.525.0> {erlang,apply,[#Fun<tls_connection.0.3952457>,[]]}
08:12:30:157417 (<0.525.0>) exit {shutdown,peer_close}
08:12:30:157422 (<0.525.0>) out_exited 0
08:12:30:157457 (<0.531.0>) in {erlang,apply,2}
08:12:30:157463 (<0.531.0>) out {tls_connection,'-ensure_sender_terminate/2-fun-0-',1}
08:12:30:157467 (<0.524.0>) in {gen_statem,loop_receive,3}
08:12:30:157472 (<0.524.0>) exit {shutdown,peer_close}
08:12:30:157478 (<0.524.0>) out_exited 0
08:12:35:158429 (<0.531.0>) in {tls_connection,'-ensure_sender_terminate/2-fun-0-',1}
08:12:35:158452 (<0.531.0>) exit normal
08:12:35:158457 (<0.531.0>) out_exited 0

This seems to be same problem as in #1497 ?

I tried setting max_received_frame like protocol_options: [max_received_frame_rate: {150_000, 10_000}] and with that value, the upload works with HTTP/2 and TLS.

Our use-case should be able to handle large files (10-20GB) and I'm thinking if we enable large (500_000+?) values for max_received_frame, then we are vulnerable against those mentioned CVEs. Are there any other mitigations in Cowboy for those CVEs?

essen commented 3 years ago

I guess we should increase it again. I had increased to 10k/10s but if that's not enough then it should be increased again.

Those CVEs are not super important, especially in Erlang, and it's basically impossible to get a value that works out for everyone. So I think we could increase the default to whatever works out for you.

Alternatively, increasing max_frame_size_received should reduce the number of frames you're getting. The default is very conservative (it's the protocol's default). It might make sense to provide a different default for Cowboy.

A third solution would be to produce a warning log for these types of connection errors (enhance_your_calm) with the message included and perhaps some help to remedy to this situation.

Shemeikka commented 3 years ago

Okay, I think it's enough if we just set both max_received_frame and max_frame_size_received to some good values in our application. Will have to test to figure out what those values could be.

linchengcs commented 2 years ago

I once had the a similar issue, and fixed it by setting idle_timeout larger. I guess the connection terminated before the file transfer finished.

essen commented 9 months ago

Hello everyone, I am looking for feedback as to what the max_received_frame and max_frame_size_received values should be. What did you end up using? Thank you.

For idle_timeout there is now an option reset_idle_timeout_on_send that should help. It will default to false for backward compatibility reasons but the default will likely change in 3.0.

essen commented 9 months ago

Nginx allows 16MB (the maximum possible): https://github.com/nginx/nginx/blob/6ceef19/src/http/v2/ngx_http_v2.h#L21 and https://github.com/nginx/nginx/blob/6ceef19/src/http/v2/ngx_http_v2.c#L2758

I don't think we want to increase it that much but we can definitely use a higher default value. If we increase it to 1MB then this doesn't invalidate the CVE protection and we can still receive large files. I think I will also increase the max_receive_frame_sent so that we send responses faster by default. Using 1MB (1048576) for both.

essen commented 9 months ago

OK the max frame size sent is already infinity so nothing to do there. For the max_frame_size_received I will only update the manual to indicate the importance of higher values. The default will have to wait for Cowboy 2.12 though because I am not updating Cowlib in 2.11. So this ticket will be left open for now.

essen commented 9 months ago

Documentation update is done. Leaving this ticket open, thanks!

ninenines / cowboy

Large file uploads fail when using TLS and Cowboy 2.7+ #1523