ninenines / gun

HTTP/1.1, HTTP/2, Websocket client (and more) for Erlang/OTP.
ISC License

connection close and HTTPS #150

Closed adrianroe closed 5 years ago

adrianroe commented 6 years ago

The combination of "connection close" and HTTPS causes data loss or {error,{closed,"The connection was lost."}}

There is a very simple repro in this gist

The gist pulls a file from a CDN that is available over HTTP and HTTPS, optionally setting the connection close header. Of the 4 combinations, all work as expected except HTTPS with connection close. That combination almost always fails with the error below, although it can also simply return less data than you would expect (i.e. a good response, but with the data truncated before the end).

Erlang/OTP 19 [erts-8.3] [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V8.3  (abort with ^G)
1> test_close:get(80, true).
140421
140421
140421
140421
140421
140421
140421
140421
140421
140421
ok
2> test_close:get(80, false).
140421
140421
140421
140421
140421
140421
140421
140421
140421
140421
ok
3> test_close:get(443, false).
140421
140421
140421
140421
140421
140421
140421
140421
140421
140421
ok
4> test_close:get(443, true).
** exception error: no match of right hand side value {error,{closed,"The connection was lost."}}
     in function  test_close:get_/4 (src/test_close.erl, line 23)

adrianroe commented 6 years ago

I've tested the exact same connection with e.g. curl and all responses (including connection close with HTTPS) contain the full data...

I'll take a look to see if it is anything obvious, but any thoughts warmly appreciated!

essen commented 6 years ago

Can't reproduce. What Erlang version is this?

adrianroe commented 6 years ago

The current repro is on Erlang 19 - I'll pull it to an Erlang 20 server as well.

The issue is definitely timing related - I can only reliably recreate it when I have a good internet connection (but not localhost!). The current repro is on an AWS server.

adrianroe commented 6 years ago

I have also narrowed the issue down to the interaction with Ranch. If, in loop/1 in gun.erl, I change Transport:setopts(Socket, [{active, once}]) to Transport:setopts(Socket, [{active, true}]), I no longer see the issue. That's obviously not a change that can be made without thought!
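For readers unfamiliar with the pattern, here is a minimal sketch of an {active, once} receive loop. It is not Gun's code: it assumes plain gen_tcp on a loopback socket instead of Gun's Transport module over SSL, the module name active_once_sketch is made up for illustration, and it does not reproduce the bug. It only shows where the socket is re-armed after each message, which is the point where a pending close can race with lingering data in the affected ssl versions.

```erlang
-module(active_once_sketch).
-export([run/0]).

%% Hypothetical illustration only; uses gen_tcp, not Gun or ssl.
run() ->
    %% Loopback listener on an ephemeral port.
    {ok, LSock} = gen_tcp:listen(0, [binary, {active, false}, {reuseaddr, true}]),
    {ok, Port} = inet:port(LSock),
    %% Same size as the file in the report, chosen only for familiarity.
    Payload = binary:copy(<<"x">>, 140421),
    %% The "server" sends the payload and closes, like a connection: close response.
    spawn_link(fun() ->
        {ok, S} = gen_tcp:accept(LSock),
        ok = gen_tcp:send(S, Payload),
        ok = gen_tcp:close(S)
    end),
    {ok, Sock} = gen_tcp:connect({127,0,0,1}, Port, [binary, {active, false}]),
    Size = recv_loop(Sock, 0),
    ok = gen_tcp:close(LSock),
    Size.

%% Re-arm the socket for exactly one message, handle it, then loop.
recv_loop(Sock, Acc) ->
    ok = inet:setopts(Sock, [{active, once}]),
    receive
        {tcp, Sock, Data} -> recv_loop(Sock, Acc + byte_size(Data));
        {tcp_closed, Sock} -> Acc;
        {tcp_error, Sock, Reason} -> {error, Reason}
    end.
```

Calling active_once_sketch:run() returns the number of bytes received before the close message arrived, which with plain TCP should equal the payload size. With {active, true} the setopts call would move out of the loop and all data would arrive without re-arming, at the cost of losing flow control.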

essen commented 6 years ago

Would be useful to trace what the Gun process is doing. If the Transport:setopts call leads to a socket close then it means the problem is higher up the chain, perhaps in the SSL application. I think there was an issue recently about close events superseding any lingering data, I'll try to look it up later.

To trace:

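%% Start the dbg server and a tracer process that prints to the shell.
dbg:start().
dbg:tracer().
%% Trace calls to every function (exported and local) in the gun module.
dbg:tpl(gun, []).
%% Enable call tracing for all processes.
dbg:p(all, c).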

adrianroe commented 6 years ago

Just tried it locally (OSX - on both OTP 19 and 20) and don't see the issue. I'm sure that it's because of the slowness of my internet connection!

I'll see if I can get a local-only repro (gun -> cowboy), or I can probably get you access to a cloud server where it can be repro'd.

essen commented 6 years ago

Run the commands I gave on the node with the issue (not via remote shell) and I'll be able to have a clearer idea of the issue.

adrianroe commented 6 years ago

trace.txt

essen commented 6 years ago

Right but it terminates the node too early, please don't run the test in an -eval, use a proper shell instead.

adrianroe commented 6 years ago

trace2.txt

adrianroe commented 6 years ago

...and it does look like there is a related ssl issue: http://erlang.org/doc/apps/ssl/notes.html

1.2 SSL 8.2.3

Fixed Bugs and Malfunctions

Packet options cannot be supported for unreliable transports, that is, packet option for DTLS over udp will not be supported.

Own Id: OTP-14664

Ensure data delivery before close if possible. This fix is related to fix in PR-1479.

Own Id: OTP-14794

adrianroe commented 6 years ago

My repro box is running SSL 8.1.1 - I can't install 20 on that box, but might create a new server tomorrow with latest on it...

essen commented 6 years ago

Yeah that's what I would guess happens. Relevant ticket is https://bugs.erlang.org/browse/ERL-420

And sorry about that, should have thought about it earlier, but a more interesting trace would be:

dbg:start().
dbg:tracer().
dbg:tpl(gun, []).
%% Also trace every function in the gun_http module.
dbg:tpl(gun_http, []).
dbg:p(all, c).

adrianroe commented 6 years ago

trace3.txt

Thanks for the help!

essen commented 6 years ago

Sounds like the Gun version is also old. What version is this?

adrianroe commented 6 years ago

Head

essen commented 6 years ago

There's a gun_http:send_data_if_alive/1 call that doesn't exist, and then it immediately calls gun:connect without going through retry_loop? Really weird.

essen commented 5 years ago

Since I've not been able to reproduce and it's been a while, and there's been a number of related ssl bugs fixed in recent Erlang/OTP versions, please try with the most recent version and reopen if there's still an issue. Thanks!

adrianroe commented 5 years ago

For anyone watching this ticket: the symptoms still persist, now caused by a different issue in the Erlang SSL library. See https://bugs.erlang.org/browse/ERL-371 for details.

We have experimented with the master branch referred to in ERL-371 which does indeed seem to prevent the issue.