openssl / openssl

TLS/SSL and crypto library
https://www.openssl.org
Apache License 2.0
25.34k stars 10.04k forks source link

New EOF detection breaks session resumption #11378

Open roussosalex opened 4 years ago

roussosalex commented 4 years ago

We want to do session resumption.

If we follow the documentation of SSL_shutdown:

In case the application wants to be able to resume the session, it is recommended to do a complete shutdown procedure (bidirectional close_notify alerts). [...] The shutdown is not yet finished: the close_notify was sent but the peer did not send it back yet. Call SSL_read() to do a bidirectional shutdown.

Consider the following pseudocode:

// obtain SSL_SESSION via callback during SSL_connect
...

// bidirectional shutdown
if (SSL_shutdown(ssl) == 0)
    SSL_read(ssl, NULL, 0);

// try to resume session via new ssl object
...

Due to new EOF detection (db943f43a60d1b5b1277e4b5317e8f288e7a0a3a), the call to SSL_read fails here and invalidates the session (SSLfatal) breaking resumption.

Version: OpenSSL 1.1.1e and current master (7e06a6758bef584deabc9cb4b0d21b3e664b25c9) Proposed fix: If bidirectional shutdown is supposed to work correctly, a zero-length EOF in SSL_read should not be a fatal error.

/cc @kleest

kroeckx commented 4 years ago

Both peers should always send the close notify (call SSL_shutdown), but then have 2 options: 1) Wait to receive the peer's close notify 2) Because of the protocol, know that the only thing that can come now is the close notify and not wait for it.

My understanding is that you try to do 1), but the server is failing to do it's part and is not sending the close notify.

At the SSL layer, we don't know that the communication is done or not, so if we don't receice the close notify, but do detect an EOF, we need to return an error. If we do not return an error, you might be vulnerable to a truncation attack.

Because you don't know that the other side is going to do 1) or 2), you should always send the close notify.

So I think you now have those options:

beldmit commented 4 years ago

It's worth reminding that openssl s_client/s_server do not provide a relevant example. I agree that the broken one in this pair is s_server, but it's worth fixing it.

roussosalex commented 4 years ago

@beldmit This is not just a s_server problem. We found broken resumption using nginx as well.

@kroeckx While you are right that the server should send a close notify, many servers don't. So "fix the server" is not something we can do. I did a quick test against the Alexa Top 50 and the rate of successful resumptions using bidirectional shutdown went from 70% using 1.1.1d to 30% using 1.1.1e. If SSL_read is going to signal an error, at least don't invalidate the session while doing so.

As a general note, this kind of breaking change is not something I would have expected in a bugfix release.

kroeckx commented 4 years ago

As a general note, this kind of breaking change is not something I would have expected in a bugfix release.

I did not expect software to ignore that we already returned an error. Looking at some code, they seem to have special cased that error and decided to just ignore the error, possibly opening themselves for a truncation attack.

I think it was actually well known that HTTPS is broken in this regard, that many servers do not properly close the connection while they should. The only recommendation I have for that is to not call SSL_read() when you know that you have already received everything, or that you ignore it in that case you have received everything. If you don't know that everything has been received, and you don't get a close notify, you really should get an error.

I'm currently unsure if we should revert this or not. There probably is code where because of the protocol it's clear that everything was send or not, and such code worked properly before if they had a special case for that error. But all other code that just ignores it should get fixed instead. As I understand it, HTTP 1.1 is not always such a protocol. So I guess I'm waiting to see examples of non-broken code that is affected by the change.

kroeckx commented 4 years ago

We can consider revering it in 1.1.1 but keeping it in master/3.0. But I'm currently still unsure what to do, we want to encourage people to fix this. And it seems non-trivial for code that actually knows all data was received and can't avoid calling SSL_read() to ignore the error.

roussosalex commented 4 years ago

I'm currently unsure if we should revert this or not.

This seems like the best solution for the near future until we can find a better way to handle the EOF issue.

And it seems non-trivial for code that actually knows all data was received and can't avoid calling SSL_read() to ignore the error.

We would love to ignore the error, except that SSL_read now calls SSLfatal, which invalidates the session attached to it and makes it not resumable. We have no direct control over the validity of the session and cannot predict if a server will send a close notify or not.

If this problem persists, we can only switch to unidirectional shutdown and hope that the issue gets fixed in a future release.

mattcaswell commented 4 years ago

I'm currently unsure if we should revert this or not.

We effectively changed an error from SSL_ERROR_SYSCALL to SSL_ERROR_SSL. This is correct because our own documentation says to go check errno if you get the former because there's been a system level IO error, or go check the OpenSSL error stack with the latter because there's been some TLS level problem. The problem really is at the TLS level, and errno is 0 in this case so returning SSL_ERROR_SYSCALL is just incorrect behaviour. Since it was a bug fix it met the requirements for backport to 1.1.1

Since we are swapping one type of error for another, one might have expected it to not be too big a deal. Nonetheless, I had a suspicion when I opened the 1.1.1 backport PR this that some code might find this change unexpected. For that reason I requested multiple approvals but still the consensus seemed to be at that time that we should still backport.

As always with bug fixes, one person's bug fix is another person's breaking change, if they were relying on the buggy behaviour. The problem now is having made the decision to backport does it just confuse things further to revert it...only to reintroduce it again in master? I'm also unsure as to the correct answer to this.

beldmit commented 4 years ago

I think, at first, it should be fixed in openssl code itself - at least s_client/s_server pair should be a relevant example.

t8m commented 4 years ago

A compromise could be to not break the session resumption if unexpected EOF is obtained. Is there any security relevant reason why the session should be invalidated in this case?

beldmit commented 4 years ago

I've got some private comments from the Nginx team. They are very unhappy, especially speaking about broken session resumption. If they provide some more details, I'll resend them to the project list.

beldmit commented 4 years ago

http://mailman.nginx.org/pipermail/nginx/2020-March/059175.html

kroeckx commented 4 years ago

A compromise could be to not break the session resumption if unexpected EOF is obtained. Is there any security relevant reason why the session should be invalidated in this case?

I think we should do that.

kroeckx commented 4 years ago

http://mailman.nginx.org/pipermail/nginx/2020-March/059175.html

Note that #11381 talks about that.

roussosalex commented 4 years ago

A compromise could be to not break the session resumption if unexpected EOF is obtained

That would fix at least this issue, but the other EOF issues persist.

roussosalex commented 4 years ago

I would like to suggest adding an option to SSL_set_options to switch between the old and new EOF handling, which would be set to "old" by default. This would solve all EOF related issues for software relying on the old behavior while keeping the option for anyone concerned about #10880.

mattcaswell commented 4 years ago

In practice I'm not sure how many people would actually use that option. I'm wondering whether a better solution is to keep this fix in master, but revert it in 1.1.1 and just document it as a known bug.

beldmit commented 4 years ago

I think reverting in 1.1.1 is the best option.

t8m commented 4 years ago

I am afraid that this is the most reasonable way for 1.1.1.

And still in the master it would be worth allowing the resumption after the unexpected EOF unless we have a very good reason not to do that.

richsalz commented 4 years ago

If you keep it in 1.1.1 (no comment on that), please add a big CHANGES item that says this is broken and is changing in the next release, and point to a FAQ entry that explains what to do.

kaduk commented 4 years ago

Aren't we out of bits in the SSL_OP_ namespace in 1.1.1 anyway?

bagder commented 4 years ago

The server-sides causing these problems will not just fix themselves in the mean-time so they will just show up later rather than sooner if you revert in 1.1.1 and leave enabled in v3. Postponing the decision will not help. It is either a bug or it isn't.

levitte commented 4 years ago

The server-sides causing these problems will not just fix themselves in the mean-time so they will just show up later rather than sooner if you revert in 1.1.1 and leave enabled in v3. Postponing the decision will not help. It is either a bug or it isn't.

I agree with that. However, if things are breaking all around because some servers have chosen to ignore the previous error code (see what @mattcaswell wrote higher up), the fix may have more current casualty than anyone is comfortable with. In an ideal world, all identified bugs would be fixed and rolled out immediately... unfortunately, that's simply not realistic.

I'm not saying which way we should go, I'm frankly undecided on this, and these bits are not my forte, but I agree with @richsalz that if we decide to revert the change in 1.1.1, that reversal should come with prominent documentation, so people have a chance to see that there is an issue, and time to fix their software until they start tackling 3.0.

kroeckx commented 4 years ago

I think we're all agreeing that it is a bug, and that it should be fixed. It just that other software doesn't seem to be ready to be to fix it without breaking too much.

I think we should at least try to find which implementations of https servers don't do this, and try to get those implementations fixed.

mattcaswell commented 4 years ago

The server-sides causing these problems will not just fix themselves in the mean-time so they will just show up later rather than sooner if you revert in 1.1.1 and leave enabled in v3. Postponing the decision will not help. It is either a bug or it isn't.

IMO there is no doubt that this is a bug. However - all bug fixes are behaviour changes. Normally we hope that the behaviour change introduced by a bug fix into a stable branch is desirable because no one wants the old incorrect behaviour. However, every now and then we come across a bug like this one. The old behaviour has been there for so long that other software has been written to expect the incorrect behaviour.

Since 1.1.1 is supposed to be stable, and this has broken stuff, it seems that the correct solution is to revert it. However, 3.0 is a major release. We are trying to keep breaking changes in 3.0 to an absolute minimum, but we do not rule them out entirely. Software authors should reasonably expect to have to test that their software still works when upgrading to 3.0, and might have to make some minor changes in some cases. So, it still seems to me entirely appropriate to fix this bug in that release.

That said this has highlighted a couple of related problems which should be fixed in 3.0:

It's entirely possible that there are other related fixes that we should do to minimise the impact. So it would be good if we could identify those.

t8m commented 4 years ago

Can we at least do the revert in 1.1.1 soon? I can prepare PR for that.

mattcaswell commented 4 years ago

Can we at least do the revert in 1.1.1 soon? I can prepare PR for that.

Please do!

mattcaswell commented 4 years ago

Does this imply that we need to do a 1.1.1f soon?

t8m commented 4 years ago

I suppose so. A related issue to solve in master is also the #11388

t8m commented 4 years ago

PR for the 1.1.1 revert: #11400

richsalz commented 4 years ago

I agree that bugfixes are behavior changes, and that 3.0 is a major release. It would be unfortunate if, for example, it wasn't possible to buid and USE a FIPS-capable curl, for example, or set up a FIPS-capable nginx web server.

Please consider these things when you decide if the EOF detection stays in 3.0

vdukhovni commented 4 years ago
// bidirectional shutdown
if (SSL_shutdown(ssl) == 0)
    SSL_read(ssl, NULL, 0);

Why is SSL_shutdown() being followed by an SSL_read() of zero bytes? The standard way to do a bidirectional shutdown is call SSL_shutdown() again to wait for the peer's close-notify, but that is not necessary, it suffices to for each peer to send their close-notify, there's no need to wait to receive the close notify from the other side, if application-level framing already makes it clear that all data-transmission is done.

vdukhovni commented 4 years ago

I agree that bugfixes are behavior changes, and that 3.0 is a major release. It would be unfortunate if, for example, it wasn't possible to buid and USE a FIPS-capable curl, for example, or set up a FIPS-capable nginx web server.

Please consider these things when you decide if the EOF detection stays in 3.0

I think the fix should stay in 3.0. Migration to 3.0 requires at least a rebuild/test of the code, and most applications should continue to work, and those that don't can be remediated to not be oblivious to truncation attacks.

The issue with introducing this into 1.1.1e is that applications tested with previous 1.1.x releases would now without additional build/test cycles be suddenly exposed to a library that presents a potentially unexpected and not entirely uncommon end-of-session behaviour. That's a higher level of risk than for recertification (build/test) of the application for 3.0, likely on a newer overall OS release, ...

Which is not to say that we couldn't discuss providing backwards-compatibility crutches for 3.0, but I'd like to know more about actual, rather than hypothetical impact. Which applications and which corner cases are running into this?

kroeckx commented 4 years ago

After you've sent a shutdown, the peer can still send data, so you can still receive data. If the peer did still send data, SSL_shutdown() will actually generate an error, because you've lost data. So the recommended way to do it is only call SSL_shutdown() once and then call SSL_read() until it returns SSL_ERROR_ZERO_RETURN. So calling SSL_read() with 0 doesn't make any sense.

roussosalex commented 4 years ago

@vdukhovni We want to be able to resume the session, so following the documentation:

In case the application wants to be able to resume the session, it is recommended to do a complete shutdown procedure (bidirectional close_notify alerts).

Which applications and which corner cases are running into this?

Any application doing bidirectional shutdown (apply attached patched file to s_client) with any affected server, e.g. google.com:

openssl s_client -connect google.com:443 -reconnect -tls1_2

Notice every connection being reported as New instead of Reused with 1.1.1e

s_client_bidirectional.patch.txt

vdukhovni commented 4 years ago

After you've sent a shutdown, the peer can still send data, so you can still receive data. If the peer did still send data, SSL_shutdown() will actually generate an error, because you've lost data. So the recommended way to do it is only call SSL_shutdown() once and then call SSL_read() until it returns SSL_ERROR_ZERO_RETURN. So calling SSL_read() with 0 doesn't make any sense.

Correct. In my comment, I was tacitly assuming that the application protocol does not support half-close, and that at the point at which SSL_shutdown() was being called, neither side had anything more to send. In which one should call either SSL_shutdown() again, or as you point out keep reading and processing (perhaps discarding in a sufficiently odd corner case) residual data until the peer has nothing further to send. I don't understand what the SSL_read() of zero bytes was supposed to accomplish.

The code in question also does not handle WANT_READ/WANT_WRITE, etc. so I assume the sockets in question are not non-blocking... Otherwise there would need to be additional logic to retry the SSL_shutdown() call after waiting for the socket to drain, ... e.g. https://github.com/vdukhovni/postfix/blob/master/postfix/src/tls/tls_bio_ops.c#L192-L286

MylesBorins commented 4 years ago

@sam-github should we land https://github.com/termux/termux-packages/pull/5075 before releasing any versions of node with 1.1.1e?

edit: just noticed that this is not a PR against node.. oops. Should we do something similar in core?

sam-github commented 4 years ago

@MylesBorins There is no evidence from our test suites that Node.js itself is affected by this. From conversation above, it sounds like correct handling of errors is robust to this change. "could we" be affected? Of course, any change could possibly break someone, somewhere, but we don't have evidence of it yet.

The problem we had in our test suites is that the tests that spawn the openssl s_client/s_server CLIs and interact with them (to make sure we can talk to non-nodejs clients/servers) was affected by a bug in s_client, but that's not worth floating a patch for, s_client is only used internally, we don't ship it.

MylesBorins commented 4 years ago

KK, thanks for the heads up

On Wed, Mar 25, 2020 at 2:30 PM Sam Roberts notifications@github.com wrote:

@MylesBorins https://github.com/MylesBorins There is no evidence from our test suites that Node.js itself is affected by this. From conversation above, it sounds like correct handling of errors is robust to this change. "could we" be affected? Of course, any change could possibly break someone, somewhere, but we don't have evidence of it yet.

The problem we had in our test suites is that the tests that spawn the openssl s_client/s_server CLIs and interact with them (to make sure we can talk to non-nodejs clients/servers) was affected by a bug in s_client, but that's not worth floating a patch for, s_client is only used internally, we don't ship it.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/openssl/openssl/issues/11378#issuecomment-604011000, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADZYV33RQVW3SXNMPVBOELRJJEVJANCNFSM4LREG3TA .

kroeckx commented 4 years ago

I think we should at least try to find which implementations of https servers don't do this, and try to get those implementations fixed.

I've been looking into this, checking various sites, among others the alaxa top 50, and only found 4 companies that did not properly send a close notify: Google, Facebook, Cloudflare, Microsoft. I've contacted the first 3 of those without much luck so far.

kaduk commented 4 years ago

Session resumption breaking was unexpected - so we should look to fix that

I've run into this before when writing tests -- if you don't call SSL_shutdown() before SSL_free(), your session gets removed from the internal cache. I'm highly confident this behavior was added for security posture reasons, as a potential truncation of the connection could have been the result of an attacker deliberately removing information that would have made it desirable to not use the session in the course of normal operation. An attempt to change the behavior here should be cognizant of that and do some more research into the scenarios involved.

vdukhovni commented 4 years ago

I've been looking into this, checking various sites, among others the alaxa top 50, and only found 4 companies that did not properly send a close notify: Google, Facebook, Cloudflare, Microsoft. I've contacted the first 3 of those without much luck so far.

In the case of HTTP does it matter whether the server sends close-notify? Provided each request receives a complete response, it is not clear that close-notify is needed. What breaks?

bagder commented 4 years ago

Because there are HTTP responses where the close of the stream signifies the end of body so we need to be sure it was a clean shutdown.

kroeckx commented 4 years ago

In the case of HTTP does it matter whether the server sends close-notify? Provided each request receives a complete response, it is not clear that close-notify is needed. What breaks?

It should be possible to check it at the HTTP level, but each time they tried to check it, they had to revert the changes because it breaks too much. So we are forced to consider HTTP 1.1 to be a protocol that is vulnerable to a truncation attack.

So instead of checking it at the HTTP level, we can check it at the TLS level. But there are many servers that don't send the close notify, and so it's also impossible to check for the truncation attack at the TLS level.

The current result is that no browser has protection against a truncation attack.

davidben commented 4 years ago

@kroeckx That is not quite correct. HTTP/1.1 has a defined terminator for headers (newline) and header blocks (blank line), and three ways to terminate response bodies.

Clients should not process unterminated header blocks. (Note there is some complication here because HTTP/0.9 is somehow not dead yet.) I believe Chrome enforces this over HTTPS (ERR_RESPONSE_HEADERS_TRUNCATED). There are then three ways to terminate response bodies:

  1. If the response has Transfer-Encoding: chunked, each chunk has a length prefix and the whole thing is terminated by an empty chunk. The HTTP client should only report a clean end of stream if the length prefixes are counted up and there is an empty chunk. I believe Chrome enforces this (ERR_INCOMPLETE_CHUNKED_ENCODING).
  2. If the response has a Content-Length header, that is the number of bytes to read. The HTTP client must count that many bytes before reporting a clean end of stream. I believe Chrome enforces this (ERR_CONTENT_LENGTH_MISMATCH).
  3. If neither (1) nor (2) apply, the response is terminated by EOF.

Of these, (3) is the only one that depends on close_notify. Unfortunately, close_notify is a fiction so EOF-terminated response bodies are simply unsafe and HTTP/1.1 servers need to ensure they always include Content-Length if the length is known, and use chunked encoding if the length is unknown. (Or deploy HTTP/2 and then you're fine. Happily ALPN is downgrade-protected.) Note that, for responses that use (1) or (2), whether the server sends close_notify is irrelevant because the client should be rejecting an early termination regardless.

vdukhovni commented 4 years ago

Of these, (3) is the only one that depends on close_notify. Unfortunately, close_notify is a fiction so EOF-terminated response bodies are simply unsafe and HTTP/1.1 servers need to ensure they always include Content-Length if the length is known, and use chunked encoding if the length is unknown. (Or deploy HTTP/2 and then you're fine. Happily ALPN is downgrade-protected.) Note that, for responses that use (1) or (2), whether the server sends close_notify is irrelevant because the client should be rejecting an early termination regardless.

Thanks for the detailed response, this is what I was trying to hint at, lack of close-notify should only affect HTTP/1.0 servers that stream data and can't provide a Content-Length, or poorly-implemented HTTP/1.1 servers that don't use chunked transfer encoding when they should.

I don't know how common HTTP < 1.1 is these days, nor how common it is for 1.1 servers to skimp to not do chunked encoding as required.

bagder commented 4 years ago

The "poorly-implemented HTTP/1.1 servers" are still out there and are being used. How common? Impossible to say.

AdrianBunk commented 4 years ago

I was tacitly assuming that the application protocol does not support half-close, and that at the point at which SSL_shutdown() was being called, neither side had anything more to send. In which one should call either SSL_shutdown() again, or as you point out keep reading and processing (perhaps discarding in a sufficiently odd corner case) residual data until the peer has nothing further to send. I don't understand what the SSL_read() of zero bytes was supposed to accomplish.

People are following the documentation for return value 0 of SSL_shutdown().

Current OpenSSL documents:

Call SSL_read() to do a bidirectional shutdown.

OpenSSL before commit 00f561ab9c7 versions documents:

Call SSL_shutdown() again to do a bidirectional shutdown.

Both have have been part of the documentation of stable OpenSSL releases, so users can expect either of these options to work.

If ABI breaks happen by a change breaking past documented behaviour, please add a new version with a different name instead (e.g. SSL_shutdown2). Silently changing documented behaviour breaks users of OpenSSL, in the worst case silent breakage introducing security vulnerabilities.

Who at OpenSSL is responsible for ensuring that current implementation still matches the behaviour in the documentation of older OpenSSL releases?

kroeckx commented 4 years ago

I think there is a misunderstanding here. The problem is when the other side of the connection does not call SSL_shutdown(). We have always generated an error an error for that, but 1.1.1e generated a different error.

AdrianBunk commented 4 years ago

I think there is a misunderstanding here. The problem is when the other side of the connection does not call SSL_shutdown().

You wrote earlier in this discussion:

The only recommendation I have for that is to not call SSL_read() when you know that you have already received everything, or that you ignore it in that case you have received everything.

You changed the documentation of SSL_shutdown() to read in current stable OpenSSL releases:

Call SSL_read() to do a bidirectional shutdown.

The problem is not the other side of the connection, the problem is that you are breaking working applications that have been coded following your documentation.

We have always generated an error an error for that, but 1.1.1e generated a different error.

If this breaks behaviour documented anywhere in the documentation of any released OpenSSL version, then this might require a renaming of the changed function. Functions changing documented behaviour can be nasty in many different ways when upgrading OpenSSL.

kroeckx commented 4 years ago

Can we stop having this discussion in 2 places at the same time?