async_exec over ssl connection can fail on ubuntu due to weird interaction with openssl

larskanis commented 12 years ago

Original report by Caio Chassot (Bitbucket: kch, GitHub: kch).

Where cert is an instance of OpenSSL::X509::Certificate, and conn an instance of PGconn connected with sslmode:'require', calling conn.async_exec after a call to cert.verify(key) will break with an SSL Error from PG, but only if the verify call returned false.

Everything works fine with sslmode=disabled.
If verify returns true everything works.
Calling exec instead of async_exec also works fine. (But sequel uses async_exec internally.)
So far this was only reproducible on Ubuntu, with both 10.04 and 12.04; works fine on OS X. (Update: also repro'd on CentOS release 6.3.)
The server version/OS doesn't seem to matter much, as long as SSL is required it'll fail. Tested with many Heroku Postgres instances of different plans, and a local OS X ssl-enabled instance.
The specific SSL error PG gives me seem to vary a lot depending on successful previous queries and/or other openssl operations. I didn't try to isolate what causes the message to vary.
Varying versions of libpq and openssl made no difference; tested with both whatever is packaged on the ubuntu relases and manually compiled latest stable source releases; also changing a few compilation options for openssl (-O0, no-asm, no-threads) also made no difference.
The sample code uses a couple certificates to test verify. The certificates used make no difference, other than you need to be able to get both true and false from verify to observe the effects of each result.

Some sample code here: https://gist.github.com/9f341c7a8662cf6d9c84 (There's a comment at the bottom with the output I get when running on ubuntu.)

This sample code will always fail with the following eror:

    PG::Error: SSL error: block type is not 01
        from (irb):60:in `async_exec'

Here are a few other messages I got earlier before reducing this to a test script:

    SSL error: wrong signature length: SELECT * FROM…
    SSL error: EVP lib: SELECT * FROM…
    SSL error: data too large for modulus: SELECT * FROM…
    SSL error: data too large for modulus: SELECT * FROM…

The varying messages lead me to believe the particular message is unimportant, but that there's some state/data corruption going on.

larskanis commented 12 years ago

Original comment by Caio Chassot (Bitbucket: kch, GitHub: kch).

Oh, this non-markdown thing broke all the things. And I can't seem to find a way to edit the report. Y U NO GITHUB?

larskanis commented 12 years ago

Original comment by Jesper Noehr (Bitbucket: jespern, GitHub: jespern).

@kch There should be a big "Edit Issue" button on the top right of this screen.

larskanis commented 12 years ago

Original comment by Michael Granger (Bitbucket: ged, GitHub: ged).

I moved your (excellently formatted, and awesome) report to the ticket body myself.

I do Github, at least as a mirror, but until it supports Mercurial for version control, it's not an option for the main project site.

Anyway, thanks for the excellent report. I'll take a look this weekend.

larskanis commented 12 years ago

Original comment by Caio Chassot (Bitbucket: kch, GitHub: kch).

@jespern really can't see one. maybe need some kind of privilege? I see edit buttons for my comments.

@ged thanks for that.

BTW, also got a repro on CentOS release 6.3 (Final).

larskanis commented 12 years ago

Original comment by Jesper Noehr (Bitbucket: jespern, GitHub: jespern).

@kch Hum. You should be able to edit it since you made the report. I'll investigate.

larskanis commented 12 years ago

Original comment by Jesper Noehr (Bitbucket: jespern, GitHub: jespern).

@kch It's a bug. You should be able to edit your own issues. Fix coming.

larskanis commented 12 years ago

Original comment by Caio Chassot (Bitbucket: kch, GitHub: kch).

@jespern awesome. (yay, two bug reports in one!)

larskanis commented 12 years ago

Original comment by Jesper Noehr (Bitbucket: jespern, GitHub: jespern).

@kch There you go. You should have an edit button now.

larskanis commented 12 years ago

Original comment by Caio Chassot (Bitbucket: kch, GitHub: kch).

@jespern cool, using the edit button now. seems it will create an empty comment on every update, which can be kind of unsightly.

larskanis commented 12 years ago

Original comment by Lars Kanis (Bitbucket: larskanis, GitHub: larskanis).

After a longer debugging session, I was able to track down the cause for this wired behavior. My test environment is:

PostgreSQL 9.1.4
Ruby 1.9.3p125
Ubuntu 12.04

The OpenSSL library maintains a thread local error queue that is filled on any error that occurs within the library. Ruby's openssl extension and libpq both use the same error queue, when running in the same thread. Most errors from in the openssl extension are read out and sent as an Exception, but this is not true for Certificate#verify. It only returns true or false and leaves OpenSSL's error queue untouched.

On the other hand libpq uses SSL_get_error() to retrieve the state of the SSL connection. See [1] and [3]. This call does not inspect the thread error queue in case of blocking behavior. So PG::Connection#exec does not raise an error (at least for the simple test case). When using the async call the error queue becomes inspected and an Exception is raised for the previous verification error.

Simple solution is to use OpenSSL.errors after each failed certificate verification. Since the comment to OpenSSL.errors [2] says, that this is probably an error in the openssl extension, I'm not sure, whether something should be fixed there.

Possibly libpq should be changed in a way, that OpenSSL's thread local error queue is cleared, before every SSL operation, as described in [1].

I don't think that anything is wrong in ruby-pg.

What do you think?

[1] http://www.openssl.org/docs/ssl/SSL_get_error.html

[2] https://github.com/ruby/ruby/blob/trunk/ext/openssl/ossl.c#L349

[3] http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/interfaces/libpq/fe-secure.c;hb=HEAD#l329

larskanis commented 12 years ago

Original comment by Caio Chassot (Bitbucket: kch, GitHub: kch).

Thanks for taking a look.

Matt Zimmermann has looked into this on our end too and arrived at the same conclusion, albeit his suggestion would be to fix this in ruby's openssl library.

I think as far as ruby-pg is concerned you can close this. Let me know if you'd like to be kept in the loop as we file reports with ruby and/or postgres.

larskanis commented 12 years ago

Original comment by Lars Kanis (Bitbucket: larskanis, GitHub: larskanis).

Yes, please let me in the loop for the reports. If you would not have filed a report to ruby and postgres, I would have done it.

larskanis commented 12 years ago

Original comment by Caio Chassot (Bitbucket: kch, GitHub: kch).

To be clear, we haven't filed any reports yet.

It seems you understand the issue in more detail than we do, and could do a more thorough report. If you could please go ahead and file the issues, that would be great.

I'll then ask Zimmermann to comment on them if he has anything to add, and if need be, talk to our postgres and ruby-core contacts to make sure it gets some attention.

larskanis commented 12 years ago

Original comment by Lars Kanis (Bitbucket: larskanis, GitHub: larskanis).

You sad that Matt Zimmermann did a suggestion to fix this in ruby's openssl library. Did he talk about how it should be fixed? I currently see these variants:

Return the OpenSSL error list in Certificate#verify instead of true/false - This will change the API in an incompatible way, so it will be no real option.
Drop the error list at the end of Certificate#verify - So there will be no way to get the particular error text. Maybe add another method in the way as 1.
Add a note in the documentation that suggest the user should call OpenSSL.errors after a failed call to Certificate#verify.

I don't really like any of these, right now.

larskanis commented 12 years ago

Original comment by Caio Chassot (Bitbucket: kch, GitHub: kch).

I just forwarded you matt's email and patch at your faeriemud.org address.

larskanis commented 12 years ago

Original comment by Lars Kanis (Bitbucket: larskanis, GitHub: larskanis).

This is ged's address - can you resend to lars at greiz-reinsdorf.de?

larskanis commented 12 years ago

Original comment by Caio Chassot (Bitbucket: kch, GitHub: kch).

Done.

larskanis commented 12 years ago

Original comment by Caio Chassot (Bitbucket: kch, GitHub: kch).

Just forwarded it. Weird thing: tried posting just "Done" here and comment didn't go through.

larskanis commented 12 years ago

Original comment by Lars Kanis (Bitbucket: larskanis, GitHub: larskanis).

OK, thank you! I don't know, if you tested the Matt's patch, but it should have no effect to your sample, since that code path isn't used there. Nevertheless the change to libpq would be similar - to add ERR_clear_error() in fe-secure.c.

larskanis commented 12 years ago

Original comment by Michael Granger (Bitbucket: ged, GitHub: ged).

I agree with Lars's assessment. The documentation for SSL_get_error(3) is pretty unambiguous about the need to clear the error queue first, so it shouldn't be a hard case to make to the PostgreSQL guys.

larskanis commented 12 years ago

Original comment by Matt Zimmerman (Bitbucket: mdz, GitHub: mdz).

@larskanis my preference is to always check the return code of the call to find out if there was an error, and then use SSL_get_error to get more detail iff the call failed. Relying on the error queue to determine success or failure seems more complex to me, and creates the possibility of bugs like this one (which are impossible if the return code is checked).

larskanis commented 12 years ago

Original comment by Michael Granger (Bitbucket: ged, GitHub: ged).

While I generally agree that checking the return code is better, clearing the queue is less change for all of the cases in question, and I think the heuristic of changing as little logic as possible when submitting patches for other peoples' code trumps the error-check. Just my 2 cents.

larskanis commented 11 years ago

Original comment by Caio Chassot (Bitbucket: kch, GitHub: kch).

Just wondering, anyone opened tickets with postgres and ruby yet?

larskanis commented 11 years ago

Original comment by Lars Kanis (Bitbucket: larskanis, GitHub: larskanis).

I posted the issue to pgsql-hackers@postgresql.org some days ago, with no response till now - so I've inserted it into the patch list for the next commit fest here .

larskanis commented 11 years ago

Original comment by Lars Kanis (Bitbucket: larskanis, GitHub: larskanis).

OK, there is a bug report on ruby-redmine, too .

larskanis commented 11 years ago

Original comment by Michael Granger (Bitbucket: ged, GitHub: ged).

This is a problem with upstream code, and it's been propagated to the appropriate parties.

larskanis commented 11 years ago

Original comment by Michael Granger (Bitbucket: ged, GitHub: ged).

PostgreSQL folks rejected the patch, but the Ruby issue is still open.

larskanis commented 10 years ago

Original comment by Eric Woodruff (Bitbucket: ericwoodruff, GitHub: ericwoodruff).

larskanis, is "Add a note in the documentation that suggest the user should call OpenSSL.errors after a failed call to Certificate#verify" still valid? Just to be sure, it is not OpenSSL.errors.clear or something?

larskanis commented 10 years ago

Original comment by Lars Kanis (Bitbucket: larskanis, GitHub: larskanis).

Unfortunately neither the patch for PostgreSQL was accepted, nor the issue was addressed in Rubys OpenSSL binding. So yes, you should always call OpenSSL.errors after a failed call to Certificate#verify.

larskanis commented 10 years ago

Original comment by kritik (Bitbucket: kritik, GitHub: kritik).

We are getting the same issue. ruby 2.1.0p0 (2013-12-25 revision 44422) [x86_64-linux] ruby-pg is the latest on. We are getting this issue in particular situation. All requests are working but at this place.

What should I do to test?

larskanis commented 10 years ago

Original comment by kritik (Bitbucket: kritik, GitHub: kritik).

the same with ruby 2.1.2p95 (2014-05-08 revision 45877) [x86_64-linux] And what I have found it's present only in one method (I cannot read from database). Everything works in previus method.

UPD: Ok after deep testing I've found that async_exec (error: #<PG::ConnectionBad: PQconsumeInput() SSL error: wrong signature length>) is not working here, however exec works ok

larskanis commented 10 years ago

Original comment by kritik (Bitbucket: kritik, GitHub: kritik).

can we reopen this issue?

larskanis commented 10 years ago

Original comment by Lars Kanis (Bitbucket: larskanis, GitHub: larskanis).

@kritik This issue is still open and still not fixed in ruby-openssl: https://bugs.ruby-lang.org/issues/7215

This is not an issue in ruby-pg, so there is no reason to reopen this ticket. async_exec() is effected by this issue and exec() not, but this is only an implementation detail.

Without a change in ruby-openssl, you should generally call OpenSSL.errors after each Certificate#verify. This is also wise in order to retrieve the reason, why the certificate isn't valid.

larskanis commented 10 years ago

Original comment by kritik (Bitbucket: kritik, GitHub: kritik).

@larskanis ok, it's just weird that problem perists in one particular place and when request comes from controller in rails. From command line it's ok :) Anyway will try to push people on ruby-lang

larskanis commented 9 years ago

Original comment by darkcloud (Bitbucket: darkcloud, GitHub: darkcloud).

We were invoking Dnsruby::Dnssec.validate within Rails and encountering this issue. Invoking OpenSSL.errors when #validate raises exceptions solved it for us. Thank-you everyone for the deep debugging, elegant explanation and of course the workaround.

larskanis commented 8 years ago

Original comment by Lars Kanis (Bitbucket: larskanis, GitHub: larskanis).

Peter Geoghegan from heroku.com finally convinced the PostgreSQL team to fix this issue on their side. So it is on master branch, in 9.6-beta1 and will be backported to all maintained versions: https://commitfest.postgresql.org/9/520/

larskanis commented 6 years ago

Original comment by Lars Kanis (Bitbucket: larskanis, GitHub: larskanis).

It took some time, but in the end this issue has been solved on both the Ruby side and the PostgreSQL side .

larskanis / importtest

async_exec over ssl connection can fail on ubuntu due to weird interaction with openssl #142