stompgem / stomp

A ruby gem for sending and receiving messages from a Stomp protocol compliant message queue. Includes: failover logic, ssl support.
http://stomp.github.com
Apache License 2.0
152 stars 80 forks source link

receive failed: SSL_read:: no start line #29

Closed ripienaar closed 12 years ago

ripienaar commented 12 years ago

This is a issue I've noticed since very early releases, the reason I have not logged this before is that I do not know how to reproduce it reliably.

When connecting to ActiveMQ with SSL enabled I sometimes get the error logged (using the new callback loggers here)

`on_miscerr' Unexpected error on connection stomp://rip@stomp.xx.net:6164: receive failed:  SSL_read:: no start line

This is from the __old_receive method on connection.rb, if I tweak the code to also log the exception class I get:

`on_miscerr' Unexpected error on connection stomp://rip@stomp.xx.net:6164: receive failed: OpenSSL::SSL::SSLError: SSL_read:: no start line

I think I am finally getting near to being able to reproduce this but only in a very complex piece of code, I've not been able to create a simple bit of code that demonstrates this.

do you guys have any idea what might be the cause of this that might assist me in creating a reliable test case for this behavior?

gmallard commented 12 years ago

Are these of any interest, relevant, or helpful?

http://bugs.ruby-lang.org/issues/4550

http://pastebin.com/EjJera3K

http://www.mail-archive.com/openssl-users@openssl.org/msg00760.html

ripienaar commented 12 years ago

Kind of - I did think it's related to the BEGIN line from the cert. The problem is the gem doesnt take a cert I provide it you create those on demand dont you?

https://github.com/morellon/stomp/blob/master/lib/stomp/connection.rb#L625-650

Enabling those settings is a pull request I do want to send you but I am running the normal release that does not read a cert from disk so there isn't much I can tweak here as far as I can tell

gmallard commented 12 years ago

I was troubled by the word 'sometimes' and hoped that those links might provide examples to easily recreate.

I surely admit that the SSL open logic is ........ very basic.

My 'guess' has always been that this:

ctx.verify_mode = OpenSSL::SSL::VERIFY_NONE

meant the no real verification was done by either side - i.e. no certs expected anywhere on the wire. I may be totally off base with that thinking.

I do not see 'VERIFY_NONE' in any Rdocs I have available. I do see the definitions in 'ossl_ssl.c' in the Ruby distro:

ossl_ssl_def_const(VERIFY_NONE);
ossl_ssl_def_const(VERIFY_PEER);
ossl_ssl_def_const(VERIFY_FAIL_IF_NO_PEER_CERT);
ossl_ssl_def_const(VERIFY_CLIENT_ONCE);

Thiago - do know anything about SSL? Those commented out lines in the gem were added by you in late 2009 .......

Pienaar - Are you looking for a two-way handshake, or just one way? And yes, with the gem code like it is, there is not much you can tweak it would seem. Are you using a gem with the comments removed?

I'll do some research .... probably this weekend.

ripienaar commented 12 years ago

I am using your released code exactly at the moment so no patches or uncommented lines.

this isn't urgent - I've known of this bug for years - but it would be nice to get sorted and now I have some more time and I have access to 1000s of servers to recreate the problem.

It would be good to get to a point where only clients with valid CA certs - signed by a private CA perhaps - could connect to ActiveMQ, but I confess I do not yet know much about what is possible with ActiveMQ so I cant say what would be appropriate for the Gem yet.

gmallard commented 12 years ago

Some general comments here.

Have you made any progress in reliably recreating the original problem with a relatively simple test?

I have confirmed that the current use of VERIFY_NONE in the ssl context causes the following to occur:

VERIFY_NONE really ignores many (perhaps all) errors that occur around authentication.

AMQ configuration: if AMQ is configured with a valid keystore and truststore it will attempt to authenticate to the client. And the current use of VERIFY_NONE ignores that.

General support of SSL should support four cases:

With the current GEM, the first one listed is what is used.

One of the challenges around this is that the physical structure of Java store types is not the same and not compatible with OpenSSL stores.

Another challenge will be to consider the best way for clients to request any one of the four options. The comments show using environment variables. I have mixed feelings about that.

I currently have a gem test bed that uses mutual/two-way auth with AMQ. It took a while to get that up and running.

I will document some of the nuances of AMQ set up and requirements for the gem at length, probably on a Wiki page here.

ripienaar commented 12 years ago

Not with simple code no, but I've gotten to a point where with my complex setup I can break it reliably and know what combination of settings triggers that so I am going to work backward from there to get something that reproduces.

Been a bit snowed under this past week but it's on my list to get to the bottom of as its quite critical for me.

gmallard commented 12 years ago

FWIW - the 'no start line' error:

Does not emanate from:

Does emanate from:

And I think it is looking for ( or at ) a:

-----BEGIN CERTIFICATE-----

line.

Food for thought: I read in one report that if one had cut and pasted this from browser to editor, the result might have spaces at the end of this line, which could also result in the error message.

ripienaar commented 12 years ago

Sure, but as we discovered we're not actually using any certificates read from any files so that excludes any copy/paste issues

gmallard commented 12 years ago

That was my original thinking yes, but I now think it is incorrect. Observation tells me otherwise and that:

To show this, modify the gem, and after the line in connection.rb that says:

ssl.connect

Add only these lines:

puts ssl.verify_result
puts ssl.peer_cert
puts ssl.peer_cert.public_key

That data is on the wire at least at the beginning of the connection. And presumably is originally read by the OpenSSL libraries.

Does this problem occur later in a session? Has the connection been idle for lengthy times?

ripienaar commented 12 years ago

Oh interesting, that's good to know.

My observation tends to be that I only get it under load - say I log in and expect 50 messages incmoing of reasonable size, the connection reset happens after say 20 or 30 messages.

It's not just that simple though since I cant just reproduce it by shoving a lot of stuff into a queue and subscribing, it might be related to contents of the stomp messages.

I will go check whats in my keystores exactly, thats a very good observation.

gmallard commented 12 years ago

Checking your keystores is a good idea, it can not hurt. However .....

I am puzzled because this apparently does not occur on 'connect', only later in a session.

ripienaar commented 12 years ago

yes, I've never seen it right on connect.

gmallard commented 12 years ago

A note regarding the four flags allowed for verify mode in an SSL context.

These come directly from the OpenSSL implementation itself. I can find nowhere in Ruby code that there is any real documentation of the behavior expected from these flags.

However, OpenSSL documentation on this is another matter. A fairly good explanation of these flags can be found here:

http://www.openssl.org/docs/ssl/SSL_CTX_set_verify.html

gmallard commented 12 years ago

I know enough about this now that I will be able to replace the comments in SSL connection code, and provide connection ability to use any of the four cases listed above for general SSL capabilities.

You will see commit(s) to that effect in the reasonably near future.

At that time, I will provide examples of SSL use as well.

I will concurrently add documentation to the wiki regarding the nuances of setting up server key and trust stores for this using AMQ, and using cert data in the client.

This does not get any of us closer to being able to reproduce the problem you originally describe.

gmallard commented 12 years ago

One addendum to what is indicated in the previous message.

There will be restrictions around what is involved:

ripienaar commented 12 years ago

hmm, even if we added the CA that self signed that certs into the keystore?

I think that would be a massive hurdle to adoption

gmallard commented 12 years ago

No, not quite.

I will provide docs on adding one or more CA certs to the stores required. For clients you can create your own CA if deemed necessary, and sign client certs using your CA. Or for servers as well. I will provide examples of that on the wiki.

The point is: the certs must be signed - by your privately created CA, by some private CA you simply trust, or by some 'well known' CA. You will need to have the CA's cert to use this.

Re: self signed certs .... I know browsers allow you to wiggle through this. But right now, I think it is a huge security risk for this gem to allow clients to say 'just accept self signed certs' if the caller asks for it. Allowing certs from a CA that:

alleviates that to some extent at least.

If I am wrong in that thinking ..... convince me please.

ripienaar commented 12 years ago

No, I am happy to accept that a CA is needed as long as users own CA can be used in that role :)

thanks for clarification

gmallard commented 12 years ago

I have made some progress with both code and wiki pages. Review at your leisure please.

What JRE version(s) are you running?

'keytool' has added functionality from 1.5 -> 1.6.

ripienaar commented 12 years ago

openjdk 1.6, I think thesedays its perfectly OK to expect people to be on 1.6

I'll review wiki etc during the week, thanks a lot

gmallard commented 12 years ago

Wiki, gem code, and examples all done. Current state at HEAD of master is close to what will go out for SSL in 1.2.1.

gmallard commented 12 years ago

Somewhere above you said: "expect maybe 50 messages of reasonable size". Could you:

I am speculating ...... no real idea if this is important.

ripienaar commented 12 years ago

Yeah, example headers:

 {"expires"=>"0", "timestamp"=>"1330904173389", "content-type"=>"text/plain; charset=UTF-8", "priority"=>"4", "message-id"=>"ID:svcs1.devco.net-39490-1329136152174-5:111823:-1:1:2045", "content-length"=>"6741", "destination"=>"/queue/mcollective.reply.29119"}

It looks like these might not be base64 encoded, I'll try encode them and see if it improves matters. I can reliably reproduce it now - after 23 or so of the messages above I get the error, it auto reconnects and then continues

gmallard commented 12 years ago

I see nothing unusual about those headers. Or the message length. Was wondering if we were talking about 1M+ message sizes.

I will be very surprised if this is related to base64 encoding.

ripienaar commented 12 years ago

Is there any debugging you think I could add or packet captures or anything that might help us narrow it down? It does all look normal to me too

gmallard commented 12 years ago

I have been pondering what I could add to perhaps a test gem that might give us more information. And am at a loss as to what might be added.

Have looked at this enough to know that the error is emanating from OpenSSL code, as I indicated previously. I have been staring at that C code off and on all day. The only place I can positively identify where that error is set looks like a "hard I/O error" to me: a read is issued and nothing comes off the wire.

One thing that certainly baffles me: why is a cert expected by the SSL code during an ongoing session? I always thought the cert exchanges happened only once, at the beginning of a connection sequence. And we seem to be long past that. I can find no documentation anywhere that suggests cert exchanges in the middle of a good connection.

I hate to think about packet traces. Am not sure I would understand what I am looking at very well. Or without a lot of study .... If you want to gather some, please go ahead. We can look at them at any rate.

Personally I think this is a OpenSSL problem/bug, at least right now. I can not find any OpenSSL issues list. Or even any real discussion list.

If you are willing you might try the following:

Modify connection.rb, and in the log_params method add one line:

lparms[:socket] = @socket

You will then see that parameter in your logger call backs.

It should be an SSL socket. And so in the callback you should be able to:

puts parms[:socket].verify_result
puts parms[:socket].peer_cert

I do not think "verify_result" is going to tell us much, but I would like to see what it is.

gmallard commented 12 years ago

When this occurs, is there anything of interest in:

$ACTIVEMQ_HOME/logs/activemq.log
$ACTIVEMQ_HOME/logs/wrapper.log

??

ripienaar commented 12 years ago

From activemq logs I just get this, which seems normal since after the error the gem disconnects:

2012-03-05 12:43:32,849 [8:0:0:0:3:43972] INFO  Transport                      - Transport failed: java.io.EOFException  

When I add your extra params and print them I get

`on_miscerr' Unexpected error on connection stomp://rip@stomp:6164: receive failed: SSL_read:: no start line
`on_miscerr' 19
`on_miscerr' -----BEGIN CERTIFICATE-----
MIIEpzCCA4+gAwIBAgIBdTANBgkqhkiG9w0BAQUFADCBgTETMBEGA1UEAxMKUGlu
.
.
-----END CERTIFICATE-----
ripienaar commented 12 years ago

And indeed as you expected it's not about base64, same behavior with or without encoded payloads

gmallard commented 12 years ago

Grunt.

There are 2 possible 'good' values for 'verify_result':

0 == All cert checks complete with no errors 20 == Server sent cert, client did not authenticate, connection allowed to proceed

So ...... From the Ruby source:

19 == OpenSSL::X509::V_ERR_SELF_SIGNED_CERT_IN_CHAIN

I am not exactly sure what to make of that, but it is a new clue. Comments?

gmallard commented 12 years ago

What is in AMQ's keystore? (Not interested in the truststore here.)

If it is not too 'proprietary' can you show me the output of:

keytool --list --keystore broker.ks  -v

?

ripienaar commented 12 years ago

Nothing special in there: https://gist.github.com/1982515

I search/replaced some key words.

and indeed this is a self signed cert - have a valid CA of course, I can try add the ca to my machine wide CA list in /etc/pki, not sure if the gem cares for those

ripienaar commented 12 years ago

strace shows ruby reading this...which is a list of all my known CAs

open("/etc/pki/tls/cert.pem", O_RDONLY) = 4

So I added the CA that signed that ActiveMQ cert to that file but this didnt change anything, suspect I'll need to set some of the settings in your new code .

but if it's this I'd expect it to break not 20 or 30 packets into the connection

gmallard commented 12 years ago

I see nothing unusual about the keystore. Truth: I do not really understand 'chaining' but that is another matter, and study for another day I think.

I would not expect this failure to occur even 1 message exchange into the connection, much less 20-30. It should occur on connect if at all I think. That is the baffling aspect of this.

Do you think there anything unusual about that strace output? I do not. That did point out something I am thinking about:

on my Ubuntu system I would expect to see /etc/ssl/certs/ca-certificates.crt in strace output but I do not. Instead I see:

open("/usr/lib/ssl/cert.pem", O_RDONLY) = -1 ENOENT

Connections proceed or fail as I expect regardless.

I do not see what the new code might buy us here .... but you can pull and build or can throw you a test gem over the wall. Let me know.

gmallard commented 12 years ago

Looks to me like Ruby 1.8.5 and 1.8.6 do not have the following field:

OpenSSL::SSL::SSLSocket#verify_result

I will need to code around that when passing back connect results.

ripienaar commented 12 years ago

It would vary based on whatever you've got compiled in so for me :

% openssl version -a
OPENSSLDIR: "/etc/pki/tls"

which means it would have compiled in defaults to /etc/pki/tls/cert.pem to get the list of known certs. I guess :)

Anyway, as you say, keystore looks fine, this shouldnt happen mid connection should really happen before the socket is properly opened even, its weird...

I thought with your new code I'd specifically add the new ca into the client but not needed since I added it to my default CAs list, so indeed, not sure much will be gained there.

I did try strace to see if I can figure out what is happening, annoyingly this is not happening under strace!

gmallard commented 12 years ago

Turns out that here /usr/lib/ssl has symlinks to what I expect in /etc/ssl. But there is no 'cert.pem' file or link.

That strace output must have been massive. Sorry to hear that did not work out.

I think adding the CA to the client with the new code is a good test. It might change the behavior of the libraries later in the test.

I copied the cert data I am using to a CentOS VM, and the examples run as I expect.

Copied to Win7, also a VM, and everything fails. I will deal with that later.

Will try OSX in a little bit.

gmallard commented 12 years ago

For the record the examples results at present:

OSX 10.7.1, Ruby 1.9.2p180 - all PASS Win7, Home Premium, Ruby 1.9.2p290 (2011-07-09) [i386-mingw32] - all FAIL CentOS 5.6, Ruby 1.8.5 (2006-08-25) - all PASS

On Ubuntu 11.10 I have:

1) 185_231 (2008-06-20 patchlevel 231) - all PASS 2) 186_383 (2009-08-04 patchlevel 383) - all PASS 3) 187_000 1.8.7 (2008-04-15 patchlevel 0) - not tested 4) 187_299 (2010-06-23 patchlevel 299) - all PASS 5) 191_378 1.9.1p378 (2010-01-10) - all PASS 6) 192_188 1.9.2p188 (2011-03-28 revision 31204) - all PASS 7) 193_134 1.9.3p134 (2012-02-19 revision 34690) - all FAIL 8) ruby 2.0.0dev (2012-02-12 trunk 34563) - all FAIL

I am not too concerned about the bloody edge of the 2.0.0 trunk. But that 1.9.3 result is a bit surprising.

I looked at the above for a bit, thought about the revisions, and then on the Ubuntu box ran a build of:

ruby 1.9.2p290 (2011-07-04 revision 32477) [x86_64-linux]

And .... example test results: all FAIL

My initial conclusion is that something sure as heck happened with Ruby SSL ..... on trunk first, was merged into the 1.9.2 branch, and is still on trunk.

gmallard commented 12 years ago

This is interesting, and I document it here for posterity:

http://mathish.com/2011/01/14/openssl-in-ruby.html

ripienaar commented 12 years ago

Doing some testing of the new SSL stuff in 1.2.1 to see if using some of the use cases removes my problem, thus far up to use case 3 only.

I was wondering how error reporting is handled with the ssl connections? I think I must just be missing something.

For example when I give the incorrect/missing CA file:

ssl_params = Stomp::SSLParams.new(:ts_file => servers_CA_cert)

I'd need some way to notify my users that they did this. The Connection#socket method though that uses open_socket will catch all exceptions and not provide me any of the errors the SSL library raised even errors like Stomp::Error::SSLNoTruststoreFileError will go unreported back to the user

It's as if we need a on_misc_ssl_error or something that will also be passed an error message on these kinds of failures?

ripienaar commented 12 years ago

OK, with all use cases this happens.

I spent a lot of time getting to the bottom of this, it the SSL_read() C function will renegotiate when the server says its needed and I think its this renegotiation that fails we obviously cant affect that in our code

Anyway, so then I did a ton of tests on other rubies and on 1.9.3 I simply cannot reproduce this at all! Seems some change in ruby migh have fixed this :(

gmallard commented 12 years ago

Regarding the 'ssl error reporting' comment:

There may be some reasonable logic missing here. Your 'on_connectfail' callback is driven, correct? But there is no really useful information available in the params, correct?

Let me think about it, but if you have a reasonable solution, please ask for a pull.

Regarding the original problem:

I am not totally surprised that this happens with all use cases. That implementation needed to be done I think, but did not really address your original problem.

The info on the SSL_read C function is interesting. However, needless to say: We are not going to hack ruby builds as part of this gem.

It is interesting that you say 'other rubies and 1.9.3' do not reproduce the problem. My conclusion about that is that it is specific to:

Given that I am really unsure as to what to advise you about the original problem, other that to say: upgrade something!

As an aside: OpenSSL and Ruby versions are an absolute and utter mess at this point in time. I discovered that when trying to get all this SSL work done.

At some point in Ruby development, they discovered 'SSLv2' was 'bad'. They do development on 'trunk'. They disabled SSLv2 on 'trunk' and at some point that was merged into all of the 1.9.x branches.

Almost concurrently, some distributions (particularly Debian based), also discovered that SSLv2 was 'bad'. They disabled SSLv2 in subsequent OpenSSL builds.

Meanwhile, Ruby builds provided no way to say 'SSLv2 = yes/no'.

So 'old' Rubies will not build at all on a system with SSLv2 disabled. And newer Rubies insist that is must be disabled. And that is followed by people asking Ruby dev team to somehow optionally allow builds with SSLv2. Which they are trying to do on all 1.9.x branches I think. That may actually be done, but without docs.

All of that leaves folks like me in a hard place: I have Ruby builds (1.8.x) that were done against an OpenSSL that included V2. Those builds now cannot be recreated because my current OpenSSL has V2 disabled. If I build a local OpenSSL with V2 enabled and build Ruby, that seems to work. But it causes havoc with other applications, particularly 'git'. Something I can not live with.

Grouch, grouch, grouch. I'll think about the ssl error reporting, and you should as well.

ripienaar commented 12 years ago

Sorry, I was not clear - I mean other rubies does it, 1.9.3 does not (that was the only 1.9.x I tried).

This has been reported by each of my users who tried SSL + Stomp on all sorts of OS/distro/version etc, its widespread. So not unique to my environment at all.

I'll script up something to do a lot of git bisect tests first of the ruby releases to identify the one that fixes it and then its commits to track it down.

Agree this is most likely not something we can fix in the stomp gem :( I found http://bugs.ruby-lang.org/issues/4550 interesting and backported the fix there, no change - I know we're not reading keys in the stomp gem after initializing but I do read keys in my code and that might be exactly it.

I'll open another issue for the ssl logging

gmallard commented 12 years ago

What 1.9.3 patch level did you use that does not show the problem?

Does 1.9.3 patch level 0 show that bug? I would try that first to get a 'bad' for bisect.

If you can find the commit that alleviates your problem, perhaps the gem can 'code around' it for some Ruby instances.

I have to ask ..... what are you doing with public keys in your code? What is the purpose of that?

That fix for issue 4550 you mention above went on to trunk on Apr 5, 2011. Is is followed by a string of commits from same author that are related. In the Ruby repo mirror at http://github.com/ruby/ruby.git this starts at e61d269f343f and ends at dd11a58be3. I do not know if that is relevant, but if you only back ported the one commit that may not be enough.

A different issue for ssl logging would be good.

ripienaar commented 12 years ago

It was ruby-1.9.3-p0, I'll have a play next week around those commits thanks

The code in question is a RPC system, the RPC requests are encrypted (or signed) using a users key prior to transmission via stomp and on the receiving side validated/decrypted. So we do lots of RSA in the application that uses the gem.

If I use a security plugin in my code that doesnt use any RSA on my side but just simple pre-shared md5 based hashes then the problem does not appear on any ruby - so i am pretty confident its the bug I mentioned.

I'll test some more later

gmallard commented 12 years ago

See if the latest push is usable for logging exceptions for SSL. Implements the 'on_ssl_connectfail' callback capability.

gmallard commented 12 years ago

Your problem seems to be a 'SSL renegotiation' issue as it does not occur on initial connect.

This might be of interest to you:

http://www.oracle.com/technetwork/java/javase/documentation/tlsreadme2-176330.html

gmallard commented 12 years ago

So ..... the RPC requests are encrypted, and that encrypted request is sent in the Stomp message 'body'?

ripienaar commented 12 years ago

yes, we need to be able to securely identify the sender, so receivers have pubkeys and validate the signatures of the body

Thanks for the link to the oracle doc, I'll work through that

ripienaar commented 12 years ago

With unsafe reneg enabled via -Dsun.security.ssl.allowUnsafeRenegotiation=true it still happens, so thats with the mose insecure settings on the JVM