stompgem / stomp

A Ruby gem for sending and receiving messages from a Stomp protocol compliant message queue. Includes failover logic and SSL support.
http://stomp.github.com
Apache License 2.0

NoCurrentConnection #49

Closed ripienaar closed 11 years ago

ripienaar commented 12 years ago

hello,

In some circumstances @closed gets set to true and remains true; I think this happens when an exception gets raised during open_socket.

Once it's set to true, even if reliable is true etc., the stomp gem will never try to reconnect, since receive, poll, disconnect, unreceive, publish, unsubscribe and more or less all other methods raise the NoCurrentConnection error before doing anything with #socket, which is what would take care of reconnection attempts.

The end result is that forever the stomp gem just raises NoCurrentConnection instead of trying to reconnect.

I can work around this by catching NoCurrentConnection outside of the gem but it feels to me that at least with reliable set true the gem should retry connections rather than raising this exception.

affects version 1.2.4
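
The workaround mentioned above, catching NoCurrentConnection outside the gem, might look roughly like this. This is a hypothetical application-side helper (the wrapper name and backoff values are made up, and the stub error class exists only so the sketch runs without the gem installed):

```ruby
begin
  require 'stomp'
rescue LoadError
  # Stand-in for the gem's error class so this sketch is self-contained.
  module Stomp
    module Error
      class NoCurrentConnection < RuntimeError; end
    end
  end
end

# Hypothetical wrapper: catch NoCurrentConnection and retry with a growing
# delay, instead of letting the exception propagate forever.
def publish_with_retry(conn, destination, message, attempts = 5, delay = 0.1)
  tries = 0
  begin
    conn.publish(destination, message)
  rescue Stomp::Error::NoCurrentConnection
    tries += 1
    raise if tries >= attempts
    sleep(delay * tries)  # simple linear backoff before retrying
    retry
  end
end
```

This keeps the application alive until the connection recovers, at the cost of having to pick a retry budget yourself.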

ripienaar commented 12 years ago

I just noticed 1.2.5 is out and some work was done on the socket handling; sorry if this was fixed there. I will update and keep an eye on it, though for me it happens about once a month.

ripienaar commented 12 years ago

Confirmed to still happen on 1.2.5. Whenever there's some bad networking with lots of packet loss, timeouts and such, things seem to go wrong and it will never auto-reconnect.

gmallard commented 12 years ago

Hmmmm .... can you please take a look at the comments in issue #47, and also at the commit that closed 47?

I tested that by:

a) Establishing a connection
b) Attempting a publish every so often
c) Killing the broker
d) Verifying @closed is true
e) Restarting the broker and verifying that a connection is reestablished

I am missing something here I think. The retries should all be synchronous in the sense they take place immediately if the connection fails. If max retry counts are exceeded then you are done. The connection is dead.

So I do not understand your comment about "other methods" taking care of reconnection attempts. Are multiple threads involved in what you are doing?

Right now I have absolutely no idea how to recreate this.

I will review that fix for issue #47, and sanity check that again.

ripienaar commented 12 years ago

What I mean by other methods taking care of reconnection is this: given the code at https://github.com/stompgem/stomp/blob/master/lib/stomp/connection.rb#L361-370

This is effectively saying:

def publish(destination, message, headers = {})
   # raise if not connected
   # prepare message
   # transmit message via a call to socket
end

The last step - transmit message - will call the socket method, and the socket method will attempt to reconnect if the connection failed

However, if for whatever reason the @closed property is set to true by the time you enter this method, it will raise in the first step, never attempt to call socket, and so never kick in the reconnect logic again.

I might be missing something, but it seems to me that in some circumstances this happens: @closed is set to true by the time execution enters these methods, and from then on it just raises forever instead of reconnecting.

I cannot recreate it by forcibly restarting things or killing brokers; it seems to happen only during bad network conditions like high packet loss, routing errors etc., so it is hard to debug.

Some behaviour has changed in 1.2.5 and I am seeing new logs now, so I guess I need to monitor it more and wait patiently for my next network event before I can really get into it. It's a really difficult thing to reproduce :(

gmallard commented 12 years ago

Yep, I understand what you are saying about a "raise if not connected".

I just reran my "broker kill" test, and that functions as expected.

When you connect are you using:

:max_reconnect_attempts => ????

ripienaar commented 12 years ago

I set max attempts to 0

gmallard commented 12 years ago

With max recon attempts == 0, you should see a "retry forever" behavior somewhere. I do not understand why it seems that is not the case.
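
For reference, the retry-forever semantics under discussion are requested through the connect parameters. A hash along these lines is what the conversation assumes (host, port and credentials here are placeholders, not values from the thread):

```ruby
# Connect parameters as discussed: :reliable turns retries on, and
# :max_reconnect_attempts => 0 means there is no upper retry limit.
client_hash = {
  :hosts => [
    {:login => 'guest', :passcode => 'guest',
     :host => 'broker.example.org', :port => 61613, :ssl => false},
  ],
  :reliable => true,
  :max_reconnect_attempts => 0,  # 0 == retry forever
}
# In real use this hash would be passed to Stomp::Connection.new(client_hash).
```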

Do you have any logs from one of these events that you could share?

Is the connection shared across multiple threads?

ripienaar commented 12 years ago

Yes, it's shared across threads. One is almost constantly in c.receive while others publish.

I'll try to introduce packet loss this weekend and see if I can reproduce it and run my system at debug for the time.

gmallard commented 12 years ago

Given that, your description makes a lot more sense to me.

I suspect one thread, likely the receiver thread is in a retry loop.

The publisher threads do not 'know' that, and they see the 'raise if not connected' behavior.
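
The race being described can be modeled without the gem at all. This toy class is a stand-in, not the gem's actual code; it shows how a publisher thread sees only the raise while a different thread owns the reconnect path:

```ruby
require 'thread'

# Toy stand-in for the gem's connection. While the connection is marked
# closed, publish raises immediately; only the receive path (which in the
# real gem reaches the socket/reconnect logic) ever clears the flag.
class ToyConnection
  class NoCurrentConnection < RuntimeError; end

  def initialize
    @closed = false
    @lock = Mutex.new
  end

  def drop!  # simulate the network failure marking the connection closed
    @lock.synchronize { @closed = true }
  end

  def receive
    # In the real gem, socket() would loop here attempting to reconnect;
    # we model a successful reconnect by clearing the flag.
    @lock.synchronize { @closed = false }
    :frame
  end

  def publish(destination, message)
    # The 'raise if not connected' guard runs before any socket call, so a
    # publisher never reaches the reconnect logic itself.
    raise NoCurrentConnection if @lock.synchronize { @closed }
    :sent
  end
end
```

Until the receiver thread completes its reconnect, every publisher call fails, which matches the behavior reported above.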

Do I understand correctly: you first observed this using base 1.2.4 ?

If yes, then I am still somewhat baffled.

This went in between 1.2.4 and 1.2.5:

https://github.com/stompgem/stomp/commit/270512bf9d1c0ed8e2e3f6d1425527544cab3fce

Comments on that commit?

ripienaar commented 12 years ago

Yes, 1.2.4 and probably earlier did this. I just upgraded to 1.2.5 last night and will have to see if that improves things.

gmallard commented 12 years ago

Given the differences between 1.2.4 and 1.2.5, I do not hold out much hope for an improvement .......

gmallard commented 12 years ago

Your receiver thread ...... what ack mode is the subscription using? Does the receive logic check for 'ERROR' frames and handle them somehow? I would like to know if your receiver has seen an 'ERROR' frame from AMQ that indicates the connection is absolutely dead.

I have been playing, and have one hacked example where AMQ will actively refuse a reconnect because it has not handled an ACK during shutdown / kill. This particular example reconnects fine to Apollo and RabbitMQ.

Sunday afternoon ..... I am going to the pub for a while.

ripienaar commented 12 years ago

I do log, but I have not seen anything related to ERROR frames so far. I have it logging at a higher level now and will just have to see; it's unfortunately over SSL so tcpdump isn't much use.

It doesn't happen that often for me; I will just have to wait it out.

gmallard commented 12 years ago

When this kind of network / broker meltdown occurs, I really need to know if one of your threads is in an infinite retry loop. I believe that should be true, given :max_reconnect_attempts => 0.

The gem originally implemented the 'reliable' concept with 2557ad1 on 20060429. Quite some time ago.

Given your description of the application's design and implementation, I doubt that any gem version could have ever gracefully handled the kind of network / broker behavior you are reporting.

IMO, at a bare minimum, your publish threads need to check @closed before attempting 'conn.publish', and do something sane if the check is true. Go into a sleep / retry loop maybe?
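
That suggestion might be sketched like this. It is a hypothetical helper (the name, poll interval and timeout are made up); it assumes the connection object exposes a closed? predicate, which stomp gem connections provide:

```ruby
# Hypothetical publisher-side guard: poll closed? and back off until the
# reconnect logic (running in another thread) has restored the connection,
# rather than letting publish raise NoCurrentConnection.
def wait_then_publish(conn, destination, message, poll = 0.1, max_wait = 30)
  waited = 0.0
  while conn.closed?
    raise "connection still closed after #{max_wait}s" if waited >= max_wait
    sleep(poll)   # let the reconnecting thread make progress
    waited += poll
  end
  conn.publish(destination, message)
end
```

The timeout keeps a publisher from blocking forever if the reconnect never succeeds; what to do when it fires is an application decision.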

tcpdump and other low level tools .... may introduce behavior that does not match reality. I am sure you are aware of that ......

ripienaar commented 12 years ago

tcpdumping a connection really shouldn't affect it; on unix at least it is completely unrelated to the app, unless the app specifically changes its behaviour in the presence of a promiscuous-mode interface.

Anyway, I've run this specific program with the same design for 3 years now on much worse networks, and I only started noticing situations where it gets stuck unable to reconnect since 1.2.x.

I really can't add much of use at this point until it happens again and I get better logs.

gmallard commented 12 years ago

Hmmmmm ... since 1.2.x.

That is interesting.

There was a lot of logic added between 1.1.10 and 1.2.0, of course. It was an extensive change.

That was all of the initial Stomp 1.1 protocol level logic, as well as other changes. Others were:

f9c3a36 Raise on duplicate subscribe for reliable connections.
7dd9397 Raise if called after Connection#disconnect or Client#close.
3d5b9a2 Add connect timeout override with hashed parameters.

And possibly:

3d5b9a2 Add connect timeout override with hashed parameters.

If you can run a custom build and test again, I suggest trying a 'revert' of 7dd9397 in particular. It would be an interesting experiment.

gmallard commented 12 years ago

When Stomp 1.1 started, the initial thinking behind 7dd9397 was that it would prevent gem users from 'shooting themselves in the foot'.

I still think that reasoning is sound.

However in the next version I will provide a way for that behavior to be nullified on a per connection basis.

ripienaar commented 12 years ago

Had one machine do it again today. It was a pretty low spec vmware based VM, it was having quite high CPU load, and its network connection was having issues. The logs I got were:

activemq.rb:131:in `on_ssl_connectfail' SSL session creation with stomp+ssl://x@x:6165 failed: execution expired
activemq.rb:116:in `on_miscerr' Unexpected error on connection stomp+ssl://x@x:6165: transmit to x failed: SSL_write:: bad write retry
activemq.rb:116:in `on_miscerr' Unexpected error on connection stomp+ssl://x@x:6165: transmit to x failed: SSL_write:: bad write retry

And after this it was game over, so I do think this is multi-thread related. Either an option to nullify it or a supported way to attempt reconnects would be fine. In my case I don't care if these messages get sent while the network is lossy or the machine is struggling; I mostly care that the connection recovers on its own when things return to normal.

gmallard commented 12 years ago

It is unclear to me that nullifying the checks for closed? is actually going to help in your case. Nevertheless expect a trial gem shortly.

gmallard commented 12 years ago

The first log message that contains 'execution expired' has me wondering.... I am now considering the possibility that this line of code:

ssl.connect

should perhaps be something like:

Timeout::timeout(@connect_timeout, Stomp::Error::SocketOpenTimeout) do
  ssl.connect
end

Please confirm/deny: you are running with

:connect_timeout => 0

Please add to your logging code for next time and show me:

parms[:ssl_exception]

ripienaar commented 12 years ago

I am running with connect_timeout 0 yes, will add that logging

ripienaar commented 12 years ago

Oh, I do log the ssl_exception; it's in the logs above:

Log.error("SSL session creation with #{stomp_url(params)} failed: #{params[:ssl_exception]}")

gmallard commented 12 years ago

lol. Sorry, I really meant show me this:

parms[:ssl_exception].backtrace.join("\n")

ripienaar commented 12 years ago

OK :)

gmallard commented 12 years ago

You have described .... a total network meltdown as far as I can see.

Presumably you want retries to occur ..... well, forever essentially.

Consider: Under the following conditions:

nothing is likely to work very well. If a heartbeat send is actually required, the current heartbeat send code (a separate thread) does:

Kaboom. You are done.

I need to think about that case I guess..... but you also should.

ripienaar commented 12 years ago

So I am still not running 1.1 here - activemq bug - but will get there. It's possible that the keep alive thread will try to reconnect when running in that mode, I did not check.

But yes, it's a total network meltdown and later recovery. For me and others I know who use it, the stomp gem is amazing because you can just tell it to connect and forget about it: it always retries, never gets into a state where it can't retry, and basically lets you build software you can forget about. This behavior changes that.

Of course when the network is out there's nothing to be done, messages fail to send etc., totally kewl - but it recovers on its own without programmer/admin/user interaction after the network event, and that was a major bonus of this gem - apart from the fact that it speaks STOMP :P

gmallard commented 11 years ago

AMQ 5.7.0 seems flaky to me. I have tests that have been running for years that fail when using it. Tests that have nothing to do with this discussion.

gmallard commented 11 years ago

Adding a documentation note at this time. I have been unable to recreate this using test strategies described below.

Strategy 1: Broker begins in a stable network environment. Multiple long running multi-threaded publishers and consumers are started. A script that loops endlessly is started that does:

sleep n seconds
broker stop
sleep m seconds
broker start

Strategy 2: Broker begins in a stable network environment. Multiple long running multi-threaded publishers and consumers are started. A script that loops endlessly is started that does:

sleep x seconds
sudo iptables ..... # start dropping packets
sleep y seconds
sudo iptables .... # do not drop packets

Strategy 3: A combination / merge of Strategy 1 and 2.

Using Strategy 3 with poorly (or well?) chosen sleep values, it is possible to encounter situations where a client will take a significant amount of time to actually reconnect. However, in all cases a successful reconnect does eventually happen.
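
A driver for Strategy 3 could look something like the following sketch. Everything here is illustrative: the iptables rule, the port, the broker control commands and the sleep values are placeholders, and the command runner is injected so the schedule can be exercised without root access (real use would pass something like `->(cmd) { system(cmd) }`):

```ruby
# Hypothetical Strategy 3 driver: one cycle of induced packet loss
# followed by a broker bounce. `run` executes a command string and
# `wait` sleeps for the given number of seconds.
DROP_RULE    = 'sudo iptables -A INPUT -p tcp --dport 61613 -j DROP'
RESTORE_RULE = 'sudo iptables -D INPUT -p tcp --dport 61613 -j DROP'

def chaos_cycle(run, wait)
  wait.call(30)              # stable window
  run.call(DROP_RULE)        # start dropping broker traffic
  wait.call(60)              # window of packet loss
  run.call(RESTORE_RULE)     # stop dropping packets
  run.call('broker stop')    # then the Strategy 1 broker bounce
  wait.call(10)
  run.call('broker start')
end
```

Calling chaos_cycle in an endless loop with randomized sleeps would merge the two strategies as described.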

Am certainly open to new suggestions for recreating this.

Note: during this effort I did uncover the bug that reconnect_delay was not reinitialized upon a successful reconnect. That is fixed, but only in the latest gem version. See: 909937101b18e285d4fd2cabebd9c3200c725dc1

ripienaar commented 11 years ago

I am still alive, been traveling a lot so not had time to devote to this.

However, I now have 2 hosts that do this daily, so at least I have somewhere I can play and try various things. I upgraded to your latest release, set the closed_check option, and will monitor.

I should probably chat with you about my general approach re connecting/sending/sharing a connection between threads at some point, as I've found a few more weird things and I think it's because I am just doing it wrong or working from wrong assumptions.

gmallard commented 11 years ago

I will continue to monitor this traffic of course.

As noted in my most recent post, I have been totally unable to recreate ..... what you are apparently seeing.

We can certainly discuss your approach. From the somewhat minimal description you have provided ..... at first blush I see nothing inherently wrong with it.

From a multi-thread perspective it seems there are several basic approaches:

1) Single connection, shared across threads
2) Multiple connections, one per thread
3) A hybrid of 1) and 2)

I think your approach is using option 1).

gmallard commented 11 years ago

Does this app use failover, like

hash = {
  :hosts => [
    # First connect is to remotehost1
    {:login => "login1", :passcode => "passcode1", :host => "remotehost1", :port => 61612, :ssl => true},
    # First failover connect is to remotehost2
    {:login => "login2", :passcode => "passcode2", :host => "remotehost2", :port => 61613, :ssl => false},

  ], .....

????

ripienaar commented 11 years ago

Not in mine, but earlier this week at a client I saw failover also not working in 1.2.x, where it worked with the same code in 1.1.x.

Once I am back from my travels I'll attempt to reproduce that with a simple bit of code; I think that should be easier than this one, from what I saw.

gmallard commented 11 years ago

One more question - since I am on a fishing expedition - are these connections straight TCP or SSL or either?

ripienaar commented 11 years ago

Yes, SSL - the failover issue I had was also with SSL fwiw, but I have not started on reproducing that one.

gmallard commented 11 years ago

I want to try and set up an environment here that is as close to yours as possible. So:

a) What version of AMQ ??
b) Is AMQ running with ?needClientAuth=true ??
c) Are there any .... unusual ... AMQ configuration parameters specified ??
d) Ruby version - one more time please :-)

gmallard commented 11 years ago

Another Q. What is your transport scheme? I am thinking:

stomp+nio+ssl:// ........

but want to confirm.

ripienaar commented 11 years ago

XML file @ http://p.devco.net/268/ - I just removed the auth and acl bits. These connections are to the verified_stompssl port.

ruby 1.8.7 (2011-06-30 patchlevel 352) [x86_64-linux]

gmallard commented 11 years ago

AMQ Version ????

Might be relevant ......

ripienaar commented 11 years ago

Sorry - it's your much-hated 5.7, though I had the same in 5.6 iirc.

My 2 machines that were doing this daily for about a week have now stopped doing it all on their own; so frustrating not to be able to gather info or a means of replication from them.

gmallard commented 11 years ago

So ..... I am thinking I should close this. It can be re-opened at any time of course.

Do you have any updates?

ripienaar commented 11 years ago

No, I had a week where it happened every few hours in November, then it just stopped and never came back - infuriating.

Close it; I'll open it again if I ever get proper data.

gmallard commented 11 years ago

I will close it for now.

Ya' know .... my day to day is in a shop with a very large .NET and Java infrastructure. We use all sorts of tech to do failure detection, fail-over, and as a side effect load balancing.

I think you have some kind of infrastructure problem. (Many possibilities: flaky NICs, flaky switches, unexpected network changes - routing table updates maybe? - the possibilities are almost endless.)

Just my opinion at present.

Regards, Guy

ripienaar commented 11 years ago

Totally - my machines are at lots of small ISPs and communicate over the internet, so it's guaranteed to suck and have issues. The stomp gem pretty much always retries solidly, which is why I like it :)