Closed ripienaar closed 11 years ago
I just noticed 1.2.5 is out and some work was done with socket - sorry if this was fixed there. I will update and keep an eye on it, though for me it happens about once a month.
Confirmed to still happen on 1.2.5. Whenever there's some bad networking with lots of packet loss, timeouts and such, things seem to go wrong and it will never auto reconnect.
Hmmmm .... can you please take a look at the comments in issue #47, and also at the commit that closed 47?
I tested that by:
a) Establishing a connection
b) Attempting a publish every so often
c) Killing the broker
d) Verifying @closed is true
e) Restarting the broker and verifying that a connection is reestablished
I am missing something here I think. The retries should all be synchronous in the sense they take place immediately if the connection fails. If max retry counts are exceeded then you are done. The connection is dead.
So I do not understand your comment about "other methods" taking care of reconnection attempts. Are multiple threads involved in what you are doing?
Right now I have absolutely no idea how to recreate this.
I will review that fix for issue #47, and sanity check that again.
What I mean by other methods taking care of reconnection is this - given the code at https://github.com/stompgem/stomp/blob/master/lib/stomp/connection.rb#L361-370
This is effectively saying:
def publish(destination, message, headers = {})
  # raise if not connected
  # prepare message
  # transmit message via a call to socket
end
The last step - transmit message - will call the socket method, and the socket method will attempt to reconnect if the connection failed
However, if for whatever reason the @closed property is set to true by the time you enter this method, it will raise in the first step, never attempting to call socket and so never kicking in the reconnect logic again.
I might be missing something, but it seems to me that in some circumstances this happens: @closed is set to true when these methods are entered, and from then on it just raises forever instead of reconnecting.
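A self-contained sketch of that failure mode, using a stub in place of the real gem (the class and method names here are illustrative, not the gem's actual internals):

```ruby
class StubConnection
  class NoCurrentConnection < StandardError; end

  attr_reader :socket_calls

  def initialize
    @closed = false
    @socket_calls = 0
  end

  # Stand-in for whatever sets @closed = true (e.g. a failed open_socket).
  def mark_dead!
    @closed = true
  end

  def publish(msg)
    raise NoCurrentConnection if @closed # step 1: raise if not connected
    socket                               # the reconnect logic lives in here
  end

  private

  # Stand-in for the socket method that would attempt a reconnect.
  def socket
    @socket_calls += 1
  end
end

conn = StubConnection.new
conn.publish("ok")      # reaches socket, so reconnect logic is in play
conn.mark_dead!
begin
  conn.publish("lost")  # raises before socket is ever reached again
rescue StubConnection::NoCurrentConnection
  # from here on, every call raises instead of reconnecting
end
```

Once @closed is true, nothing in this path can ever flip it back, which matches the "raises forever" symptom.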
I cannot recreate by forcibly restarting things or killing brokers, it seems to happen only during bad network conditions like high packet loss, routing errors etc so would be hard to debug.
Some behaviour has changed in 1.2.5 and I am seeing new logs now, so I guess I need to monitor it more and wait patiently for my next network event before I can really get into it. It's a really difficult thing to try to reproduce :(
Yep, I understand what you are saying about a "raise if not connected".
I just reran my "broker kill" test, and that functions as expected.
When you connect are you using:
:max_reconnect_attempts => ????
I set max attempts to 0
With max recon attempts == 0, you should see a "retry forever" behavior somewhere. I am not understanding why it seems that is not the case.
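For reference, the connect parameters under discussion might look like this (a sketch; the :hosts entry is illustrative, while :reliable and :max_reconnect_attempts are the gem options the thread refers to):

```ruby
hash = {
  :hosts => [
    {:host => "broker.example.net", :port => 61613},
  ],
  :reliable => true,             # keep retrying on failure
  :max_reconnect_attempts => 0,  # 0 = retry forever
}
```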
Do you have any logs from one of these events that you could share?
Is the connection shared across multiple threads?
Yes, it's shared across threads. One is almost constantly in c.receive while others publish.
I'll try to introduce packet loss this weekend and see if I can reproduce it, and run my system at debug level in the meantime.
Given that, your description makes a lot more sense to me.
I suspect one thread, likely the receiver thread is in a retry loop.
The publisher threads do not 'know' that, and they see the 'raise if not connected' behavior.
Do I understand correctly: you first observed this using base 1.2.4 ?
If yes, then I am still somewhat baffled.
This went in between 1.2.4 and 1.2.5:
https://github.com/stompgem/stomp/commit/270512bf9d1c0ed8e2e3f6d1425527544cab3fce
Comments on that commit?
Yes, 1.2.4 and probably earlier did this. I just upgraded to 1.2.5 last night and will have to see if that improves things.
Given the differences between 1.2.4 and 1.2.5, I do not hold out much hope for an improvement .......
Your receiver thread ...... what ack mode is the subscription using? Does the receive logic check for 'ERROR' frames and handle them somehow? I would like to know if your receiver has seen an 'ERROR' frame from AMQ that indicates the connection is absolutely dead.
I have been playing, and have one hacked example where AMQ will actively refuse a reconnect because it has not handled an ACK during shutdown / kill. This particular example reconnects fine to Apollo and RabbitMQ.
Sunday afternoon ..... I am going to the pub for a while.
I do log, but did not see anything related to error frames. I have it logging at a higher level now and will just have to see; it's unfortunately over SSL so tcpdump isn't much use.
It doesn't happen that often for me, so I will just have to wait it out.
When this kind of network / broker meltdown occurs, I really need to know if one of your threads is in an infinite retry loop. I believe that should be true, given :max_reconnect_attempts => 0.
The gem originally implemented the 'reliable' concept with 2557ad1 on 20060429. Quite some time ago.
Given your description of the application's design and implementation, I doubt that any gem version could have ever gracefully handled the kind of network / broker behavior you are reporting.
IMO, at a bare minimum, your publish threads need to check @closed before attempting to 'conn.publish'. And do something sane if the check is true. Go in to a sleep / retry loop maybe?
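One way the publish threads could implement that guard, sketched with a short sleep/retry loop as suggested; `conn` here is any object exposing closed? and publish, not necessarily the real gem API, and the wait counts are arbitrary:

```ruby
# Wait for the connection to come back before publishing, instead of
# letting NoCurrentConnection propagate out of conn.publish.
def safe_publish(conn, dest, msg, max_waits = 50, delay = 0.1)
  waits = 0
  while conn.closed? && waits < max_waits
    sleep delay        # give the receiver thread's reconnect loop a chance
    waits += 1
  end
  return false if conn.closed?  # still dead: give up (or queue the message)
  conn.publish(dest, msg)
  true
end
```

Returning false rather than raising lets the caller decide whether to drop, queue, or retry the message later.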
tcpdump and other low level tools .... may introduce behavior that does not match reality. I am sure you are aware of that ......
tcpdumping a connection really shouldn't affect it; on unix at least this is completely unrelated to the app, unless the app specifically changes its behaviour in the presence of a promiscuous network interface.
Anyway, I've run this specific program with the same design for 3 years now on much worse networks, and only since 1.2.x have I started noticing it get into a situation where it can't reconnect.
I really can't add much at this point; no use till it happens again and I get better logs.
Hmmmmm ... since 1.2.x.
That is interesting.
There was a lot of logic added between 1.1.10 and 1.2.0 of course. It was an extensive change.
That was all of the initial Stomp 1.1 protocol level logic, as well as other changes. Others were:
f9c3a36 Raise on duplicate subscribe for reliable connections.
7dd9397 Raise if called after Connection#disconnect or Client#close.
And possibly:
3d5b9a2 Add connect timeout override with hashed parameters.
If you can run a custom build and test again, I suggest trying a 'revert' of 7dd9397 in particular. It would be an interesting experiment.
When Stomp 1.1 started, the initial thinking behind 7dd9397 was that it would prevent gem users from 'shooting themselves in the foot'.
I still think that reasoning is sound.
However in the next version I will provide a way for that behavior to be nullified on a per connection basis.
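The thread later mentions a closed_check connect option being set, so the per-connection override might look roughly like this (a sketch; the key name follows that later mention, and the :hosts entry is illustrative):

```ruby
hash = {
  :hosts => [
    {:host => "broker.example.net", :port => 61613},
  ],
  :reliable => true,
  :closed_check => false,  # assumed option: skip the raise-if-closed guard
}
```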
Had one machine do it again today. It was a pretty low spec vmware based VM, it was under quite high CPU load, and its network connection was having issues. The logs I got were:
activemq.rb:131:in `on_ssl_connectfail' SSL session creation with stomp+ssl://x@x:6165 failed: execution expired
activemq.rb:116:in `on_miscerr' Unexpected error on connection stomp+ssl://x@x:6165: transmit to x failed: SSL_write:: bad write retry
activemq.rb:116:in `on_miscerr' Unexpected error on connection stomp+ssl://x@x:6165: transmit to x failed: SSL_write:: bad write retry
And after this it was game over, so I do think this is multi-thread related. Either an option to nullify it or a supported way to attempt reconnect would be fine. In my case I don't care if these messages get sent when the network is lossy or the machine is struggling; I mostly care that the connection recovers on its own when things return to normal.
It is unclear to me that nullifying the checks for closed? is actually going to help in your case. Nevertheless expect a trial gem shortly.
The first log message that contains 'execution expired' has me wondering.... I am now considering the possibility that this line of code:
ssl.connect
should perhaps be something like:
Timeout::timeout(@connect_timeout, Stomp::Error::SocketOpenTimeout) do
ssl.connect
end
Please confirm/deny: you are running with
:connect_timeout => 0
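Worth noting for that combination: Ruby's Timeout wrapper treats a limit of 0 as "no time limit", so with :connect_timeout => 0 the proposed wrapper would run ssl.connect unbounded. A quick standalone check of that stdlib behavior:

```ruby
require 'timeout'

# Timeout::timeout(0) runs the block with no time limit at all, so a
# zero connect_timeout effectively means "wait forever" here.
result = Timeout::timeout(0) { sleep 0.2; :finished }
```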
Please add to your logging code for next time and show me:
parms[:ssl_exception]
I am running with connect_timeout 0 yes, will add that logging
Oh, I do log the ssl_exception it's in the logs above:
Log.error("SSL session creation with #{stomp_url(params)} failed: #{params[:ssl_exception]}")
lol. Sorry, I really meant show me this:
parms[:ssl_exception].backtrace.join("\n")
OK :)
You have described .... a total network meltdown as far as I can see.
Presumably you want retries to occur ..... well, forever essentially.
Consider: under conditions like the ones you describe, nothing is likely to work very well. If a heartbeat send is actually required, the current heartbeat send code (a separate thread) will attempt the send anyway, and then:
Kaboom. You are done.
I need to think about that case I guess..... but you also should.
So I am still not running Stomp 1.1 here - activemq bug - but will get there. It's possible that the keep-alive thread will try to reconnect when running in that mode; I did not check.
But yes, it's a total network meltdown and later recovery. For me and others I know who use it, the stomp gem is amazing because you can just tell it to connect and forget about it: it always retries, never gets into a place where it can't retry, and basically lets you build software you can forget about. This behavior changes that.
Of course when the network is out there's nothing to be done - messages fail to send etc, totally kewl - but it recovers on its own without programmer/admin/user interaction after the network event, and that was a major bonus of this gem - apart from that it speaks STOMP :P
AMQ 5.7.0 seems flaky to me. I have tests that have been running for years that fail when using it. Tests that have nothing to do with this discussion.
Adding a documentation note at this time. I have been unable to recreate this using test strategies described below.
Strategy 1: Broker begins in a stable network environment. Multiple long running multi-threaded publishers and consumers are started. A script that loops endlessly is started that does:
sleep n seconds
broker stop
sleep m seconds
broker start
Strategy 2: Broker begins in a stable network environment. Multiple long running multi-threaded publishers and consumers are started. A script that loops endlessly is started that does:
sleep x seconds
sudo iptables ..... # start dropping packets
sleep y seconds
sudo iptables .... # do not drop packets
Strategy 3: A combination / merge of Strategy 1 and 2.
Using Strategy 3 with poorly (or well?) chosen sleep values, it is possible to encounter situations where a client will take a significant amount of time to actually reconnect. However, in all cases a successful reconnect does eventually happen.
Am certainly open to new suggestions for recreating this.
Note: during this effort I did uncover the bug that reconnect_delay was not reinitialized upon a successful reconnect. That is fixed, but only in the latest gem version. See: 909937101b18e285d4fd2cabebd9c3200c725dc1
I am still alive, been traveling a lot so not had time to devote to this.
However I now have 2 hosts that do this daily, so at least I have somewhere I can play and try various things. Upgraded to your latest release, set the closed_check option, and will monitor.
I should probably chat with you about my general approach to connecting, sending, and sharing a connection between threads at some point, as I've found a few more weird things and I think it's because I am just doing it wrong or based on wrong assumptions.
I will continue to monitor this traffic of course.
As noted in my most recent post, I have been totally unable to recreate ..... what you are apparently seeing.
We can certainly discuss your approach. From the somewhat minimal description you have provided ..... at first blush I see nothing inherently wrong with it.
From a multi-thread perspective it seems there are several basic approaches:
1) Single connection, shared across threads
2) Multiple connections, one per thread
3) A hybrid of 1) and 2)
I think your approach is using option 1).
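Option 1) can be sketched like this with a thread-safe stub standing in for Stomp::Connection (the real gem needs a broker, so everything here is illustrative): one dedicated receiver thread plus several publisher threads sharing a single connection object.

```ruby
require 'thread'

class SharedConn
  def initialize
    @mutex = Mutex.new
    @inbox = Queue.new
  end

  def publish(msg)
    @mutex.synchronize { @inbox << msg } # serialize writes to the socket
  end

  def receive
    @inbox.pop                           # blocks, like Connection#receive
  end
end

conn = SharedConn.new
received = Queue.new

# One thread almost constantly in receive, as described in the thread.
receiver = Thread.new { 3.times { received << conn.receive } }

# Several publisher threads sharing the same connection.
publishers = (0..2).map { |i| Thread.new { conn.publish("msg-#{i}") } }

(publishers + [receiver]).each(&:join)
```

The mutex around publish matters: with a real socket, interleaved writes from multiple threads would corrupt frames.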
Does this app use failover, like
hash = {
:hosts => [
# First connect is to remotehost1
{:login => "login1", :passcode => "passcode1", :host => "remotehost1", :port => 61612, :ssl => true},
# First failover connect is to remotehost2
{:login => "login2", :passcode => "passcode2", :host => "remotehost2", :port => 61613, :ssl => false},
], .....
????
Not in mine, but earlier this week at a client I saw failover also not working in 1.2.x, where it worked with the same code in 1.1.x.
Once I am back from my travels I'll attempt to reproduce that with a simple bit of code. I think that should be easier than this one, from what I saw.
One more question - since I am on a fishing expedition - are these connections straight TCP or SSL or either?
Yes, SSL - the failover issue I had was also with SSL fwiw, but I have not started on reproducing that one.
I want to try and set up an environment here that is as close to yours as possible. So:
a) What version of AMQ ??
b) Is AMQ running with ?needClientAuth=true ??
c) Are there any .... unusual ... AMQ configuration parameters specified ??
d) Ruby version - one more time please :-)
Another Q. What is your transport scheme? I am thinking:
stomp+nio+ssl:// ........
but want to confirm.
XML file @ http://p.devco.net/268/ I just removed the auth and acl bits - these connections are to the verified_stompssl port
ruby 1.8.7 (2011-06-30 patchlevel 352) [x86_64-linux]
AMQ Version ????
Might be relevant ......
Sorry - it's your much hated 5.7, though I had the same in 5.6 iirc.
My 2 machines that were doing this daily for about a week have now stopped doing it all on their own; so frustrating not to be able to gather info or a means of replication from them.
So ..... I am thinking I should close this. It can be re-opened at any time of course.
Do you have any updates?
No, I had a week where it happened every few hours in November, then it just stopped and never came back - infuriating.
Close it; I'll open it again if I ever get proper data.
I will close it for now.
Ya' know .... my day to day is in a shop with a very large .NET and Java infrastructure. We use all sorts of tech to do failure detection, fail-over, and as a side effect load balancing.
I think you have some kind of infrastructure problem. Many possibilities: flaky NICs, flaky switches, unexpected network changes (routing table updates maybe?) - the possibilities are almost endless.
Just my opinion at present.
Regards, Guy
Totally - my machines are at lots of small ISPs and communicate over the internet, so it's guaranteed to suck and have issues. The stomp gem pretty much always retries solidly, which is why I like it :)
hello,
In some circumstances @closed gets set to true and remains true; I think this happens when an exception gets raised during open_socket before it can be set back to false.
Once it's set to true, even if reliable is true etc., the stomp gem will never try to reconnect, since methods like receive, poll, disconnect, unreceive, publish, unsubscribe and more or less all others will raise the NoCurrentConnection error before doing anything with #socket that would take care of reconnection attempts.
The end result is that forever the stomp gem just raises NoCurrentConnection instead of trying to reconnect.
I can work around this by catching NoCurrentConnection outside of the gem but it feels to me that at least with reliable set true the gem should retry connections rather than raising this exception.
affects version 1.2.4
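The workaround of catching NoCurrentConnection outside the gem could look roughly like this (a sketch; the error class nesting matches the gem's Stomp::Error::NoCurrentConnection, but the stub module and the `factory` callable are invented here for illustration):

```ruby
# Stub error class so the sketch is self-contained; in real code this
# is the gem's own Stomp::Error::NoCurrentConnection.
module Stomp
  module Error
    class NoCurrentConnection < StandardError; end
  end
end

# Publish; on NoCurrentConnection rebuild the connection by hand and
# retry once, since the gem itself will no longer reconnect.
def publish_with_rebuild(conn, dest, msg, factory)
  conn.publish(dest, msg)
  conn
rescue Stomp::Error::NoCurrentConnection
  fresh = factory.call   # e.g. a lambda that opens a new connection
  fresh.publish(dest, msg)
  fresh
end
```

Callers keep the returned connection, so a rebuilt connection replaces the dead one for subsequent publishes.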