stompgem / stomp

A ruby gem for sending and receiving messages from a Stomp protocol compliant message queue. Includes: failover logic, ssl support.
http://stomp.github.com
Apache License 2.0
152 stars 80 forks source link

Use a socket timeout instead of Ruby's timeout #121

Closed rrosenblum closed 8 years ago

rrosenblum commented 8 years ago

Ruby's timeout has some serious issues. Rather than using Ruby's timeout for socket timeouts, it would be better to use setting on the socket itself or IO.select.

I was recently profiling a worker that we have that sends messages out to ActiveMQ. During the profiling, I found that nearly 70% of all the memory being allocated for this worker comes from Ruby's timeout. It is a very simple resque worker that gets a message and writes it to ActiveMQ using stomp. We maintain the same Stomp::Client for all of the messages that we send. I profiled this worker with 1000 messages. The total allocated memory was 1527649456 bytes. The total memory being allocated by timeout was 1052953152 bytes.

Example of using timeout on the socket - https://www.mikeperham.com/2009/03/15/socket-timeouts-in-ruby/ Example using IO.select - https://spin.atomicobject.com/2013/09/30/socket-connection-timeout-ruby/

gmallard commented 8 years ago

Thanks for the report.

We will look into this.

My personal recollection is that IO.select was used at one time (a number of years ago), but was abandoned because:

a) It failed on Windows, or b) It failed on JRuby, or c) both a) and b)

Out of curiosity, what profiling tool(s) were used to obtain your results ?

rrosenblum commented 8 years ago

The gem that I used is MemoryProfiler. So far, it has been the best gem that I can find for memory profiling.

That is a shame if IO.select didn't work on Windows or JRuby. It is hopefully still possible to accomplish a timeout using socket options.

gmallard commented 8 years ago
Timeout::timeout ....

is called five places in gem code ( ignoring unit tests ).

For each of:

Are your results 'localized', or for all calls ?

rrosenblum commented 8 years ago

Unfortunately, the results that I was looking at were generalized to Timeout.timeout. I think this is at least partially due to the fact that Timeout.timeout is written in Ruby so MemoryProfiler is able to count it as a source.

For the code that I was running, I was persisting the client and looping over a set of messages. Most likely, because of how this was running, connect and start timeout would be negligible. The code that would be executed most is frame parse and socket read. I did memory profile a modified version of my code that reconnected for every message that was sent. This caused the allocated memory to triple that of a persisted connection.

I can attempt to rerun a profile excluding Ruby source code. I do not know if this will have the effect of hiding the results from Ruby source or if it will cause all of the results to point one level higher.

As far as I know, there should be socket level settings to handle the timeout of connect and read. I would think there is also a socket setting for frame parse. There is also a socket setting for connect timeout. Looking at the code, I assume that the main reason for the start timeout in Client is because of the create_connection. If we can get a socket level connect timeout to work, then there should be no need for the Ruby timeout anymore.

gmallard commented 8 years ago

I have not forgotten about this.

I spent some time this week end starting to put together a test bed, in order to get some before and after results.

I confused myself for a while because of the similarity of gem names:

In any case I would like a little more information from you.

I really think I need to know a little more about the overall design and implementation of your application. Is the application source available to the public?

Your statement that TImeout.timeout was written in Ruby surprised me. I am looking at a ruby 2.3.2p113 system and you are indeed correct. My surprise was because I have looked at Timeout.timeout before, a few years ago. At one point (1.8 or 1.9 maybe) I am pretty confident it was written in C. I wonder why the Ruby developers changed that.

rrosenblum commented 8 years ago

Unfortunately the project that I working on is closed source. I will work on putting together a generic implementation of the code to give to you. To give you some more information on this, the reason that I started looking into this is that we found that our ActiveMQ workers were leaking memory. Running memory_profiler, the retained memory pointed to lib/connection/netio.rb:176 (used_socket.write body unless body == '') being the issue. It seems as though maintaining a long lived socket may cause memory leaks. I also saw a very high amount of allocated memory pointing to Timeout.timeout which prompted me to create this issue.

Without a more detailed example right now, the closest example to our setup is the publisher example. We initialize a connection, then rely on rescue to retrieve messages on a queue that will be sent out.

  • You say you use Stomp::client for 'all the messages you send'. Do you receive messages as well? If so, do the receives happen in the same process ? In short, where does the data to send come from?
  • Does your call to Stomp::client#publish supply a 'receipt listener' call back proc ?

We publish a message and wait for a receipt. The only messages received by this worker are receipts. The implementation is:

receipt = false
client.publish(destination, message, headers) { |r| receipt = r }
Thread.pass until receipt
  • What STOMP protocol level are you using? If you are using 1.1 or 1.2 do you use heartbeats? If you use heartbeats, what are the negotiated values after connect ?
  • Which style of connect parameters are you using? Show me an example please.

Here is an example of our connection parameters

Stomp::Client.new(:connect_headers=>{"accept-version"=>"1.1", "host"=>"host", "heart-beat"=>"15000,20000"},
 :hosts=>[{:host=>"host", :port=>port, :login=>"login", :passcode=>"passcode"}, {:host=>"host", :port=>port, :login=>"login", :passcode=>"passcode"}],
 :reliable=>true,
 :randomize=>false,
 :closed_check=>true,
 :initial_reconnect_delay=>1,
 :max_reconnect_delay=>30,
 :max_reconnect_attempts=>0,
 :max_hbread_fails=>2,
 :max_hbrlck_fails=>2,
 :fast_hbs_adjust=>0.1,
 :use_exponential_back_off=>false,
 :connect_timeout=>5,
 :start_timeout=>nil,
 :logger=> #<StompLogger>)
  • Can you show me a brief code sequence of exactly how you are using the memory_profiler gem?

I had to open up resque and wrap a chunk of their work code in memory_profiler. I will work on getting a better example.

  • What version of ActiveMQ is being used?

5.13.3

  • How many message per hour/minute/second are sent ? Same question about anoy received messages.

Average number of messages sent in a week is 3.4 million. Which comes to about 20,000 per hour, 337 per minute, and 5.6 per second. During busy times we can peak closer to 40,000 messages per hour.

  • How large are the messages (approximately) ?

This may take some digging to get a better number. Without using a networking tool to get the exact size, the contents of one of our messages is about 4K of text. It is a json message with about 196 fields.

I hope that this helps. I will work on putting a generic version of this code to post.

gmallard commented 8 years ago

Very good information, all that will be helpful, I am sure. Thanks much.

You say:

The only messages received by this worker are receipts.

How does the worker get data to send ? If you can supply a little more technical detail I would appreciate that. I am thinking: file? dbus? zmq? .... how?

You also say:

I had to open up resque and wrap a chunk of their work code in memory_profiler. I will work on getting a better example.

And later:

I will work on putting a generic version of this code to post.

A very simple example of the memory_profiler wrapping would be very good. I think I need that.

I also need to think about your use of resque. That might (or might not) change what I need to be thinking about regarding this issue.

And last but not least, thanks for the detailed connect parameters. That alone has given me several items to think about.

Thanks, Guy

gmallard commented 8 years ago

You will get tired of my questions. Is yours a Rails application?

rrosenblum commented 8 years ago

More questions are a good thing. I found out that the average message size is 15993 bytes (15.62K), the minimum message size is 1292 bytes, and the maximum message size is 67782 bytes.

Yes, this is a Rails application. We are running Ruby 2.2.4 with Rails 3.2.22.2 (working on upgrading to 4.2.6) and Resque 1.25.2

How does the worker get data to send ? If you can supply a little more technical detail I would appreciate that. I am thinking: file? dbus? zmq? .... how?

We get the data from Resque and Redis. In our application we will build a message, enqueue it onto a Redis queue and let Resque pick up the work for the worker. We did this to move direct communication with ActiveMQ out of our web app and web services.

I also need to think about your use of resque. That might (or might not) change what I need to be thinking about regarding this issue.

We are experiencing what I believe to be 2 issues.

I am putting together a couple of examples now. I am hoping to post them soon.

rrosenblum commented 8 years ago

Here is a gist of a generic version of our code. The code will not run exactly as is, but I think it is close enough to see what is going on.

Here is a modified version of the publisher example that uses memory profiler and a similar set up to our code. I think that this would be able to show the same issues.

Let me know if you have any other questions.

gmallard commented 8 years ago

Thanks much. I have not looked at those yet ..... but I will.

gmallard commented 8 years ago

So ...... a brief update.

I set up a test bed for this. And it works at this point.

The good news is that I can recreate the effect that you are seeing, that is the memory profiler reports extremely high numbers for file timeout.rb.

I went a bit overboard and have tested with both memory profiler gems (the underscore one which you are using, and the one with a dash in the name).

I want very much to understand why I am seeing what I see. Each call to Timeout::timeout has a specific purpose in the gem. Consider:

  1. One call specifically for the read after a 'CONNECT' . A one time use. Reading a CONNECTED or ERROR frame off the wire.
  2. One call covering reading and parsing a frame on the inbound wire. Inbound messages would be one of: MESSAGE, ERROR, RECEIPT. This code frequently called by applications that use SUBSCRIBE.
  3. One call covering the actual socket open for normal TCP connection. One time use.
  4. One other call covering the actual socket opens for an SSL environment. One time use.
  5. One call specifically covering the connect sequence in Client#initialize. One time use.

There are a couple of other calls in the unit tests, under the test directory. I am ignoring those.

I know these are the only calls because of:

$ pwd
..../stompgem
$ find . -name '*.rb' -exec grep -Hi "Timeout::timeout" {} \;
./test/test_client.rb:    Timeout::timeout(4) do
./test/test_client.rb:    Timeout::timeout(4) do
./test/test_client.rb:    Timeout::timeout(4) do
./test/test_client.rb:    Timeout::timeout(4) do
./test/test_client.rb:        Timeout::timeout(4) do
./lib/stomp/client.rb:        Timeout::timeout(@start_timeout) {
./lib/connection/netio.rb:            Timeout::timeout(@connread_timeout, Stomp::Error::ConnectReadTimeout) do
./lib/connection/netio.rb:        Timeout::timeout(@parse_timeout, Stomp::Error::PacketParsingTimeout) do
./lib/connection/netio.rb:      Timeout::timeout(@connect_timeout, Stomp::Error::SocketOpenTimeout) do
./lib/connection/netio.rb:        Timeout::timeout(@connect_timeout, Stomp::Error::SocketOpenTimeout) do

This leads me to think that the extreme timeout memory use must be triggered by ..... the 'receipt listener' logic is a guess at this point. I want to investigate some more, and report again later.

gmallard commented 8 years ago

Just to note that the memory balloon is confirmed.

I can recreate using a Client#publish, and supplying a receipt listener. If no receipt listener is used Timeout memory is much smaller.

I wanted to rule out the threading logic in Client. And I did that using Connection#publish. If the #publish headers ask for a receipt, and an immediate call to Connection#receive is used to retrieve the RECEIPT frame, memory use increases considerably. Timeout is at the top again.

At this point I will consider exactly what would need to change to eliminate the memory explosion.

I want to consider whether there are alternatives to the idea behind socket timeouts.

rrosenblum commented 8 years ago

I am glad that you were able to reproduce the bloated memory issue, and that it wasn't just something weird in our setup. I would be glad to help code some solutions to this. My only reservation is that I don't feel as though I know the ActiveMQ protocol well enough to know that a change that I make won't have some other side effect.

I went a bit overboard and have tested with both memory profiler gems (the underscore one which you are using, and the one with a dash in the name).

Always good to have a second opinion.

This leads me to think that the extreme timeout memory use must be triggered by ..... the 'receipt listener' logic is a guess at this point.

I agree with this. It seems to be the only recurring call to timeout. Modifying the others would be nice, but it won't have the same impact on memory since they are only called once.

I want to consider whether there are alternatives to the idea behind socket timeouts.

What are your reservations to using socket timeouts? Are there any other ideas that you have had? I was thinking that socket timeouts were the way to go to fix this since it is implemented in a lower level protocol. There should be settings for different operations allowing separate values for connect, read, write, and anything else that might be needed.

One call covering reading and parsing a frame on the inbound wire. Inbound messages would be one of: MESSAGE, ERROR, RECEIPT. This code frequently called by applications that use SUBSCRIBE.

Looking at the code, these all appear to use the same timeout value which may help complexity of a change.

gmallard commented 8 years ago

Regarding help with the coding ..... please fire away if you have the bandwidth. Send me a proposal, and we can talk about it. Push a test branch to your git hub clone of the gem. I can fetch it from there.

Regarding reservations .....

At a 10K foot level, I need to be concerned about impacts to existing gem users. The gem needs to function on:

On a more detailed level:

I will also mention the gem needs to be robust with:

I almost always test against those four brokers (Artemis being fairly recent). One can write spec tests forever ..... and still not know how client code will function in front of a running server.

I am sure there are other factors I should consider, but just have not thought about yet. I will document those as they occur to me.

My approach is going to be:

Alternatives: One at least is that I was thinking that what I really want is a semaphore/lock implementation with a timeout mechanism.

gmallard commented 8 years ago

A short note to confirm that the excessive memory use by timeout is also present on:

gmallard commented 8 years ago

This is to document the expectations for a solution to this issue. I show the kind of results I would like to see.

Two different scenarios are shown below. Both scenarios use:

Scenario 1: the gem as it is today. The beginning of the profiler report is here.

Total allocated: 57422354 bytes (8897 objects)
Total retained:  68295 bytes (5 objects)

allocated memory by gem
-----------------------------------
  53585904  r240_000_7a622b1/lib
   3724733  stompgem/lib
    111717  other

allocated memory by file
-----------------------------------
  53585904  /opt/r240_000_7a622b1/lib/ruby/2.4.0/timeout.rb
   2138242  /ad3/gma/stomp-repos/stompgem/lib/connection/netio.rb
   1057136  /ad3/gma/stomp-repos/stompgem/lib/client/utils.rb
    204369  /ad3/gma/stomp-repos/stompgem/lib/connection/utils.rb
     92800  /ad3/gma/stomp-repos/stompgem/lib/stomp/codec.rb
     73280  /ad3/gma/stomp-repos/stompgem/lib/stomp/connection.rb
     69903  /ad3/gma/stomp-repos/stompgem/adhoc/payload_generator.rb
     68900  /ad3/gma/stomp-repos/stompgem/lib/stomp/message.rb
     53082  /ad3/gma/stomp-repos/stompgem/lib/connection/utf8.rb
     40974  issue121_03.rb
     22480  /ad3/gma/stomp-repos/stompgem/lib/stomp/client.rb
     14444  /ad3/gma/stomp-repos/stompgem/lib/stomp/ext/hash.rb
       840  /ad3/gma/stomp-repos/stompgem/adhoc/stomp_adhoc_common.rb

Scenario 2. Here the calls to Timeout::timeout in the _receive method (netio.rb) were just commented out. No other changes were made. The beginning of the profiler report follows.

Total allocated: 3552187 bytes (8336 objects)
Total retained:  68295 bytes (5 objects)

allocated memory by gem
-----------------------------------
   3440470  stompgem/lib
    111717  other

allocated memory by file
-----------------------------------
   1853979  /ad3/gma/stomp-repos/stompgem/lib/connection/netio.rb
   1057136  /ad3/gma/stomp-repos/stompgem/lib/client/utils.rb
    204369  /ad3/gma/stomp-repos/stompgem/lib/connection/utils.rb
     92800  /ad3/gma/stomp-repos/stompgem/lib/stomp/codec.rb
     73280  /ad3/gma/stomp-repos/stompgem/lib/stomp/connection.rb
     69903  /ad3/gma/stomp-repos/stompgem/adhoc/payload_generator.rb
     68900  /ad3/gma/stomp-repos/stompgem/lib/stomp/message.rb
     53082  /ad3/gma/stomp-repos/stompgem/lib/connection/utf8.rb
     40974  issue121_03.rb
     22480  /ad3/gma/stomp-repos/stompgem/lib/stomp/client.rb
     14444  /ad3/gma/stomp-repos/stompgem/lib/stomp/ext/hash.rb
       840  /ad3/gma/stomp-repos/stompgem/adhoc/stomp_adhoc_common.rb

The timeout.rb file supplied by Ruby does not appear on the second report at all.

gmallard commented 8 years ago

@rrosenblum

If I toss you a gem over the wall in a few days, are you in a position to smoke test it ?

rrosenblum commented 8 years ago

Yes, I can make some time to smoke test a new version of the gem.

gmallard commented 8 years ago

Is e-mail attachment OK? The address you have here on github?

File name: stomp-1.4.0.gem File size: 87,552 bytes SHA1: 144882ff2a2a6e5a3c0a18b84099e556e123524b

Points:

Let me know about e-mail delivery please.

I have given it a pretty good shakedown on Linux and OSX.

I still have some todo's, so a new gem release won't be immediate, but reasonably soon. Partly depending on your results of course. My todo's are:

gmallard commented 8 years ago

On a more technical note: the solution being put together is based on IO::select.

The socket timeout route (SO_RCVTIMEO) will not work. Why? The receive and send timeout functionality is broken, and has been since the early 1.9 versions. Apparently when the Ruby developers switched socket handling to epoll.

rrosenblum commented 8 years ago

An email attachment will work great for me.

That is interesting about the broken implementation of socket timeout. It doesn't surprise me too much though.

rrosenblum commented 8 years ago

Thanks for sending me a copy of the gem. I am seeing a huge improvement on the amount of allocated memory. The new version of the gem allocates 1/3 of the memory. I have run a couple thousand messages through the connection and all of the messages were successfully sent. I plan on wiring up a stress for sending and receiving messages.

Memory stats based on 1000 messages and the setup that I mentioned before. These are running using Ruby 2.2.4 on Red Hat 6.

stomp 1.3.5

Total allocated: 1577153700 bytes (3522085 objects)
Total retained:  11640505 bytes (31463 objects)

allocated memory by gem
-----------------------------------
1054846382  2.2.4/lib
 201115847  resque-1.25.2
  96997322  redis-3.2.2

1052951952  /usr/local/rbenv/versions/2.2.4/lib/ruby/2.2.0/timeout.rb

stomp 1.4.0

Total allocated: 514678096 bytes (3368982 objects)
Total retained:  11643354 bytes (31493 objects)

allocated memory by gem
-----------------------------------
 201115814  resque-1.25.2
  96870178  redis-3.2.2

1050904  /usr/local/rbenv/versions/2.2.4/lib/ruby/2.2.0/timeout.rb

I have a few commits that I will make a PR for shortly that should help with testing (wiring up Travis CI and Code Climate).

gmallard commented 8 years ago

I will push the 1.4.0 version out ..... as soon as I have resolved some difficulties with jruby.

As usual, jruby behavior is not consistent with ruby.

And not even consistent across jruby versions.

rrosenblum commented 8 years ago

That is really annoying about jruby. What kind of inconsistencies are you running into?

gmallard commented 8 years ago

The primary difficulty is determining whether (or not) a socket has data to read. In normal Ruby, using:

opensocket.ready?

gives that information.

Jruby is ..... well different. Some versions return true / false for that as you would wish.

But some versions return a Fixnum (supposedly an indication of how much data there is to read)

Additionally calls to .ready? can be inconsistent. Meaning call it once, it returns a Jruby version of true. Call it again immediately, and it returns a Jruby false.

Much of how to deal with the implementation is ..... trial and error.

The result of that is I tend to ignore it for long periods of time. With the magnitude of this change I felt I needed to review / test with a couple of different versions, making changes according to results.

The last time I really tested was with a 1.6 version. Current versions are 1.7.x (corresponding to Ruby 1.9.x) and 9.x (corresponding to Ruby 2.x). They are different than 1.6, and different from each other.

Look at the checks for @jruby in netio.rb to get a flavor. And there are more to come, this time around.

I am close, but not quite there yet.

gmallard commented 8 years ago

OK, version 1.4.0 is available on rubygems.

Reference: 78446f5

Install in the usual way.

One and all are urged to test this release carefully.

rrosenblum commented 8 years ago

Awesome. After reading your first message, I was worried that it was going to take significantly longer to get the build working with JRuby.

gmallard commented 8 years ago

So was I .... and I make no guarantees other that the specific releases I used during testing.

gmallard commented 8 years ago

Given that no serious issues have resulted from this change, I am closing this issue.

If you have concerns with this, please document them.

rrosenblum commented 8 years ago

Sounds good to me. Thanks again for your help on this.