honeycombio / libhoney-rb

Ruby library for sending data to Honeycomb
Apache License 2.0

Memory issue with ruby Timeout module #73

Closed · davidhu2000 closed this 3 years ago

davidhu2000 commented 3 years ago

We were debugging a memory problem in one of our applications; memory usage was going up by around 2GB over 6 hours. To help debug, we used ObjectSpace to see where the memory was being used.

require 'objspace'

ObjectSpace.trace_object_allocations_start
# ... bunch of code ran
ObjectSpace.trace_object_allocations_stop
ObjectSpace.dump_all(output: File.open("/heap_dump.json", "w"))

# then, from a shell:
heapy read /heap_dump.json

Here is the result of our analysis (about 2 hours of data collection):

Analyzing Heap (Generation: all)
-------------------------------

allocated by memory (1821775615) (in bytes)
==============================
  1681971840  /usr/local/lib/ruby/2.4.0/timeout.rb:87
    13832160  /usr/local/bundle/gems/concurrent-ruby-1.0.5/lib/concurrent/collection/map/mri_map_backend.rb:14
    12853416  /usr/local/bundle/gems/concurrent-ruby-1.0.5/lib/concurrent/collection/map/non_concurrent_map_backend.rb:16
     4217984  /usr/local/bundle/gems/actionview-5.1.4/lib/action_view/template.rb:307

The identified issue is this line in the Ruby Timeout module:

        y = Thread.start {

After some research, this looks to be happening in other places as well (example). On further research, it seems that Ruby's Timeout is considered somewhat dangerous (here and here).

I took a quick look around this repo and didn't see any specific uses of Timeout, but timeout is used by Net::HTTP.

We then monkey-patched the Timeout module and logged the caller to figure out where the massive number of timeout calls was coming from.

module Timeout
  def timeout(sec, klass = nil, message = nil)
    Rails.logger.error("Timeout called by #{caller[0..4].join(" | ") }")
...
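
For completeness, a sketch of what such a patch could look like in full, using Module#prepend so the original Timeout.timeout still runs after logging. The prepend approach and the parts after the truncation are our reconstruction, not the exact patch we ran:

module TimeoutCallerLogging
  # Same signature as Ruby 2.4's Timeout.timeout; log the caller, then
  # defer to the real implementation via super.
  def timeout(sec, klass = nil, message = nil, &block)
    Rails.logger.error("Timeout called by #{caller[0..4].join(" | ")}")
    super
  end
end

# Intercept module-level Timeout.timeout calls.
Timeout.singleton_class.prepend(TimeoutCallerLogging)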

What we found is that we are getting around 185k calls per minute across 4 servers, coming from Honeycomb. Here are the first few stack traces from our log.

Timeout called by 
/usr/local/bundle/gems/libhoney-1.13.0/lib/libhoney/transmission.rb:139:in `block in batch_loop' 
/usr/local/bundle/gems/libhoney-1.13.0/lib/libhoney/transmission.rb:137:in `loop' 
/usr/local/bundle/gems/libhoney-1.13.0/lib/libhoney/transmission.rb:137:in `batch_loop' 
/usr/local/bundle/gems/libhoney-1.13.0/lib/libhoney/transmission.rb:211:in `block (2 levels) in ensure_threads_running'

We turned off Honeycomb and measured memory again, and significantly less memory was used in the Timeout module.

I want to open this issue to see what the team thinks of this, and whether there are alternatives that could help.

Versions of stuff we use

Rails (5.1.4)
Ruby (2.4.9)
Puma (5.0.4)
honeycomb-beeline (2.3.0)
  libhoney (~> 1.14, >= 1.14.2)
martin308 commented 3 years ago

@davidhu2000 thanks for the detailed write-up! We do indeed have one usage of Timeout in libhoney, and we're definitely well aware of its pitfalls. We only use it to wrap a single atomic operation, so we shouldn't hit any of the edge cases that Timeout can cause. I would expect a lot of calls to timeout, as we use it to wait on the queue that batches up events to be sent to Honeycomb. I'm not sure about the usage of Timeout in the http library, but it looks like you are not seeing anything with that in the backtrace?

The memory being used by libhoney should only be the events that are stored in memory waiting to be sent. Are you able to see some kind of memory leak here or just an increase in memory usage? When you say "significantly less memory used" do you have numbers for this?

davidhu2000 commented 3 years ago

Looks like we have two separate stack traces (we added more logging to see). The first one looks to be when we send a trace to Honeycomb; the second one is just the polling?

/usr/local/bundle/gems/http-4.4.1/lib/http/timeout/per_operation.rb:23:in `connect'
/usr/local/bundle/gems/http-4.4.1/lib/http/connection.rb:43:in `initialize'
/usr/local/bundle/gems/http-4.4.1/lib/http/client.rb:71:in `new'
/usr/local/bundle/gems/http-4.4.1/lib/http/client.rb:71:in `perform'
/usr/local/bundle/gems/http-4.4.1/lib/http/client.rb:31:in `request'
/usr/local/bundle/gems/http-4.4.1/lib/http/chainable.rb:27:in `post'
/usr/local/bundle/gems/libhoney-1.16.0/lib/libhoney/transmission.rb:95:in `block in send_loop'
/usr/local/bundle/gems/libhoney-1.16.0/lib/libhoney/transmission.rb:78:in `loop'
/usr/local/bundle/gems/libhoney-1.16.0/lib/libhoney/transmission.rb:78:in `send_loop'
/usr/local/bundle/gems/libhoney-1.16.0/lib/libhoney/transmission.rb:242:in `block (2 levels) in ensure_threads_running'

and

/usr/local/bundle/gems/libhoney-1.16.0/lib/libhoney/transmission.rb:161:in `block (2 levels) in batch_loop'
/usr/local/bundle/gems/libhoney-1.16.0/lib/libhoney/transmission.rb:160:in `handle_interrupt'
/usr/local/bundle/gems/libhoney-1.16.0/lib/libhoney/transmission.rb:160:in `block in batch_loop'
/usr/local/bundle/gems/libhoney-1.16.0/lib/libhoney/transmission.rb:158:in `loop'
/usr/local/bundle/gems/libhoney-1.16.0/lib/libhoney/transmission.rb:158:in `batch_loop'
/usr/local/bundle/gems/libhoney-1.16.0/lib/libhoney/transmission.rb:240:in `block (2 levels) in ensure_threads_running'

In terms of the number of calls, the majority are from the second trace, at around 500 per minute per server.

Looking at the code, this looks to be the block with the issue. I think just using Timeout itself might be causing the bloated memory?

          Thread.handle_interrupt(Timeout::Error => :on_blocking) do
            while (event = Timeout.timeout(@send_frequency) { @batch_queue.pop })
              key = [event.api_host, event.writekey, event.dataset]
              batched_events[key] << event
            end
          end

We have also seen use of Thread causing some memory increases.
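
For context, the reason Timeout is so allocation-heavy is that every call starts a new watchdog thread. Below is a simplified paraphrase of what Ruby core's Timeout.timeout does per call (not the exact stdlib source):

def timeout(sec)
  caller_thread = Thread.current
  # A fresh thread per call: it sleeps for `sec` and raises into the caller
  # if the block is still running. This is the Thread allocation flagged in
  # the heap dumps above.
  watchdog = Thread.start do
    sleep sec
    caller_thread.raise Timeout::Error, "execution expired"
  end
  yield sec
ensure
  watchdog.kill
  watchdog.join
end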

davidhu2000 commented 3 years ago

I would expect a lot of calls to timeout as we use it to wait on the queue that batches up events to be sent to Honeycomb. I'm not sure on the usage of Timeout in the http library but it looks like you are not seeing anything with that in the backtrace?

It's not strictly Net::HTTP related; it's just usage of the Timeout module. Since there is usage in this repo, that's probably the source. The logs we added just tell us who is calling the Timeout module.

The memory being used by libhoney should only be the events that are stored in memory waiting to be sent.

I think the fact that this library is using Timeout is causing this memory leak for us.

Are you able to see some kind of memory leak here or just an increase in memory usage?

The memory continues to increase throughout the lifetime of the application, so we definitely think it's a leak. If it were not a leak, we'd expect the memory to plateau.

When you say "significantly less memory used" do you have numbers for this?

Unfortunately I don't. I can try to pull some numbers; it'll take a while to get an accurate measurement though. If I remember correctly, it's about half.

davidhu2000 commented 3 years ago

We did a few sample tests with Honeycomb on vs. off. Basically, it's a Sidekiq worker that runs a task taking 1s to 15s. Here are the results.

Our code is basically something like

require 'objspace'

ObjectSpace.trace_object_allocations_start
# ... perform some tasks (specifically, open a file and analyze it)
ObjectSpace.trace_object_allocations_stop
dump = ObjectSpace.dump_all
logger.info "Heap dump at #{dump.path}"

Then we use heapy to analyze.

heapy read <file-path> all | grep timeout.rb

With honeycomb (in bytes)

24,165,161
68,244,800
37,797,120
83,994,072
83,993,600

Without honeycomb (in bytes)

 4,199,776
 3,149,760
 2,099,840
 2,099,840
 5,249,600

The only change between the two tests is whether we have Honeycomb installed.

As you can see, removing Honeycomb reduces the memory usage of the Timeout module by roughly 10x.

robbkidd commented 3 years ago

After spending some more quality time with a Beeline'd Rails app, I can confirm the overall assessment here: using Ruby core's Timeout causes significant memory consumption.

The results of running TEST_COUNT=10 bundle exec derailed exec perf:objects on the current-as-of-this-writing example Rails 5.2 app:

retained memory by location
-----------------------------------
   1049320  /ruby/2.5.8/lib/ruby/2.5.0/timeout.rb:83
     32640  /libhoney-1.18.0/lib/libhoney/event.rb:56
...

retained memory by class
-----------------------------------
   1049160  Thread
     85720  Hash
...

Line 83 of Ruby 2.5.8's Timeout assigns a Thread.start to a local variable, which agrees with the Thread class being reported as the worst offender for hanging around.

robbkidd commented 3 years ago

libhoney's use of Timeout is a workaround for Ruby's Queue and SizedQueue classes not providing a timeout for popping objects off the queue.

Possible Solution

Implement our own SizedQueueWithTimeout (working title) that would look very much like Avdi Grimm's Tapas::Queue (i.e. lifted wholesale with thanks and praise and credit); a rough sketch of the idea follows below.

PROS:

CONS:
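
For illustration, here is a minimal sketch of the pop-with-timeout idea using Mutex and ConditionVariable instead of Timeout. Class and method names are placeholders; this is not the implementation that eventually landed in #87:

require 'thread' # Mutex, ConditionVariable

# A bounded queue whose pop takes a timeout without Timeout.timeout.
# ConditionVariable#wait(mutex, timeout) blocks the calling thread directly,
# so no watchdog thread is allocated per call.
class SizedQueueWithTimeout
  def initialize(max)
    @max = max
    @items = []
    @lock = Mutex.new
    @not_empty = ConditionVariable.new
    @not_full = ConditionVariable.new
  end

  def push(item)
    @lock.synchronize do
      @not_full.wait(@lock) while @items.size >= @max
      @items << item
      @not_empty.signal
    end
  end

  # Returns the next item, or nil if nothing arrives within `timeout` seconds.
  def pop(timeout = nil)
    deadline = timeout && Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
    @lock.synchronize do
      while @items.empty?
        remaining = deadline && deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
        return nil if remaining && remaining <= 0
        @not_empty.wait(@lock, remaining)
      end
      item = @items.shift
      @not_full.signal
      item
    end
  end
end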

lewisf commented 3 years ago

@robbkidd wondering if there's a timeline for a fix or if there is a workaround?

Our Sidekiq workers continue to hit our Kubernetes memory limits, crash, and restart, which leads to a number of jobs occasionally failing.

Not sure if helpful, but we're also seeing SEGFAULTs reported in our logs when this happens.

robbkidd commented 3 years ago

@lewisf We've got several things in other integrations ahead of this in our backlog. We don't have an estimate at the moment for when this issue will be addressed.

lewisf commented 3 years ago

@robbkidd I hate to ping again (I understand y'all are busy and appreciate all the work you're doing), but we recently had to remove Honeycomb from production because the memory leaks were causing our processes to die fairly consistently (getting OOM killed), leading to other instability issues.

ajvondrak commented 3 years ago

@lewisf I just talked it over with some folks. As the current thinking stands, we'd have to change the libhoney guts around, which is (a) invasive enough that workarounds are hard to come by and (b) risky enough that patches are high effort and not likely to be finished soon. 😞

How reliant is your code on the Beeline's specific auto-instrumentation? I ask because I poked around the opentelemetry-ruby code base, and I don't actually spot any usage of the timeout module (although there's one spurious require). Their batch span processor appears to use ConditionVariable#wait instead of Timeout.timeout, which is what the hand-rolled queues out there seem to do. Theoretically, this wouldn't split off the new threads that seem to be chewing through memory. So I'm thinking OTel might not be susceptible to the same memory leak issue.

We might use a similar approach in libhoney, but again it's going to take some effort to implement & test. Meanwhile, Honeycomb now accepts OTLP natively, so in theory you could avail of the existing Timeout-free code by porting your use of the Beeline over to OTel. I understand that that changes a lot of variables, though: the auto-instrumentation is going to look different, it's a new library to incorporate, etc. But just thought I'd throw that out there as a possible alternative that doesn't block on Honeycomb engineering. 🙂
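
For reference, configuring the OpenTelemetry Ruby SDK looks roughly like the sketch below. Gem and method names follow the upstream OTel Ruby docs; the service name, endpoint, and header values are placeholders to verify against current Honeycomb documentation:

# Gemfile (sketch): opentelemetry-sdk, opentelemetry-exporter-otlp,
# opentelemetry-instrumentation-all
require 'opentelemetry/sdk'
require 'opentelemetry/exporter/otlp'
require 'opentelemetry/instrumentation/all'

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'my-rails-app'   # placeholder service name
  c.use_all                         # enable all installed auto-instrumentation
end

# The OTLP exporter is configured via the standard env vars, for example:
#   OTEL_EXPORTER_OTLP_ENDPOINT=https://api.honeycomb.io   # or a local Collector
#   OTEL_EXPORTER_OTLP_HEADERS="x-honeycomb-team=YOUR_API_KEY"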

robbkidd commented 3 years ago

A caveat to the opentelemetry-ruby route: OTel Ruby speaks only OTLP-via-HTTP and Honeycomb currently accepts only OTLP-via-GRPC. Using OTel Ruby to send to Honeycomb would require running an OpenTelemetry Collector to receive the OTLP/http and translate traces into OTLP/grpc.

@ajvondrak and I stared contemplatively at this code yesterday. We're investigating an experimental queue replacement.

lewisf commented 3 years ago

Thank you for the update!

We do use the provided Beeline auto-instrumentation. Will read up on OTLP to see how it works and to get a sense of whether we can spin up the necessary effort in the near term.

The experimental queue replacement sounds interesting ... definitely keep us posted.

robbkidd commented 3 years ago

The experiment (#87) is up!

⚠️ Beware of Leopard ⚠️

MikeGoldsmith commented 3 years ago

A caveat to the opentelemetry-ruby route: OTel Ruby speaks only OTLP-via-HTTP and Honeycomb currently accepts only OTLP-via-GRPC. Using OTel Ruby to send to Honeycomb would require running an OpenTelemetry Collector to receive the OTLP/http and translate traces into OTLP/grpc.

@ajvondrak and I stared contemplatively at this code yesterday. We're investigating an experimental queue replacement.

The alternative transmission is now available: #87 has been merged. We also now support OTLP/HTTP at the Honeycomb API, so the OpenTelemetry SDK can be used in place of the Beeline if you'd prefer that.

Closing this as there are now two alternatives. Please reopen if you think it's still relevant.