ruby-shoryuken / shoryuken

A super efficient Amazon SQS thread-based message processor for Ruby. This project is in MAINTENANCE MODE. Reach out via the Slack link in the description if you want to become a new maintainer.

Potential Memory leak #558

Closed · connormurray7 closed this issue 2 years ago

connormurray7 commented 5 years ago

We just moved a large workload over to a pool of Shoryuken workers from Resque and have been noticing a memory leak (the same workload did not exhibit the same memory growth behavior in Resque).

This is our first use of Shoryuken, so we don't have another version to compare it to.

From Gemfile:

  gem 'aws-sdk-s3', '~> 1.8'
  gem 'aws-sdk-sqs', '~> 1.8'
  gem 'shoryuken', '~> 3.0'

From Gemfile.lock:

    aws-sdk-core (3.48.3)
      aws-eventstream (~> 1.0, >= 1.0.2)
      aws-partitions (~> 1.0)
      aws-sigv4 (~> 1.1)
      jmespath (~> 1.0)
    aws-sdk-s3 (1.22.0)
      aws-sdk-core (~> 3, >= 3.26.0)
      aws-sdk-kms (~> 1)
      aws-sigv4 (~> 1.0)
    aws-sdk-sqs (1.9.0)
      aws-sdk-core (~> 3, >= 3.26.0)
      aws-sigv4 (~> 1.0)
    shoryuken (3.3.1)
      aws-sdk-core (>= 2)
      concurrent-ruby
      thor

config:

aws:
  receive_message:
    attribute_names:
    - ApproximateReceiveCount
    - SentTimestamp
concurrency: 1

[Screenshot: per-container memory usage over time, 2019-04-02]

The above chart shows per-container memory usage over time (overlaid are the min, max, and mean memory usage for one instance). All of the containers are cycled, and you can then see them steadily grow over time. There are 10 processes in one container. Each process grows to about 2 GB of memory before the container OOMs. I spent some time looking at heap dumps, and they show significant growth of the following items over time:

/usr/local/bundle/gems/aws-sdk-core-3.34.0/lib/seahorse/util.rb:9:STRING
/usr/local/bundle/gems/aws-sdk-core-3.34.0/lib/seahorse/client/handler_list_entry.rb:71:HASH

Updating aws-sdk-sqs to 1.13 did not fix the issue; those two items still show up as the main sources of growth.
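
(For reference, a minimal sketch of how this kind of allocation-traced heap dump can be captured with Ruby's objspace standard library; the output path and the point at which it runs are illustrative, not necessarily what was used here:)

require 'objspace'

# Record file/line info for subsequent allocations, so the dump contains
# origins like "seahorse/util.rb:9:STRING" as shown above.
ObjectSpace.trace_object_allocations_start

# ... let the worker process run for a while, then, e.g. from a console
# or a signal handler inside that process:
GC.start
File.open("/tmp/heap-#{Process.pid}.json", 'w') do |io|
  ObjectSpace.dump_all(output: io)
end

Comparing two dumps taken some time apart is what surfaces the allocation sites that keep growing.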

Any suggestions or further questions would be greatly appreciated, thanks!

phstc commented 5 years ago

@connormurray7 are all your processes using concurrency 1? If your workers are thread-safe, I would suggest using fewer processes and more concurrency.

Could you share your full shoryuken.yml? The aws: key is no longer supported.

connormurray7 commented 5 years ago

@phstc yes, all of them are using concurrency 1. We have some clients that are not thread-safe, so we are intentionally using more processes with a concurrency of 1. The only other part of our .yml file is the round-robin queue weightings:

queues:
 - [queueOne, 1]
 - [queueTwo, 2]
 - [queueThree, 4]
phstc commented 5 years ago

@connormurray7 is this problem easily reproducible? I'm wondering if the memory leak is related to a specific worker/work load.

Have you tried some combinations, for example reducing the number of processes per container?

connormurray7 commented 5 years ago

@phstc It is easily reproducible for us. The number of processes within the container doesn't matter; each process starts at about 100 MB and grows to 2 GB over time. I tried this with 1, 5, and 10 processes.

From heap dumps, it has something to do with the seahorse client hashes and strings that are not getting cleaned up. I mentioned these classes above:

/usr/local/bundle/gems/aws-sdk-core-3.34.0/lib/seahorse/util.rb:9:STRING
/usr/local/bundle/gems/aws-sdk-core-3.34.0/lib/seahorse/client/handler_list_entry.rb:71:HASH

I originally suspected a memory leak within the aws-sdk-core code, but given that the exact same jobs run in Resque without this memory growth, I don't think it is something in that library.

If you have other tools or suggestions for debugging this issue, I don't mind spending time digging further.
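
(One other approach worth trying here, as a rough sketch rather than anything the reporter used: wrap a single representative unit of work in the memory_profiler gem and look at retained, not just allocated, objects. The worker call below is hypothetical.)

require 'memory_profiler'

report = MemoryProfiler.report do
  # Run one representative piece of job logic here, e.g. (hypothetical):
  # MyWorker.new.perform(sqs_msg, body)
end

# Prints allocated and retained objects grouped by gem, file, and location;
# objects retained after the block ends are the interesting ones for a leak.
report.pretty_print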

Lewiscowles1986 commented 5 years ago

Today at 12:00 BST (UTC+1) an application started failing to deploy due to OOM. It's a Shoryuken async task queue (very boilerplate). We added 50% more resources, and it still has not started. Do you think this could be related? (In local dev it's not too bad; maybe we're being stingy with RAM.)

Lewiscowles1986 commented 5 years ago

Following up on my comment: it actually looks to be webpacker in Cloud Foundry. We are currently not testing our application for memory leaks or unbounded queries, so I don't think this is related to this issue.

phstc commented 5 years ago

@connormurray7 since it is easily reproducible, I would try some simple tests, for example using an empty worker and then slowly adding your real worker logic. I have been misled by heap dumps before.

> I originally suspected a memory leak within the aws-sdk-core code, but given that the exact same jobs run in Resque without this memory growth, I don't think it is something in that library.

  1. How long (AVG) do your jobs take to process?
  2. What's your Ruby version?

@Lewiscowles1986 Most of the memory issues I've seen with Shoryuken (and other thread-based libs) usually happen over time (memory leak) not during deploys/boot.

phstc commented 5 years ago

> From heap dumps, it has something to do with the seahorse client hashes and strings that are not getting cleaned up. I mentioned these classes above:

@connormurray7 Shoryuken memoizes sqs_client. I'm wondering if that has any relation to the hashes and strings not being cleaned up. You could try removing the memoization for testing. I would be surprised if that's the case, since it has been working like that for a while.

nfedyashev commented 4 years ago

@phstc

Your comment about the sqs_client configuration got me thinking that perhaps I've configured it incorrectly, and that could be causing what looks like a memory leak.

[Screenshot: worker memory usage, 2020-02-02]

It starts at ~70 MB RAM and then slowly grows to 10x+ that amount over 24 hours, until Heroku's daily dyno restart cycle kicks in.

% cat config/initializers/shoryuken.rb
Shoryuken.options[:concurrency] = 1
Shoryuken.options[:delay] = 25
Shoryuken.options[:queues] = [
  ENV.fetch('SQS_VOTES_AND_COMMENTS_URL')
]

Shoryuken.configure_client do |config|
  config.sqs_client = Aws::SQS::Client.new(
    region: ENV["AWS_REGION"],
    access_key_id: ENV["AWS_ACCESS_KEY_ID"],
    secret_access_key: ENV["AWS_SECRET_ACCESS_KEY"],
  )
end

Shoryuken.configure_server do |config|
  config.sqs_client = Aws::SQS::Client.new(
    region: ENV["AWS_REGION"],
    access_key_id: ENV["AWS_ACCESS_KEY_ID"],
    secret_access_key: ENV["AWS_SECRET_ACCESS_KEY"],
  )
end
% cat Procfile
worker: bundle exec shoryuken -R

Pablo, would you mind giving me a hint on what I might need to try to handle this issue?

phstc commented 4 years ago

Hi @nfedyashev

There's a possibility that reusing sqs_client is building up the memory consumption.

If your issue is easily reproducible, could you try to add the patch below in your config/initializers/shoryuken.rb to always use a new instance of the SQS client?

# config/initializers/shoryuken.rb
# ...
module Shoryuken
  class Options
    class << self
      def sqs_client
        Aws::SQS::Client.new(
          region: ENV["AWS_REGION"],
          access_key_id: ENV["AWS_ACCESS_KEY_ID"],
          secret_access_key: ENV["AWS_SECRET_ACCESS_KEY"]
        )
      end
    end
  end
end

This may cause some issues with FIFO queues. Please let me know if you are using FIFO, so I can review the patch.

nfedyashev commented 4 years ago

Hi @phstc !

> This may cause some issues with FIFO queues

I'm using Standard queues in this project.

Thanks for the patch! I've just deployed it. So far so good. I'll keep you updated.

nfedyashev commented 4 years ago

@phstc the diff that I deployed ~24 hrs ago was:

--- a/config/initializers/shoryuken.rb
+++ b/config/initializers/shoryuken.rb
-Shoryuken.configure_client do |config|
-  config.sqs_client = Aws::SQS::Client.new(
-    region: ENV["AWS_REGION"],
-    access_key_id: ENV["AWS_ACCESS_KEY_ID"],
-    secret_access_key: ENV["AWS_SECRET_ACCESS_KEY"],
-    #verify_checksums: false
-  )
-end
+
+#https://github.com/phstc/shoryuken/issues/558#issuecomment-581167683
+module Shoryuken
+  class Options
+    class << self
+      def sqs_client
+        Aws::SQS::Client.new(
+          region: ENV["AWS_REGION"],
+          access_key_id: ENV["AWS_ACCESS_KEY_ID"],
+          secret_access_key: ENV["AWS_SECRET_ACCESS_KEY"]
+        )
+      end
+    end
+  end
+end
-Shoryuken.configure_server do |config|
-  config.sqs_client = Aws::SQS::Client.new(
-    region: ENV["AWS_REGION"],
-    access_key_id: ENV["AWS_ACCESS_KEY_ID"],
-    secret_access_key: ENV["AWS_SECRET_ACCESS_KEY"],
-    #verify_checksums: false
-  )
-end

[Screenshot: worker memory usage, 2020-02-04]

Looks like it didn't make much difference: a similar pattern, but perhaps 5% lower RAM consumption than before this update.

nfedyashev commented 4 years ago

@phstc I've tried something else: I removed an unnecessary require from the configure_server block:

 Shoryuken.configure_server do |config|
-  require Rails.root.join("app/jobs/vote_worker")

and memory consumption has dropped pretty noticeably (556 MB -> 470 MB max):

[Screenshot: worker memory usage, 2020-02-05]

HTH

phstc commented 4 years ago

Hey @nfedyashev

Thanks for testing the SQS client hypothesis. That confirms there's no memory issue in reusing the client instance over an extended period.

I didn't see the require in your first snippet.

Shoryuken.configure_server should be called once at the application startup. I don't know if having a require inside or outside would make much difference since it is called once.

Most of the memory issues I've seen so far are related to the worker implementation rather than to Shoryuken itself. Another hypothesis you can test is a NOP worker: deploy an empty Shoryuken worker and try to reproduce the issue, so you can confirm whether the problem is in the worker implementation.
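
(A minimal sketch of such a NOP worker, assuming the standard Shoryuken worker API; the queue name is illustrative:)

class NopWorker
  include Shoryuken::Worker

  # Point this at one of the real queues, e.g. queueOne from the config above.
  shoryuken_options queue: 'queueOne', auto_delete: true

  def perform(sqs_msg, body)
    # Intentionally empty: if memory still grows with this worker,
    # the leak is unlikely to be in the application-level job logic.
  end
end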

What's your worker doing? Some Ruby Gems can keep some state in memory, and that can build up over time.

nfedyashev commented 4 years ago

@phstc

> I didn't see the require in your first snippet.

right, sorry. I thought it was irrelevant to the question.

Thanks for the idea with NOP worker. I'll give it a try a bit later.

> What's your worker doing? Some Ruby Gems can keep some state in memory, and that can build up over time.

Parse a JSON payload, perform 5-10 SQL queries (via ActiveRecord), notify via Aws::CloudWatch::Client, and send an ActionCable notification.
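
(One thing that could be checked for a worker like this, as a speculative sketch only: whether ActiveRecord connection state accumulates across jobs in a long-running process. Shoryuken supports server middleware, so per-job cleanup can be added there; ReleaseConnectionsMiddleware is a hypothetical name, not anything used in this thread.)

class ReleaseConnectionsMiddleware
  def call(worker_instance, queue, sqs_msg, body)
    yield
  ensure
    # Return checked-out ActiveRecord connections to the pool after each job
    # so per-thread connection state doesn't accumulate.
    ActiveRecord::Base.clear_active_connections!
  end
end

Shoryuken.configure_server do |config|
  config.server_middleware do |chain|
    chain.add ReleaseConnectionsMiddleware
  end
end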

Thanks

bestborn14 commented 3 years ago

@nfedyashev I am also facing the same issue.

nfedyashev commented 3 years ago

@bestborn14 sorry, I can't help you much with this one. It's no longer in production, and I don't remember the exact details beyond what's written in this thread.

github-actions[bot] commented 2 years ago

This issue is now marked as stale because it hasn't seen activity for a while. Add a comment or it will be closed soon.

github-actions[bot] commented 2 years ago

This issue was closed because it hasn't seen activity for a while.