Closed: connormurray7 closed this issue 2 years ago.
@connormurray7 are all your processes using concurrency 1? If your workers are thread-safe, I would suggest using fewer processes and more concurrency.
Could you share your full shoryuken.yml? The aws: key is no longer supported.
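A minimal sketch of that suggestion, assuming thread-safe workers (the value 10 below is an arbitrary example, not a recommendation from this thread):
# config/initializers/shoryuken.rb (illustrative only)
# One process running several threads instead of several single-threaded
# processes; this is the same setting as the concurrency key in shoryuken.yml.
Shoryuken.options[:concurrency] = 10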
@phstc yes, all of them are using concurrency 1. We have some clients that are not thread-safe, so we are intentionally using more processes with concurrency 1. The only other part of our .yml file is the round-robin queue weightings:
queues:
- [queueOne, 1]
- [queueTwo, 2]
- [queueThree, 4]
@connormurray7 is this problem easily reproducible? I'm wondering if the memory leak is related to a specific worker/workload.
Have you tried some combinations, for example reducing the number of processes per container?
@phstc It is easily reproducible for us. The number of processes within the container doesn't matter: each process starts at 100 MB and grows to 2 GB over time. I tried this with 1, 5, and 10 processes.
From heap dumps, it has something to do with the seahorse client hashes and strings that are not getting cleaned up. I mentioned these classes above:
/usr/local/bundle/gems/aws-sdk-core-3.34.0/lib/seahorse/util.rb:9:STRING
/usr/local/bundle/gems/aws-sdk-core-3.34.0/lib/seahorse/client/handler_list_entry.rb:71:HASH
I originally suspected a memory leak somewhere inside the aws-sdk-core code, but given that the exact same jobs run in Resque without the memory leak, I don't think it is something in that library.
If you have other tools or suggestions for debugging this issue, I don't mind spending time digging further.
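For comparing heap growth over time, one option (a sketch, assuming Ruby's bundled objspace library; the dump path and trigger are just examples) is to enable allocation tracing at boot and dump the heap periodically, then diff successive dumps:
require 'objspace'

# Record file/line for each allocation so dumps show where objects came from.
ObjectSpace.trace_object_allocations_start

# Call this periodically (e.g. from a signal handler) and compare the dumps.
def dump_heap
  File.open("/tmp/heap-#{Process.pid}-#{Time.now.to_i}.json", "w") do |io|
    ObjectSpace.dump_all(output: io)
  end
end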
Today at 12:00 BST (UTC+1) an application started failing to deploy due to OOM. It's a Shoryuken async task queue (very boilerplate). We added 50% more resources, and it still has not started. Do you think this could be related? (In local dev it's not too bad; maybe we are being stingy with RAM.)
Following up on my comment: it actually looks to be webpacker in Cloud Foundry. We are currently not testing our application for memory leaks or unbounded queries, so I don't think this is related to this issue.
@connormurray7 since it is easily reproducible, I would try some simple tests, for example using an empty worker and then slowly adding your real worker logic. I got misled by heap dumps before.
I originally suspected a memory leak somewhere inside the aws-sdk-core code, but given that the exact same jobs run in Resque without the memory leak, I don't think it is something in that library.
@Lewiscowles1986 Most of the memory issues I've seen with Shoryuken (and other thread-based libs) usually happen over time (memory leak), not during deploys/boot.
From heap dumps, it has something to do with the seahorse client hashes and strings that are not getting cleaned up. I mentioned these classes above:
@connormurray7 Shoryuken memoizes sqs_client. I'm wondering if that may have any relation to the hashes and strings not being cleaned up. You can try removing the memoization for testing. I would be surprised if that's the case, though, since it has been working like that for a while.
@phstc
Your comment about the sqs_client configuration got me thinking that perhaps I've configured it incorrectly and that could be causing what looks like a memory leak.
It starts at ~70 MB of RAM and then slowly grows to 10x+ that amount over a period of 24 hours, until Heroku's daily dyno restart cycle kicks in.
% cat config/initializers/shoryuken.rb
Shoryuken.options[:concurrency] = 1
Shoryuken.options[:delay] = 25
Shoryuken.options[:queues] = [
  ENV.fetch('SQS_VOTES_AND_COMMENTS_URL')
]

Shoryuken.configure_client do |config|
  config.sqs_client = Aws::SQS::Client.new(
    region: ENV["AWS_REGION"],
    access_key_id: ENV["AWS_ACCESS_KEY_ID"],
    secret_access_key: ENV["AWS_SECRET_ACCESS_KEY"],
  )
end

Shoryuken.configure_server do |config|
  config.sqs_client = Aws::SQS::Client.new(
    region: ENV["AWS_REGION"],
    access_key_id: ENV["AWS_ACCESS_KEY_ID"],
    secret_access_key: ENV["AWS_SECRET_ACCESS_KEY"],
  )
end
% cat Procfile
worker: bundle exec shoryuken -R
Pablo, would you mind giving me a hint on what I might need to try to handle this issue?
Hi @nfedyashev
There's a possibility that reusing sqs_client is building up the memory consumption.
If your issue is easily reproducible, could you try adding the patch below to your config/initializers/shoryuken.rb so it always uses a new instance of the SQS client?
# config/initializers/shoryuken.rb
# ...
module Shoryuken
  class Options
    class << self
      def sqs_client
        Aws::SQS::Client.new(
          region: ENV["AWS_REGION"],
          access_key_id: ENV["AWS_ACCESS_KEY_ID"],
          secret_access_key: ENV["AWS_SECRET_ACCESS_KEY"]
        )
      end
    end
  end
end
This may cause some issues with FIFO queues; please let me know if you are using FIFO so I can review the patch.
Hi @phstc !
This may cause some issues with FIFO queues
I'm using Standard queues in this project.
Thanks for the patch! I've just deployed it. So far so good. I'll keep you updated.
@phstc the diff that I deployed ~24 hrs ago was:
--- a/config/initializers/shoryuken.rb
+++ b/config/initializers/shoryuken.rb
-Shoryuken.configure_client do |config|
-  config.sqs_client = Aws::SQS::Client.new(
-    region: ENV["AWS_REGION"],
-    access_key_id: ENV["AWS_ACCESS_KEY_ID"],
-    secret_access_key: ENV["AWS_SECRET_ACCESS_KEY"],
-    #verify_checksums: false
-  )
+
+# https://github.com/phstc/shoryuken/issues/558#issuecomment-581167683
+module Shoryuken
+  class Options
+    class << self
+      def sqs_client
+        Aws::SQS::Client.new(
+          region: ENV["AWS_REGION"],
+          access_key_id: ENV["AWS_ACCESS_KEY_ID"],
+          secret_access_key: ENV["AWS_SECRET_ACCESS_KEY"]
+        )
+      end
+    end
+  end
+end
-Shoryuken.configure_server do |config|
-  config.sqs_client = Aws::SQS::Client.new(
-    region: ENV["AWS_REGION"],
-    access_key_id: ENV["AWS_ACCESS_KEY_ID"],
-    secret_access_key: ENV["AWS_SECRET_ACCESS_KEY"],
-    #verify_checksums: false
-  )
Looks like it didn't make much difference: a similar pattern, but perhaps 5% lower RAM consumption than before this update.
@phstc I've tried something else. I removed an unnecessary require in the configure_server block:
Shoryuken.configure_server do |config|
-  require Rails.root.join("app/jobs/vote_worker")
and memory consumption has dropped pretty noticeably (556 MB -> 470 MB max).
HTH
Hey @nfedyashev
Thanks for testing the SQS client hypothesis; that confirms there's no memory issue in reusing the client instance for an extended period.
I didn't see the require in your first snippet.
Shoryuken.configure_server should be called once at application startup, so I don't know if having the require inside or outside the block would make much difference, since it is only called once.
Most of the memory issues I've seen so far are related to the worker implementation rather than to Shoryuken itself. One hypothesis you can also try is a NOP worker: deploy an empty Shoryuken worker and try to reproduce the issue, so you can confirm whether the problem is in the worker implementation.
What's your worker doing? Some Ruby Gems can keep some state in memory, and that can build up over time.
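A minimal sketch of such a NOP worker (the queue name and options here are illustrative, not taken from this thread):
class NoopWorker
  include Shoryuken::Worker
  shoryuken_options queue: 'my-test-queue', auto_delete: true

  def perform(_sqs_msg, _body)
    # Intentionally empty: if memory still grows with this worker, the leak
    # is unlikely to be in the worker implementation.
  end
end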
@phstc
I didn't see the require in your first snippet.
Right, sorry. I thought it was irrelevant to the question.
Thanks for the idea of a NOP worker. I'll give it a try a bit later.
What's your worker doing? Some Ruby Gems can keep some state in memory, and that can build up over time.
Parse a JSON payload, perform 5-10 SQL queries (via ActiveRecord), notify via Aws::CloudWatch::Client, send an ActionCable notification.
Thanks
@nfedyashev I am also facing the same issue.
@bestborn14 sorry, I can't help you much with this one. It's no longer in production, and I don't remember the exact details other than what is written in this thread.
This issue is now marked as stale because it hasn't seen activity for a while. Add a comment or it will be closed soon.
This issue was closed because it hasn't seen activity for a while.
We just moved a large workload over to a pool of Shoryuken workers from Resque and have been noticing a memory leak (the same workload did not exhibit the same memory growth behavior in Resque).
This is our first use of Shoryuken, so we don't have another version to compare it to.
From Gemfile:
From Gemfile.lock:
config:
The above chart shows per-container memory usage over time (overlaid is an instance with the min, max, and mean memory usage). All of the containers are cycled, and then you can see that they steadily grow over time. There are 10 processes in one container. Each process grows to use about 2 GB of memory before the container OOMs. I spent some time looking at heap dumps, and they show significant growth of the following items over time:
/usr/local/bundle/gems/aws-sdk-core-3.34.0/lib/seahorse/util.rb:9:STRING
/usr/local/bundle/gems/aws-sdk-core-3.34.0/lib/seahorse/client/handler_list_entry.rb:71:HASH
Updating the aws-sdk-sqs version to 1.13 did not fix any issues; those two items still show up as culprits for the leak. Any suggestions or further questions would be greatly appreciated, thanks!
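One low-effort way to corroborate the per-process growth shown in the chart (a sketch, not part of the original setup; the interval is arbitrary and it shells out to ps, so Linux/macOS only) is to log each worker process's RSS from inside the process:
# Logs this process's resident set size once a minute.
# Shoryuken.logger can be swapped for any logger you prefer.
Thread.new do
  loop do
    rss_mb = `ps -o rss= -p #{Process.pid}`.to_i / 1024
    Shoryuken.logger.info("RSS for pid #{Process.pid}: #{rss_mb} MB")
    sleep 60
  end
end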