ruby-shoryuken / shoryuken

A super efficient Amazon SQS thread-based message processor for Ruby. This project is in MAINTENANCE MODE. Reach out to me via the Slack link in the description if you want to become a new maintainer.

Memory Leak? #88

Closed sschepens closed 7 years ago

sschepens commented 9 years ago

Memory seems to grow slowly but without limit when using shoryuken with rails environment, a single worker and waiting on an empty queue.

Any ideas what could be causing this behaviour?

phstc commented 9 years ago

hey @sschepens which version of Shoryuken are you using? Old versions of Shoryuken had a problem with message_attributes, but @elsurudo fixed that https://github.com/phstc/shoryuken/commit/d0e69dc54eaebbd0f3ebe1622577857aa14a0b90

sschepens commented 9 years ago

@phstc I'm using the latest version

leemhenson commented 9 years ago

ruby -v?

sschepens commented 9 years ago

ruby 2.2.0p0 (2014-12-25 revision 49005) [x86_64-linux-gnu]

leemhenson commented 9 years ago

What sort of growth are you talking about? Can you give numbers / new relic graphs etc?

sschepens commented 9 years ago

I'll leave a process running for a few hours recording heap metrics and will come back with that info

phstc commented 9 years ago

Any chance it's your worker itself causing the memory leak? Could you try to run Shoryuken with an empty perform?

Some questions for troubleshooting:

sschepens commented 9 years ago

Only one process with 5 concurrency, only one queue with a priority of 2. This behaviour happens on my laptop and on an EC2 instance. The process starts with around 150-160MB of memory according to htop, and on the EC2 instance the process has been running for a day and now consumes 700MB. That instance has 2000MB of RAM and 1 core. On my laptop (16GB RAM and 4 cores), after an hour it now consumes 200MB and continues to grow, but very slowly. I have recorded heap metrics every two minutes on my laptop but can't seem to find something that would indicate the growth of 40MB.

I don't think running with an empty perform will do much, as I'm hanging on an empty queue.

leemhenson commented 9 years ago

What happens if you stick in an occasional GC.start?

sschepens commented 9 years ago

That is exactly what I'm doing now, recording metrics before and after GC.start.
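
For reference, a minimal sketch of that kind of before/after recording (the RSS read assumes Linux and /proc; GC.stat keys vary between Ruby versions):

def heap_snapshot(label)
  # RSS in kilobytes, read from /proc (Linux only)
  rss_kb = File.read("/proc/#{Process.pid}/status")[/VmRSS:\s+(\d+)/, 1]
  counts = ObjectSpace.count_objects
  puts "#{label}: rss=#{rss_kb}KB live_objects=#{counts[:TOTAL] - counts[:FREE]} gc_runs=#{GC.stat[:count]}"
end

heap_snapshot('before GC')
GC.start
heap_snapshot('after GC')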

phstc commented 9 years ago

@sschepens sorry I missed that part:

waiting on an empty queue.

You are right, no point in testing an empty perform.

I will try to run the same test here to see how it goes.

phstc commented 9 years ago

I kept it running for a few hours with default options (checking an empty queue) and the real memory size was consistent around 52MB.

dotenv bin/shoryuken -q default -r ./examples/default_worker.rb -p shoryuken.pid -l shoryuken.log -d

top -pid $(cat shoryuken.pid)

@sschepens are you using the default delay?

My Ruby version is 2.0.0p451. I will bump my Ruby to try it with 2.2.

sschepens commented 9 years ago

Yes, I'm using the default delay. The symptom I'm having is that on weekends (when we don't deploy and we don't do processing) instances die because Shoryuken uses up all the memory. I have two different Shoryuken processes on two separate machines and both die. The things they have in common:

I still have to record some useful heap information to get a clue of what's going on.

phstc commented 9 years ago

hm, I will try to run that with Rails, because running with Ruby 2.2.1 the real memory size was consistent around 30MB (better than with 2.0).

sschepens commented 9 years ago

Ok, I'm using Ruby 2.2.0 and it seems to have a memory leak in Symbol GC (https://bugs.ruby-lang.org/issues/10686). Gonna try updating to 2.2.1 to see if that fixes it.

curtislinden commented 9 years ago

We are also experiencing memory run-out problems

We are running a worker process in docker with a memory cap of 500MB.

We have filled our queue with fake data that can be partially processed and will generate an exception after a few lines of code. This keeps the worker busy and makes it easier to monitor its memory usage behavior.

After 12 hours the process hits the memory cap and is shutdown.

We are running MRI Ruby 2.1.5 with a Rails 4.1 environment and Shoryuken 1.0.1.

Same issue occurs with jruby 9.0.0.0.pre1 and Shoryuken 1.0.2.

Essentially the process grows over time, then hits some upper bound, and that's all she wrote.

phstc commented 9 years ago

I'm using ruby 2.2.0 and it seems to have a memory leak on symbol gc: https://bugs.ruby-lang.org/issues/10686

@curtislinden could you try ruby 2.2.1?

sschepens commented 9 years ago

I still have leaks on Ruby 2.2.1 and 2.2.2 :( I didn't have time to get good traces of where the memory is leaking.

curtislinden commented 9 years ago

Same issues with 2.2.2

I found a similar conversation in celluloid and have some insight to share.

I believe this will affect Shoryuken: in particular, when an exception is caught and the Celluloid actor is killed, it doesn't have its memory cleaned up. It may be true for actors that complete a task successfully as well.


https://github.com/celluloid/celluloid/issues/463#issuecomment-93838990

@curtislinden try to delete actor instance variables in the celluloid finalizer. The "celluloized" object is known to be never collected by GC. You may check a w/a for DCell: niamster/dcell@61d4e0d
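
A minimal sketch of that dcell-style workaround, assuming a Celluloid 0.16 actor (the class and instance variables here are hypothetical):

require 'celluloid'

class LeakyActor
  include Celluloid

  # Celluloid runs this when the actor terminates
  finalizer :release_references

  def initialize(payload)
    @payload = payload
  end

  private

  def release_references
    # Drop every instance variable so the terminated actor shell,
    # even if it is never collected, no longer pins these objects
    instance_variables.each { |ivar| remove_instance_variable(ivar) }
  end
end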

sschepens commented 9 years ago

The problem, I believe, is that Shoryuken doesn't recreate actors; they are long-lived and only recreated if an exception happens (and only workers, I believe).

curtislinden commented 9 years ago

Since Celluloid was written in the spirit of Erlang, I assume that "fail-fast" and "let it die" were good patterns to follow with our workers. In that spirit, when our workers can't process a job, they throw exceptions for various cases. In practice these exceptions are common but temporary.

Since we are throwing exceptions often in our current data set, I suspect, given @sschepens' belief about recreation, that exception handling in Shoryuken may be the place to add cleanup.

sschepens commented 9 years ago

I know I've seen exceptions being swallowed in Shoryuken, for example here: fetcher. Also, the watchdog annotation swallows exceptions and only logs them, and it seems to be used a lot.

I don't know if there are more occurrences of these, and I'm not really sure if those are causing issues. Maybe Shoryuken should just let exceptions kill all actors and use supervise or Celluloid::SupervisionGroup to autostart actors.

sschepens commented 9 years ago

@phstc @curtislinden @leemhenson Just wanted to let you guys know that I created a new project. I took some code from Shoryuken but simplified the architecture. It's not a fork because some things are a complete rewrite. I saw Shoryuken as too complex for what it does or should do (in my opinion :)) and was trying to hunt for possible memory leaks, so I made a few simplifications:

I've also removed some code which i didn't need right now, such as ActiveJob integration.

Anyway, I have been testing and my processes have now been running for a week more or less and I have not had any memory issues. The project is not documented at all right now and has no tests, but I just wanted to let you know that it exists. Please tell me if you have any issue with me taking code from Shoryuken, and if you like it, maybe we can work on merging the two.

Cheers! Sebastian

phstc commented 9 years ago

I know I've seen exceptions being swallowed in Shoryuken, for example here: fetcher. Also, the watchdog annotation swallows exceptions and only logs them, and it seems to be used a lot.

I don't know if there are more occurrences of these, and I'm not really sure if those are causing issues. Maybe Shoryuken should just let exceptions kill all actors and use supervise or Celluloid::SupervisionGroup to autostart actors.

Fetcher exceptions should be rare. And if a processor dies, it will be handled by manager/trap_exit.

I totally believe you and I'm sure this issue is happening for you. But I couldn't reproduce it in my production and development environments.

Please tell me if you have any issues with me taking code from shoryuken, and if you like it maybe we can work on merging both of them.

:+1: for the initiative! Feel free to copy any code you need! I did the same with Sidekiq. Remember to keep the LGPLv3 License.

@sschepens @curtislinden as you can reproduce the issue, could you try :pray:

# https://github.com/phstc/shoryuken/blob/master/lib/shoryuken/manager.rb#L78-79
@threads.delete(processor.object_id)
@busy.delete processor
remove_instance_variable processor

???

Following https://github.com/niamster/dcell/commit/61d4e0d919ae702df22dc9472aec57f5d7aa32ee?

curtislinden commented 9 years ago

Hey! Yeah, we can try this.


phstc commented 9 years ago

hey @curtislinden did you have a chance to try that?

curtislinden commented 9 years ago

Hey - super interested in trying this out - just have to get some time set aside for it. I'll see if I can get to it today.


curtislinden commented 9 years ago

Hi - I tried it out, and it appears that processor isn't an instance variable, and remove_instance_variable is not an expected method... I figured this might be because of the 'poetic mode', but it doesn't make sense to call remove_instance_variable(:@processor) either, as processor is a local var.

2015-05-08T20:42:57.461708742Z 20:42:57 worker.1 | 2015-05-08T20:42:57Z 14 TID-otkh73en8 ERROR: /usr/local/bundle/gems/celluloid-0.16.0/lib/celluloid/proxies/sync_proxy.rb:23:in `method_missing'
2015-05-08T20:42:57.461798959Z 20:42:57 worker.1 | /usr/local/bundle/gems/celluloid-0.16.0/lib/celluloid/proxies/sync_proxy.rb:18:in `respond_to?'
2015-05-08T20:42:57.462189657Z 20:42:57 worker.1 | /usr/local/bundle/bundler/gems/shoryuken-49574533fa26/lib/shoryuken/manager.rb:80:in `remove_instance_variable'

phstc commented 9 years ago

Good point! The processors are elements inside the arrays @ready and @busy (which are instance variables).

The processors flow:

Based on what I understood from this thread, the processors might be kept in memory when they die, increasing the memory usage. The remove_instance_variable was an attempt to make sure that we remove them from memory when they die. But we probably need to find a different solution for that.
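
As a rough illustration of the idea (not the actual manager code; trap_exit and its callback are Celluloid's linking API, the other names are borrowed from this thread):

class Manager
  include Celluloid
  # Celluloid notifies us when a linked actor (a processor) crashes
  trap_exit :processor_died

  def processor_died(processor, reason)
    # Drop every reference to the dead actor so it can be collected...
    @busy.delete(processor)
    @ready.delete(processor)
    # ...and replace it with a fresh actor instead of keeping the dead shell around
    @ready << build_processor
  end
end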

curtislinden commented 9 years ago

Interesting.. I think the hypothesis is indeed that the processors might be kept in memory after they die.

This is an interesting problem.


esebastian commented 9 years ago

Same problem here. I've tried both nilling the processor and reusing it once deleted from @busy, but without success.

What I've noticed is that when the processor is reused without building a new one with build_processor, no new messages are processed once the number of messages matches the concurrency configuration (i.e. all processors have been used once), and memory stays stable.

I know this doesn't rule out a problem on the processors' side (indeed it could be new evidence), but could there be a problem in the message parsing and digesting part?

phstc commented 9 years ago

I couldn't reproduce it locally, but based on what I understood it happens when the workers are raising exceptions, right? Do you know what kind of exceptions your workers are raising? Then I can try to reproduce the same here.

Does Shoryuken show a reason in the logs?

What I've noticed is that when the processor is reused without building a new one with build_processor, no new messages are processed once the number of messages matches the concurrency configuration (i.e. all processors have been used once), and memory stays stable.

Actually, it stops fetching messages only if there isn't any ready processor; if there's at least one, it will try to fetch a message for it.

I know this doesn't rule out a problem on the processors' side (indeed it could be new evidence), but could there be a problem in the message parsing and digesting part?

The way Shoryuken manages dead processors is the same way Sidekiq does, so either both have the same problem or the problem is somewhere else in Shoryuken. It could also be the aws-sdk, but that's strange as I couldn't reproduce it here. Could you create a sample project that causes this problem?

esebastian commented 9 years ago

On the exceptions part: yes, workers are raising exceptions in a steady manner, but custom ones.

Perhaps I'm doing it the wrong way, but what I do is raise an exception for every unprocessable message in order to automatically have it reprocessed later. I mean, if I can't process a message at a particular point in time, I raise an exception so that the message is enqueued again and processed later, repeating until the processing can be carried out.

Is there another approach I should be taking?

I'll try set up that sample project for you.

sschepens commented 9 years ago

@esebastian Raising exceptions as a method of flow control is not good practice in most programming languages, and it's slower. It's always recommended not to raise exceptions when it's not necessary. With Shoryuken you can disable automatic message ack and do it yourself when you're done processing a message.

curtislinden commented 9 years ago

@sschepens : I think there is a difference between designing with exceptions in mind : http://www.c2.com/cgi/wiki?DesigningWithExceptions - and using it for control flow - http://www.c2.com/cgi/wiki?DontUseExceptionsForFlowControl

In this case of queue processing, there is an issue with the data, so an exceptional condition is raised and the contract cannot be fulfilled; what is akin to a transaction is used, and the message is not removed from the queue.

This seems like an appropriate design for exceptions, given that the author chose to use Celluloid, following the 'let it fail' pattern from Erlang.


phstc commented 9 years ago

But maybe try it as a test, just to make sure that processor_died is the cause of these memory problems.

@esebastian Could you try:

class MyWorker
  include Shoryuken::Worker

  shoryuken_options queue: 'my_queue', auto_delete: false

  def perform(sqs_msg, body)    
    # do something that might raise an exception
    sqs_msg.delete
  rescue
    # if it fails and you don't "ack" `sqs_msg.delete` 
    # the message will return to the queue after its visibility timeout expiration
  end
end

sschepens commented 9 years ago

@curtislinden I think he wasn't saying there's something wrong with the data; he said it just wasn't ready to be processed yet. That's sort of flow control: saying "hey, this isn't ready yet, let's just throw an exception so my automatic ack is skipped", especially when that's the only thing you're aiming for. It could easily be replaced by disabling automatic ack and returning immediately from the perform method whenever something is not right. I like designing with exceptions in mind, I mostly do, but I don't think this is a nice use case. Talking about Celluloid, I think it's even more important to try to raise fewer exceptions; Erlang is designed for this, Ruby and Celluloid are not. Each Celluloid actor has a thread of its own, and there is a lot of overhead in creating/destroying them; this should definitely be considered.
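
A minimal sketch of that alternative, with auto-delete disabled and an early return instead of an exception (ready_to_process? and process are hypothetical placeholders):

class MyWorker
  include Shoryuken::Worker

  shoryuken_options queue: 'my_queue', auto_delete: false

  def perform(sqs_msg, body)
    # Not ready yet? Just return without deleting; the message becomes
    # visible again after its visibility timeout, no exception needed
    return unless ready_to_process?(body)

    process(body)
    sqs_msg.delete
  end
end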

esebastian commented 9 years ago

@phstc: I've tried as you asked and the behavior is the same; it keeps leaking memory. Do you still need that sample project?

@sschepens: I think there's a case for anything, but I see your point. I couldn't have expressed mine better than @curtislinden but I will give yours a try. Thanks for the advice.

esebastian commented 9 years ago

Any update on this? Can I be of any help?

phstc commented 9 years ago

Do you still need that sample project?

@esebastian could you create a sample project that causes this memory leak problem? So I can reproduce it locally?

Thanks

esebastian commented 9 years ago

@phstc Ok, I'll get into it

phstc commented 9 years ago

@esebastian tks a lot :beers:

phstc commented 9 years ago

@esebastian could you try the master version again? @gshutler made a change that might help with the memory leak problem.

esebastian commented 9 years ago

@phstc I'm afraid the problem persists. I've had the master version running since last Sunday, with different load configurations, and the leak is still there :(

esebastian commented 9 years ago

@phstc I'll post a link for the sample project as soon as possible

phstc commented 9 years ago

@esebastian thanks. that will help me to debug it.

gshutler commented 9 years ago

I'm starting to suspect we're seeing this too. Going to attempt to find the smallest case that reproduces it. Running Ruby v2.2.2

screen shot 2015-06-19 at 16 09 30

Guess where Shoryuken gets restarted by a new deployment :wink:

gshutler commented 9 years ago

I think I've got something. It's confusing as hell and has taken me on a long, winding road, but I think I've got something.

I'd done a general skim over the code before due to this issue being open and nothing stood out, so it was time to get scientific.

I added this into Manager#dispatch to get some high level figures to track:

@iterations ||= 0
@iterations += 1

if @iterations % 5000 == 0
  ObjectSpace.garbage_collect
  logger.warn "#{@iterations},#{Time.now.getutc},#{`ps -o rss -p #{$$}`.strip.split.last.to_i * 1024}"
end

I'd also output ObjectSpace.count_objects before and after GC at various times during my search; the 5000 was also increased or decreased along the way.

I started looking at polling an empty queue; regardless of the number of workers, this should mean only the manager and fetcher actors are doing anything. Also, this is the only section of Shoryuken that differs significantly from Sidekiq and so was my prime suspect. The growth was still observed, so I tried to isolate it further: I changed Fetcher#receive_messages to do everything but the request and instead just return an empty array to simulate an empty queue. This sped testing up and I was getting similar-looking memory growth per iteration, happy days.
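
The stub for that experiment presumably looked something like this (a hypothetical stand-in, not the actual patch):

# inside Shoryuken::Fetcher, for the experiment only
def receive_messages(queue, limit)
  # skip the real SQS ReceiveMessage call and behave like an empty queue
  []
end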

ObjectSpace.count_objects wasn't flagging anything weird to me: the number of objects wasn't increasing out of control, if at all; the GC seemed to always drag things back to the same level, but the memory used by the process would continue to increase.

I tried many things that seemed like obvious targets and removed them from the loop: for example, not calling Shoryuken.worker_registry.batch_receive_messages?, or not doing anything at all within Fetcher#receive_messages other than returning an empty array, but nothing seemed to make a difference.

Eventually I realised that Fetcher#fetch wouldn't really be doing anything other than calling @manager.async.dispatch to complete the loop, so I started moving the @manager.async.dispatch and return down within Fetcher#fetch line by line.

I found that introducing this line made the memory growth return:

logger.debug "Looking for new messages in '#{queue}'"

I didn't believe an innocent-looking log message could cause any harm, so I repeated the test, but its introduction definitely introduced the problem. Eventually, I removed all the logging from Fetcher and it did indeed look like the problem had mostly disappeared. Obviously we wouldn't want to lose the logging, so I put it back and tried attacking the logging code directly. However, nothing I tried removed the problem.

Eventually I tried converting the logging to a block call style everywhere, as this style is used in some places but not all. For example:

logger.debug { "Looking for new messages in '#{queue}'" }

Now this seemed to all but eliminate the growth as this chart shows:

memory-growth

I don't understand this, if I'm honest. However, the experiments I've run have me convinced this is the source of the majority of the problem. I wondered if Rails is somehow messing with the logger so that it's buffering the log messages or something along those lines, but I tried various Rails log-flushing things and it didn't work.

So there we go, I think it's caused by the logging messages but I have no idea why. I'm going to do a release with a custom branch to see how it behaves in production and will report back.
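
For what it's worth, Ruby's standard-library Logger only evaluates the block when the severity is enabled, so the block form skips building the interpolated string entirely while DEBUG is filtered out; a minimal illustration (queue name assumed):

require 'logger'

logger = Logger.new($stdout)
logger.level = Logger::INFO # DEBUG messages are filtered out
queue = 'default'

# Eager form: the string is interpolated on every call, then discarded
logger.debug "Looking for new messages in '#{queue}'"

# Block form: the block is never called while DEBUG is disabled,
# so no string is allocated here
logger.debug { "Looking for new messages in '#{queue}'" }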

curtislinden commented 9 years ago

This is a very interesting discovery and effective analysis and graph. What did you use to make the chart?


gshutler commented 9 years ago

The log messages I added are comma-separated values; I just dropped them into Excel to make the chart.