sh-ogawa opened this issue 6 years ago
Since this is a problem in a specific driver, it would be better if we could solve it at the driver itself. Perhaps at the class Laravel uses to interact with the SDK.
That is true, but as the driver is currently implemented there is no post-processing hook, so the fix cannot go in there.
Would it work to call gc_collect_cycles() in the Looping event?
I was thinking that it could be something done right at the `pop` method of SqsQueue (https://github.com/laravel/framework/blob/5.7/src/Illuminate/Queue/SqsQueue.php#L117).
The reasoning is that since it's a SqsQueue problem, it might be best to have a SqsQueue solution instead of affecting all drivers. Of course, if it doesn't cause harm, we might as well do it in the Worker loop.
Ideally, we'd do it right after, which means we would add a `worked` method to the Job contract and have it called after working a job. The SqsQueue implementation could do the clean-up. However, given the lack of need for a `worked` method until now, we might want to hold off on that approach until it's more needed. Hence the proposal for `pop`: right before getting a new job, let's clear (gc_collect_cycles) any residue from previous jobs.
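A minimal sketch of that placement, using a hypothetical stand-in class rather than Laravel's actual SqsQueue (the real change would override `pop` in a subclass of `Illuminate\Queue\SqsQueue`):

```php
<?php
// Illustrative sketch only: FakeQueue stands in for Laravel's SqsQueue.
// The point is the placement: collect cycles right before popping the
// next job, so residue from previous jobs is cleared first.
class FakeQueue
{
    private array $jobs = ['job-1', 'job-2'];

    public function pop(): ?string
    {
        gc_collect_cycles(); // clear cyclic garbage left by earlier jobs
        return array_shift($this->jobs); // then fetch the next job
    }
}

$queue = new FakeQueue();
echo $queue->pop(), "\n"; // job-1
echo $queue->pop(), "\n"; // job-2
```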
Keeping it in SqsQueue would make sense in the context of focusing on this single problem. Putting it in the Worker makes sense if you consider that it's the Worker's responsibility to shut down when memory usage exceeds what is set by --memory. I was slightly surprised that the Worker didn't call gc_collect_cycles before shutting down to see if it could free memory.
If we call gc_collect_cycles() before the SQS pop, and similar problems show up in other drivers in the future, will we keep applying the same kind of driver-specific fix each time?
@sisve That sounds good. It certainly makes sense to release memory once before checking whether to shut down the queue worker when used memory exceeds the limit. Laravel users will (probably) not welcome queue workers shutting down due to memory overruns.
This still hasn't been updated in a recent release has it?
Is it possible just to run this function from a cronjob to resolve the memory issue?
Are we able to insert gc_collect_cycles() into our own codebase, into each command, to easily resolve this? What's the easiest way for someone to implement this fix without having to wait for Laravel or other packages to update, or trying to host our own customized packages outside our main application files?
> Are we able to insert gc_collect_cycles() into our own codebase, into each command to easily resolve this?
If you are not using horizon, I think you can extend \Illuminate\Queue\Console\WorkCommand (i.e. create your own version of the command) whose __construct receives your Worker class, which extends \Illuminate\Queue\Worker and overrides the daemon method to add gc_collect_cycles().
If you're using horizon, I'm not sure. Horizon itself employs this very same technique: it extends \Illuminate\Queue\Console\WorkCommand as \Laravel\Horizon\Console\WorkCommand, and you would somehow need to find a way to teach horizon to invoke not horizon:work but your own version of the command.
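The non-horizon route described above could be sketched roughly like this, with `BaseWorker` standing in for \Illuminate\Queue\Worker (the real class needs the framework container, so this is only an illustration of where the override goes):

```php
<?php
// Sketch only: BaseWorker stands in for \Illuminate\Queue\Worker.
// In a real app you would extend the framework class, override the
// per-iteration hook, and bind your subclass so the work command
// receives it instead of the default worker.
class BaseWorker
{
    public int $iterations = 0;

    public function daemon(string $connection, string $queue): void
    {
        // Laravel loops here until the worker is stopped;
        // we run a fixed number of iterations for the demo.
        for ($i = 0; $i < 3; $i++) {
            $this->runNextJob($connection, $queue);
        }
    }

    protected function runNextJob(string $connection, string $queue): void
    {
        $this->iterations++; // the real worker pops and processes a job
    }
}

class GcWorker extends BaseWorker
{
    protected function runNextJob(string $connection, string $queue): void
    {
        gc_collect_cycles(); // free cyclic garbage between jobs
        parent::runNextJob($connection, $queue);
    }
}

$worker = new GcWorker();
$worker->daemon('sqs', 'default');
echo $worker->iterations, "\n"; // 3
```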
I think that Laravel's worker could benefit from calling gc_collect_cycles(), but one would first have to prove it has no ill effects. For example, performance might suffer when a worker runs many small/fast jobs and thus calls gc_collect_cycles() often, and redundantly. Maybe it should be tunable (only trigger once in a while) or possible to turn on/off. But asking these questions already shows that more preparation is needed to land this properly.
I was struggling to put my opinion together. In my view, real-time behavior has not been required of queue jobs so far, and our team does not need it; if you really need real-time behavior, use another mechanism. My problem is that the queue worker keeps consuming more memory just by checking for jobs. This can also affect other programs running on the same server. And on EC2 this problem can cause frequent disk I/O, which may cost CPU credits. Isn't that a clear disadvantage for users?
@sh-ogawa Did you try calling gc_collect_cycles() in a custom listener that listens for the Looping event? It should work as a workaround for your application until a suitable framework solution is implemented.
@sisve Do you mean inside our queue job? If so, that cannot be done. The SQS queue worker's memory grows simply from checking for messages in SQS; if there is no message, our job code is never invoked, so there is no point at which we could call gc_collect_cycles().
You're probably thinking of the JobProcessing/JobProcessed/JobFailed events, which are tied to a specific job and wouldn't fire on an empty queue. The Looping event is fired every time the worker checks for an available job (and either executes it, or sleeps if none was available). I use dedicated listener classes for my logic, but you can just throw this into your favorite service provider to execute gc_collect_cycles() between every job. It's probably not appropriate at the framework level, but you can probably use it as a workaround in your code base until Laravel has a better fix:
```php
Queue::looping(function () {
    gc_collect_cycles();
});
```
@sisve Thank you!! I added that to my AppServiceProvider and confirmed that memory no longer increases. I'll use this code as a temporary workaround.
There was an update at https://github.com/aws/aws-sdk-php/issues/1667#issuecomment-554228924 which points to an interesting thing:
The short of it being something with cURL SSL verification being leaky.
From https://forums.aws.amazon.com/message.jspa?messageID=695298 (linked in above comment):
> memory_get_usage() does NOT report the leak; it shows steady usage. Using top and ps aux --sort -vsz | head from the CLI shows the VSize and RSize memory increasing consistently until the entire instance goes down.
>
> I think the bug is either in libcurl or in the SDK's usage of curl, regarding SSL peer verification. It does not matter which service we were contacting; S3, SQS, DynamoDB, and others all exhibit the same behavior in a long-running process.
>
> Turning off peer verification as in the following code fixed the leak:
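The forum post's code is not reproduced here; based on the aws-sdk-php v3 client options, disabling peer verification would look roughly like this (the region/version values are illustrative, and turning off certificate checks should only be done to confirm the leak, never in production):

```php
<?php
// Sketch based on aws-sdk-php v3 client options; requires the AWS SDK.
// WARNING: 'verify' => false disables SSL certificate validation and
// is only useful here as a diagnostic for the suspected curl/SSL leak.
use Aws\Sqs\SqsClient;

$client = new SqsClient([
    'region'  => 'us-east-1',   // illustrative value
    'version' => 'latest',
    'http'    => [
        'verify' => false,      // skip SSL peer verification
    ],
]);
```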
This behavior sounds similar to #1645 where we found that the SDK (or one of its dependencies) does appear to have cyclic references that have to be cleaned up by PHP's garbage collector, however garbage collection does occur naturally after a certain amount of references have built up.
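The cyclic-reference behavior described here is easy to reproduce in plain PHP, independent of the SDK (stand-alone demonstration):

```php
<?php
// Stand-alone demonstration: two objects referencing each other form a
// cycle that reference counting alone cannot free. PHP's cycle
// collector reclaims them, either automatically once enough possible
// roots accumulate, or on demand via gc_collect_cycles().
class Node
{
    public ?Node $other = null;
}

for ($i = 0; $i < 1000; $i++) {
    $a = new Node();
    $b = new Node();
    $a->other = $b; // $a and $b now reference each other: a cycle
    $b->other = $a;
    // Both variables are overwritten next iteration, but the cycle
    // keeps each object's refcount above zero, so nothing is freed yet.
}

$collected = gc_collect_cycles(); // force collection of the cycles
echo $collected > 0 ? "cycles collected\n" : "nothing to collect\n";
```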
This problem seems to be due to the SqsClient's promise implementation, which leaves references behind. However, it can be worked around with gc_collect_cycles(). The easiest place to insert the call is Illuminate\Queue\Worker::daemon(), but I do not know whether it belongs there, so I raised this issue. We will share this exchange on the aws-sdk-php issue as well, so could you please proceed with fixing this problem?
https://github.com/aws/aws-sdk-php/issues/1667#issuecomment-436079130