phpro / grumphp

A PHP code-quality tool
MIT License

Memory is not being released in parallel execution #1101

Closed · zoilomora closed this 2 months ago

zoilomora commented 1 year ago
| Q | A |
| --- | --- |
| Version | 2.0.0 |
| Bug? | yes |
| New feature? | no |
| Question? | yes |
| Documentation? | no |
| Related tickets | ~ |

When executing tasks with parallel.enabled: true, memory is not being released and it ends up exceeding the limit configured in PHP (memory_limit).

My configuration

grumphp:
  process_timeout: 120
  ascii:
    failed:
      - config/hooks/ko.txt
    succeeded:
      - config/hooks/ok.txt
  parallel:
    enabled: true
    max_workers: 32
  tasks:
    composer:
      strict: true
    jsonlint: ~
    phpcpd:
      exclude:
        - 'var'
        - 'vendor'
        - 'tests'
      min_lines: 60
    phpcs:
      standard:
        - 'phpcs.xml.dist'
      whitelist_patterns:
        - '/^src\/(.*)/'
        - '/^tests\/(.*)/'
      encoding: 'UTF-8'
    phplint: ~
    phpstan_shell:
      metadata:
        label: phpstan
        task: shell
      scripts:
        - ["-c", "phpstan analyse -l 9 src"]
    phpunit: ~
    behat:
      config: ~
      format: progress
      stop_on_failure: true
    phpversion:
      project: '8.2'
    securitychecker_local:
      lockfile: ./composer.lock
      format: ~

Steps to reproduce: at the end of the vendor/bin/grumphp file, add the following to check memory usage:

// Print how much memory the main grumphp process is using, in MB, then stop.
$memory = memory_get_usage() / 1024 / 1024;
print_r(round($memory, 3) . ' MB' . PHP_EOL);
exit();

Run ./vendor/bin/grumphp run once with each of these options:

Result:

parallel: false
Used Memory: 32.553 MB

parallel: true
Used Memory: 215.642 MB

When the different tasks are finished, shouldn't the memory be released?

Is this the desired behavior?

veewee commented 1 year ago

Running grumphp in parallel mode opens a separate process for every task you start. There is communication between the main process and each of those worker processes, and that is probably what is taking up the additional MBs of memory. I'm not sure if that memory needs to be freed manually.

However, grumphp is just a tool that finishes at some point, and at that moment the memory gets freed anyway. Therefore I am not sure whether this really is an issue.

So what do you think about this? Is this really a problem or is the problem rather that you need to increase PHP's memory limit in order to get grumphp running on your project?
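
For reference, the limit can be raised for a single run using PHP's standard -d switch (the value shown here is just an example):

php -d memory_limit=512M ./vendor/bin/grumphp run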

zoilomora commented 1 year ago

I understand that if they are separate PHP processes, each process should have its own memory limit.

I try to keep the same memory limits locally as in production.

If each separate process does not take up more than 32 MB, it seems strange to me that running all the tasks across different processes ends up using more than 215 MB.

The current limit is 256 MB, and if I include more files that limit is exceeded. However, memory only runs out when there is just one task left to finish.

I understand the desired behavior would be to release that memory as tasks finish?

ashokadewit commented 4 months ago

I think the memory goes to the serialized task results. A task result contains the context, which contains the file collection. When not running in parallel, this object is passed by reference, but when running in parallel it is serialized for each result. If the number of files is large (5000 files in my case) and there are many tasks (20 in my case), GrumPHP will run out of memory. I solved it by registering a middleware to replace the file collections with an empty object. I'm not sure if this file collection is used in any way after a task has completed.
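
A minimal standalone sketch of that effect (plain PHP with hypothetical file names and counts mirroring the case above, not actual GrumPHP code):

<?php
// Each parallel task result carries its own deserialized copy of the run context,
// so memory in the main process grows roughly with (tasks) x (size of the file list).
$files = array_map(fn (int $i): string => "src/File{$i}.php", range(1, 5000));

$results = [];
for ($task = 0; $task < 20; $task++) {
    // Simulates a worker sending back a serialized result: unserialize()
    // materialises a fresh copy of the whole file list for every task.
    $results[] = unserialize(serialize(['task' => $task, 'files' => $files]));
}

echo round(memory_get_usage() / 1024 / 1024, 1) . ' MB' . PHP_EOL;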

veewee commented 4 months ago

> I solved it by registering a middleware to replace the file collections with an empty object.

Can you share your solution?

> I'm not sure if this file collection is used in any way after a task has completed.

Currently not in this repository. However it's an official extension point, so one might be using that as a feature.

What I'm wondering is: Once the task has been executed in a separate worker, the serialized version is not being used anymore, meaning that it should be garbage collected at that point. So I assume the problem is that the context in the result is the serialized worker context instead of the initial process' context. So it might make sense to swap it back to the original reference, after which garbage collection kicks in?
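
A rough illustration of that idea in plain PHP (hypothetical data, not the actual GrumPHP objects):

<?php
// Once the copy that came back from the worker is replaced by the parent's original
// context, the copy has no remaining references and can be garbage collected.
$original   = ['files' => range(1, 200_000)];     // context held by the main process
$fromWorker = unserialize(serialize($original));  // deserialized copy from the worker

$result = ['context' => $fromWorker];             // the task result pins the copy

$result['context'] = $original;                   // swap back to the shared original
unset($fromWorker);                               // last reference to the copy is gone

echo round(memory_get_usage() / 1024 / 1024, 1) . ' MB' . PHP_EOL;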

ashokadewit commented 4 months ago

> Can you share your solution?

Sure:

// Imports added for completeness; class/interface namespaces assume GrumPHP 1.x.
use Amp\Promise;
use GrumPHP\Collection\FilesCollection;
use GrumPHP\Runner\TaskHandler\Middleware\TaskHandlerMiddlewareInterface;
use GrumPHP\Runner\TaskResult;
use GrumPHP\Runner\TaskRunnerContext;
use GrumPHP\Task\Context\RunContext;
use GrumPHP\Task\TaskInterface;

class UnsetFilesMiddleware implements TaskHandlerMiddlewareInterface
{
    /**
     * Unset files from task results.
     *
     * @param TaskInterface     $task
     * @param TaskRunnerContext $runnerContext
     * @param callable          $next
     *
     * @return Promise
     */
    public function handle(TaskInterface $task, TaskRunnerContext $runnerContext, callable $next): Promise
    {
        $result = $next($task, $runnerContext);
        if ($result instanceof Promise) {
            $result->onResolve(
                function ($exception, $value): void {
                    if ($value instanceof TaskResult) {
                        // Swap the heavy run context for one with an empty file
                        // collection so the original can be garbage collected.
                        $property = new ReflectionProperty($value, 'context');
                        $property->setAccessible(true);
                        $property->setValue($value, new RunContext(new FilesCollection([])));
                    }
                }
            );
        }

        return $result;
    }
}

And then registered in grumphp.yml with:

  My\UnsetFilesMiddleware:
    tags:
      - name: grumphp.task_handler
        priority: 500

I'm still using version 1.5.1 of GrumPHP, not sure if this is also compatible with the newest version.

veewee commented 4 months ago

Can you verify swapping the "serialized" context coming back from the worker with the original context also does the trick?

-                function ($exception, $value): void {
+                function ($exception, $value) use ($runnerContext): void {
                     if ($value instanceof TaskResult) {
                         $property = new ReflectionProperty($value, 'context');
                         $property->setAccessible(true);
-                        $property->setValue($value, new RunContext(new FilesCollection([])));
+                        $property->setValue($value, $runnerContext);

> I'm still using version 1.5.1 of GrumPHP, not sure if this is also compatible with the newest version.

In 2.0 the async execution system changed but it is still using the context coming back from the worker. So I assume it will have similar issues.

ashokadewit commented 4 months ago

> Can you verify swapping the "serialized" context coming back from the worker with the original context also does the trick?

It does :)

veewee commented 3 months ago

@ashokadewit Can you confirm the fix in #1147 would resolve the issue?