rectorphp / rector

Instant Upgrades and Automated Refactoring of any PHP 5.3+ code
https://getrector.com
MIT License
8.61k stars 680 forks source link

Parallel not working? #7431

Closed kkmuffme closed 1 year ago

kkmuffme commented 2 years ago

Bug Report

Subject Details
Rector version v0.14

Project/directory with about 30 files (small for testing reasons) Using $rectorConfig->disableParallel(); is not any slower than the default or explicit $rectorConfig->parallel(120, 16, 20);

Is parallel supposed to spawn separate PHP processes? (bc it doesn't as I only see 1 PHP CLI process in top)

TomasVotruba commented 2 years ago

On few files this has no effect. It's like using nitro for 20 meters road :)

What about 1000 files?

What OS do you use?

kkmuffme commented 2 years ago

I'm just used to psalm/phpstan and using threads has a net-positive when using just 2 files :-)

I guess since you added parallel you benchmarked this at some point in time. Is there a specific number of files which is the minimum to see a net positive? How much of a speed up did you see?

What OS do you use?

Rocky Linux 8

TomasVotruba commented 2 years ago

Before release we also got community testing with expected performance improvement. The Github runner provides just 2 cores, so it was only 2x faster. Running Rector on itself the improvement was much better, 16 x faster :)

Not sure what is different on your setup that you don't get these.

TomasVotruba commented 2 years ago

We'll need a reproducible repository here to verify your specific situation.

kkmuffme commented 2 years ago

Benchmarked this now and identified the issues:

For example: (ran this 10 times on identical/unmodified files, this is what I got on one run, but this is a good average of the 10 runs) 16/32 Core, 128GB RAM 30 files PHP 7.4

default:
real    0m57.997s
user    0m53.640s
sys     0m1.407s

$rectorConfig->parallel( 120, 16, 5 );
real    0m34.800s
user    0m51.219s
sys     0m1.258s

$rectorConfig->disableParallel();
real    0m49.216s
user    0m47.602s
sys     0m1.148s

The problems are: 1) the default value of $jobSize is too high for few files. Ideally it is:

$job_size = ceil( "non-skipped files to check" / $maxNumberOfProcess );

2) the default value of $jobSize is too low for many files. Ideally it is:

$job_size = (int) ceil( "non-skipped files to check" / $maxNumberOfProcess );

to avoid creating unnecessary forks, which only add overhead and slow it down.

However it may happen that some files are fixed faster than others, therefore we have some forks being done/idle before others are finished To accomodate for that:

$jobSize = (int) ceil( "non-skipped files to check" / $maxNumberOfProcess / 4 );

(/2 would be too low, as it will lead to a long tail; /3 is possible, /4 is kind-of-optimal*)

Attention: the $timeout must be adjusted for it here!

*for $maxNumberOfProcess = 16 and "non-skipped files to check" >=64

Question: the number of files being processed is not available via $rectorConfig (is it?) and passing it as a dynamic value via CLI is currently not possible either. Is this logic something that could be added to rector itself? Or would you prefer to make the number of non-skipped files available in $rectorConfig? Or if that is not possible either perhaps allow passing the jobSize value via CLI? Or any other way I missed?

TomasVotruba commented 2 years ago

The configuration is quite detailed now, and I want to keep it simpler. How do other parallel projects handle this?

kkmuffme commented 2 years ago

I don't know about "EasyParallel" specifically, as it's the first time explicitly using it, but commonly there is a CLI argument to specify e.g. "--threads" (e.g. psalm) or "--parallel" (e.g. phpcs). However:

As I have never used EasyParallel before, I am not sure how it forks the processes (will the forks run as separate processes on the CPU? do the forks share memory? how this it handle race conditions for file writes to the cache)

e.g. for phpcs you can just do --parallel=$(shell nproc || sysctl -n hw.logicalcpu || echo 16) to run on as many cores your CPU has. Hwoever rector's overhead makes it a bit more efficient to run with a jobSize > 1 (min 2 usually) from what it seems.

For me the issue is solved, as I implemented a custom logic for it for my use case. It's just for most people the current default config is quite often worse than not using parallel at all.

TomasVotruba commented 2 years ago

The EasyParallel is our package that is basically mirroring the PHPStan architecture. I wonder how is this handled there? It would be easiest to just mirror it.

kkmuffme commented 2 years ago

Phpstan for example https://github.com/phpstan/phpstan-src/blob/a97bdead71d24694dfe2bbe96de80486119b27a1/src/Parallel/Scheduler.php#L28-L45

The docs for phpstan parallel can be found here: https://phpstan.org/config-reference#parallel-processing

TomasVotruba commented 1 year ago

Closing as it seems stale. Feel free to provide patch to improve batch couting :+1: