rectorphp / rector

Instant Upgrades and Automated Refactoring of any PHP 5.3+ code
https://getrector.com
MIT License
8.59k stars 680 forks

Global cache for huge projects #8724

Closed MauroVinicius closed 1 month ago

MauroVinicius commented 2 months ago

Feature Request

I work on a giant system with more than 5,000 PHP files. Rector works perfectly on all of them, but there is a catch: the initial analysis is slow, taking about 4 minutes or more, and whenever we need to add or change a rule to test something, it takes the same time all over again.

To solve this, I believe a new kind of cache could be added: a single cache file for the entire project. Instead of Rector reading 5,000 small files one by one, it could read a single large file that contains the contents of all 5,000 files, which should speed up the initial analysis. Using the filemtime() function, it would be possible to detect when a file has been modified and update its entry inside this large cache file. The cache would be a JSON file whose keys are file paths and whose values hold the file content plus the file's modification time. I applied this solution to a similar problem here and it greatly accelerated the initial processing.
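
A minimal sketch of what such a content cache could look like (the class name and cache layout below are hypothetical illustrations of the idea, not Rector's existing cache API):

    <?php

    // Hypothetical sketch of the proposed single-file content cache;
    // the class name and JSON layout are illustrative, not part of Rector.
    final class ProjectContentCache
    {
        /** @var array<string, array{mtime: int, content: string}> */
        private array $entries = [];

        public function __construct(private string $cacheFile)
        {
            if (is_file($this->cacheFile)) {
                $this->entries = json_decode((string) file_get_contents($this->cacheFile), true) ?: [];
            }
        }

        public function getContent(string $filePath): string
        {
            $mtime = (int) filemtime($filePath);
            $entry = $this->entries[$filePath] ?? null;

            // reuse the cached content as long as the file has not been modified since
            if ($entry !== null && $entry['mtime'] === $mtime) {
                return $entry['content'];
            }

            $content = (string) file_get_contents($filePath);
            $this->entries[$filePath] = ['mtime' => $mtime, 'content' => $content];

            return $content;
        }

        public function persist(): void
        {
            file_put_contents($this->cacheFile, json_encode($this->entries));
        }
    }

The idea is that this one file would be loaded at startup and persisted at the end of the run, instead of re-reading every source file from scratch on every run.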

The same could be done with the analysis: the results could be saved in another large cache file and accumulate gradually as new analyses are carried out. This would take up much more disk space, but it would make Rector much faster in all scenarios.

TomasVotruba commented 2 months ago

Caching on a parallel run is quite complex. This would require an actually working prototype to see how it performs.

Are you up to the challenge? :)

MauroVinicius commented 2 months ago

Caching on a parallel run is quite complex. This would require an actually working prototype to see how it performs.

Are you up to the challenge? :)

@TomasVotruba Now that you've said it, I can accept the challenge. I will need this feature anyway, and I can use the project I have here to do all the speed tests.

I can't promise a deadline at this point, but if you explain to me on a more technical level how the cache works today, I'll try some ideas here.

I have 8 years of experience with PHP, MySQL, Git, HTML, CSS, JavaScript and TypeScript, and I am the lead programmer here at the company, so I believe I am capable of meeting this challenge.

staabm commented 2 months ago

At first we would need a repro of the problem at hand, so we can profile the actual bottleneck.

Adding caching without evidence should be prevented at all cost

staabm commented 1 month ago

btw: @MauroVinicius maybe you are running Rector on a file which is already slow in PHPStan (since Rector uses PHPStan under the hood).

it might be helpful to look for outstandingly slow files in the PHPStan analysis. See https://phpstan.org/blog/debugging-performance-identify-slow-files

MauroVinicius commented 1 month ago

@staabm That's good to know. I'm going to run the tests with PHPStan here on my PC, creating a modified version of Rector; if I'm successful, I'll let you know here.

MauroVinicius commented 1 month ago

@staabm I've already discovered the first step to speeding up Rector's analysis.

I noticed that any change made to the configuration file ends up re-analyzing all the files through PHPStan, and that really is a little slow, but I found a way to apply a specific cache.

In the Rector\Application\FileProcessor class, any change to the settings via Rector\Config\RectorConfig triggers the parseFileNodes() function again to obtain the tokens: if I add a new rule it goes through here, if I remove a rule it goes through here again, and so on.

The part that takes time is the call to $this->nodeScopeAndMetadataDecorator->decorateNodesFromFile($file->getFilePath(), $oldStmts), which can take a second or more depending on the case, but there is no need to repeat this analysis every single time.

Whenever the content of the file is the same, the result can be saved in a cache file named by its md5 hash; this way we avoid repeating the analysis every time a configuration changes.

Below is some simple test code I wrote. I have no experience with PHPStan, so I couldn't deserialize correctly from JSON, but I imagine it's possible to do this somehow.

    private function parseFileNodes(File $file): void
    {
        // store tokens by original file content, so we don't have to print them right now
        $stmtsAndTokens = $this->rectorParser->parseFileContentToStmtsAndTokens($file->getOriginalFileContent());
        $oldStmts = $stmtsAndTokens->getStmts();
        $oldTokens = $stmtsAndTokens->getTokens();

        // PATH TO SAVE RESULTS CACHE BASED ON FILE CONTENT
        $cachePath = __DIR__ . '/cache/' . md5_file($file->getFilePath()) . '.json';
        if (!file_exists(__DIR__ . '/cache')) {
            mkdir(__DIR__ . '/cache', 0775, true); // permissions must be octal (0775), not decimal 775
        }

        // IF IT HAS ALREADY BEEN ANALYZED BEFORE YOU JUST GET THE TOKENS IN THE CACHE
        if (file_exists($cachePath)) {

            // NOTE: I WASN'T ABLE TO DESERIALIZE CORRECTLY BUT THERE MUST BE A WAY TO DO THIS
            $newStmts = (new \PhpParser\JsonDecoder)->decode(file_get_contents($cachePath));

        } else {

            // OTHERWISE, PERFORM THE NORMAL ANALYSIS AND SAVE IT IN CACHE FOR NEXT TIME
            $newStmts = $this->nodeScopeAndMetadataDecorator->decorateNodesFromFile($file->getFilePath(), $oldStmts);
            file_put_contents(
                $cachePath,
                json_encode(array_map(function ($node) {
                    return $node->jsonSerialize();
                }, $newStmts))
            );
        }

        $file->hydrateStmtsAndTokens($newStmts, $oldStmts, $oldTokens);
    }
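
For what it's worth, the JSON round trip itself does work for plain php-parser nodes; a minimal sketch (assuming nikic/php-parser 5 and its JsonDecoder, outside of Rector):

    <?php

    // Minimal sketch of the JSON round trip for plain php-parser nodes.
    // Assumes nikic/php-parser 5 installed via Composer; this does not cover
    // the PHPStan Scope attributes that decorateNodesFromFile() adds.

    use PhpParser\JsonDecoder;
    use PhpParser\ParserFactory;

    require __DIR__ . '/vendor/autoload.php';

    $parser = (new ParserFactory())->createForHostVersion();
    $stmts = $parser->parse('<?php echo "hello";');

    // nodes implement JsonSerializable, so the whole AST can be encoded directly
    $json = json_encode($stmts);

    // ... and decoded back into Node objects
    $decodedStmts = (new JsonDecoder())->decode($json);

The harder part is likely the Scope objects that decorateNodesFromFile() attaches as node attributes via PHPStan; those do not survive a plain JSON round trip, which would explain the failed decode.
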
staabm commented 1 month ago

Without a minimal reproducer of the slow analysis there is nothing we can do.

TomasVotruba commented 1 month ago

Closing for the reasons mentioned by @staabm.

I think we now have a good balance between caching and stability :+1: