qossmic / deptrac

Keep your architecture clean.
https://qossmic.github.io/deptrac
MIT License

Improve AST parser performance #1158

Open patrickkusebauch opened 1 year ago

patrickkusebauch commented 1 year ago

During a Slack conversation with Rubén Rubio, we found out that Deptrac was the slowest quality gate in his CI pipeline, slower even than running PHPStan and Psalm together. This should not be the case: we should be faster than either of these tools, because we do significantly less work. All three of us parse the AST (which should take about the same time), but then we only resolve dependencies, while Psalm and PHPStan do a lot more work resolving types and applying rules to those types.

Some numbers that came from a debug build of deptrac with perf counters:

[screenshot: perf-counter output, transcribed below]

AstMap created in 49.504072 seconds
start emitting dependencies "ClassDependencyEmitter"
end emitting dependencies in 0.262179 seconds
start emitting dependencies "UsesDependencyEmitter"
end emitting dependencies in 0.091963 seconds
start flatten dependencies
end flatten dependencies in 0.732068 seconds
formatting dependencies.

Obviously, the most expensive part is the AST parsing. We should look at how it is written, whether it can be parallelized, and whether we can take some of NikicPHPParser's performance suggestions into consideration.

patrickkusebauch commented 1 year ago

Slack thread link for reference: https://symfony-devs.slack.com/archives/C036EU1GZS9/p1681985816572369

ilnytskyi commented 1 year ago

I had been working with NikicPhpParser on a project similar to deptrac before I discovered deptrac :) I solved a similar problem this way:

  1. Collect files from the target dir (just scandir, no Symfony components). I think this step's cost is negligible, though.
  2. Get the number of CPUs (casting the output of nproc to int).
  3. array_chunk the found files into an estimated number of chunks based on the CPUs available.
  4. Prepare batch callbacks for a process manager class similar to https://github.com/magento/magento2/blob/2.4-develop/app/code/Magento/Indexer/Model/ProcessManager.php that leverages https://www.php.net/manual/en/function.pcntl-fork.php
  5. At this point the only problem was how to store the data, as PHP does not share objects between processes.
  6. I first tried a tmp file and later switched to sockets between the parent and child processes as IPC.
  7. At this point I reconstructed the dependency tree from the IPC text storage and proceeded with post-processing and formatters.
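Sketched in code, the steps above might look roughly like this. This is a hedged illustration, not deptrac or Magento code: parseInParallel and $parseFile are hypothetical names, and the function falls back to plain sequential parsing where the pcntl extension is unavailable (e.g. Windows, or a PHP build without it).

```php
<?php
// Hypothetical sketch of the fork-and-collect strategy outlined above.
// parseInParallel() and $parseFile are illustrative names, not real APIs.

function parseInParallel(array $files, callable $parseFile): array
{
    if ($files === []) {
        return [];
    }

    // Sequential fallback where pcntl is unavailable.
    if (!function_exists('pcntl_fork')) {
        return array_map($parseFile, $files);
    }

    // Steps 2-3: chunk the file list by the number of available CPUs.
    $cpus   = max(1, (int) shell_exec('nproc 2>/dev/null'));
    $chunks = array_chunk($files, (int) ceil(count($files) / $cpus));

    $sockets = [];
    foreach ($chunks as $chunk) {
        // Step 6: one socket pair per child serves as IPC, since PHP
        // objects are not shared between forked processes (step 5).
        [$parentEnd, $childEnd] = stream_socket_pair(
            STREAM_PF_UNIX, STREAM_SOCK_STREAM, STREAM_IPPROTO_IP
        );
        $pid = pcntl_fork();
        if ($pid === 0) { // child: parse its chunk, serialize, exit
            fclose($parentEnd);
            fwrite($childEnd, serialize(array_map($parseFile, $chunk)));
            fclose($childEnd);
            exit(0);
        }
        fclose($childEnd);
        $sockets[$pid] = $parentEnd;
    }

    // Step 7: reconstruct the combined result from the IPC text storage.
    $results = [];
    foreach ($sockets as $pid => $socket) {
        $results[] = unserialize(stream_get_contents($socket));
        fclose($socket);
        pcntl_waitpid($pid, $status);
    }

    return array_merge(...$results);
}
```

Note that a crashed child would leave a truncated stream here, which is the "broken dependency tree" failure mode mentioned later in this thread.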

The results were stunning.

But in deptrac you have this problem only on the first run; after that, .deptrac.cache is used.

It looks like you could do something similar for

\Qossmic\Deptrac\Core\Ast\AstLoader::createAstMap

and here

\Qossmic\Deptrac\Core\Ast\Parser\NikicPhpParser\NikicPhpParser::parseFile
\Qossmic\Deptrac\Core\Ast\Parser\Cache\AstFileReferenceFileCache::set

data should be written into a socket or directly into .deptrac.cache and then reloaded.

If there are CPUs available, the first run would be faster on Linux systems. If a child process fails, the dependency tree would be broken :( (this has never happened to me). It might also introduce unexpected behavior for custom event observers that collect data into a singleton object for later use: the singleton will be empty in the parent process, so users must be aware they need IPC for such cases.

To me deptrac works just fine; a few minutes in a pipeline for static tests is acceptable. But maybe someone else would like it to work faster, especially with git hook pre-push/pre-commit integrations, where the time is valuable.

patrickkusebauch commented 1 year ago

In CI you customarily do not have the cache file and do the whole analysis. Hence the times in the screenshot. I hope to pick the brains of developers of other SA tools like PHPStan and Rector, as they are both Czech (same as me) and easily reachable in some of the Czech PHP Slack communities.

Regardless, thanks for the tips, they might come in handy to anyone looking to implement this.

patrickkusebauch commented 1 year ago

Also, if you have your project public somewhere, I would love to take a look. There might be some useful gems there to integrate. :slightly_smiling_face:

rubenrubiob commented 1 year ago

I also dug a little bit into Deptrac analysis. As @ilnytskyi said, the only thing I found to improve the performance was to parallelize the analysis. I did not implement anything, but the only solution I found was to use pcntl_fork.

I used this strategy 10 years ago in some projects and the performance gain was huge: from running out of resources after hours of execution to finishing in just a few minutes. In my case, I used shared memory with the shmop_* functions, something like this: https://www.php.net/manual/en/function.pcntl-fork.php#115855

The thing is that pcntl_fork is only available on Linux, not on Mac or Windows. And I think that most Docker images do not include the extension unless it is explicitly installed. Even so, I guess it is worth trying.
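The shared-memory flavor described above could be sketched like this. It is a hypothetical illustration (forkWithSharedMemory is not a real API from any project mentioned here); it assumes the pcntl and shmop extensions and falls back to sequential execution where they are missing.

```php
<?php
// Hypothetical sketch of pcntl_fork + shmop_* shared memory as IPC:
// the parent reserves one segment per child, each child writes its
// serialized result there, and the parent reads it back after waitpid.

function forkWithSharedMemory(array $chunks, callable $work): array
{
    // Sequential fallback where pcntl/shmop are unavailable.
    if (!function_exists('pcntl_fork') || !function_exists('shmop_open')) {
        return array_merge(...array_map($work, $chunks));
    }

    $segments = [];
    foreach ($chunks as $i => $chunk) {
        // One System V shared-memory segment per child (1 MiB here).
        $key = ftok(__FILE__, chr(65 + $i));
        $shm = shmop_open($key, 'c', 0644, 1 << 20);
        $pid = pcntl_fork();
        if ($pid === 0) { // child: compute, write length-prefixed payload, exit
            $payload = serialize($work($chunk));
            shmop_write($shm, pack('N', strlen($payload)) . $payload, 0);
            exit(0);
        }
        $segments[$pid] = $shm;
    }

    $results = [];
    foreach ($segments as $pid => $shm) {
        pcntl_waitpid($pid, $status); // wait before reading the child's segment
        $len = unpack('N', shmop_read($shm, 0, 4))[1];
        $results[] = unserialize(shmop_read($shm, 4, $len));
        shmop_delete($shm); // release the segment
    }

    return array_merge(...$results);
}
```

Compared with sockets, shared memory needs explicit sizing and cleanup of the segments, but avoids streaming the payload through the kernel twice.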

On the other hand, I solved my performance problem by not including the vendor folder, like they explain here: https://github.com/qossmic/deptrac/issues/506#issuecomment-790677433. I think it is a common use case, so maybe it would be useful to add a section in the documentation explaining how to correctly include vendor dependencies.
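For anyone hitting the same issue, restricting the analysed paths is a small config change. A minimal sketch, assuming a current deptrac.yaml layout (adjust the key names to your deptrac version and config file):

```yaml
deptrac:
  # Analyse only your own code instead of the whole vendor tree.
  paths:
    - ./src
  # layers and ruleset definitions stay unchanged ...
```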

smoench commented 1 year ago

Just one minor improvement could be: removing the is_file check here, at least for the parser. As far as I remember, we only pass "valid" files to the parser, and the false return from file_get_contents would be enough to signal when something is wrong. This would save some I/O.

Another one, for the cache, would be switching the hash algorithm. I came across this PR from Rector: they switched to the xxh128 algorithm (available since PHP 8.1) and linked the following benchmark. This means that if we switch sha1_file to hash_file('xxh128', ...) in this file, we would gain some performance.
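A minimal illustration of the proposed swap (not the actual deptrac code; xxh128 requires PHP >= 8.1):

```php
<?php
// Compare the current and proposed cache-key hashes for a file.
// xxh128 is non-cryptographic and much faster than sha1, which is
// fine for a change-detection cache key.

$file = __FILE__;

$old = sha1_file($file);           // current: 160-bit, 40 hex chars
$new = hash_file('xxh128', $file); // proposed: 128-bit, 32 hex chars

echo strlen($old), ' ', strlen($new), PHP_EOL; // 40 32
```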

staabm commented 1 year ago

If you have a workload somewhere that reproduces the slowness, I would be happy to help and profile it.

patrickkusebauch commented 1 year ago

@staabm https://github.com/rubenrubiob/blog-src/actions/runs/4731655978/jobs/8396874015, https://github.com/rubenrubiob/blog-src/actions/runs/4796137317/jobs/8531580856

staabm commented 1 year ago

Oh my... if I were interested in performance, I would not run my static analysis tooling through Docker ;-).


Running it locally on my Mac Pro M1 takes 0.169 secs (without Docker):

blog-src git:(main) ✗ time vendor/bin/deptrac analyse --config-file=hexagonal-layers.depfile.yaml --report-uncovered --fail-on-uncovered --no-cache
 51/51 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓] 100%

 -------------------- ----- 
  Report                    
 -------------------- ----- 
  Violations           0    
  Skipped violations   0    
  Uncovered            0    
  Allowed              222  
  Warnings             0    
  Errors               0    
 -------------------- ----- 

vendor/bin/deptrac analyse --config-file=hexagonal-layers.depfile.yaml     0.13s user 0.04s system 97% cpu 0.169 total

I wonder why the CI pipeline scans so many files while locally only 51 are scanned. Is my command wrong?

patrickkusebauch commented 1 year ago

It is not that I would say people care about performance that much; it just does not compute to me why we take longer than PHPStan and Psalm combined. It was me who made this an issue. Deptrac should not be the slowest link in the CI pipeline; that is just not a good user experience.

staabm commented 1 year ago

Yeah, I was joking. If I can reproduce it locally, without Docker, I can look into it.

gennadigennadigennadi commented 1 year ago

@staabm it's the same for me: with my setup it also parses only 51 files (in Docker), but more than 6000 through the GitHub Action. Maybe it's not a deptrac bottleneck but a GitHub Action config problem?

A small improvement for the pipeline would be to use the --no-cache flag; this way, no cache writes are made by deptrac.

rubenrubiob commented 1 year ago

Sorry, I did not see the comments with the analysis. I improved Deptrac's configuration to only scan src files. You can see the performance degradation mentioned in the first comment if you include vendor in the paths analyzed by Deptrac.

gennadigennadigennadi commented 1 year ago

I did some profiling myself, but I'm not that experienced with it. What I found out is that one big performance penalty, besides the AST parsing, is building the Symfony service container. I'm curious to see your results @staabm.

gennadigennadigennadi commented 1 year ago

> Sorry, I did not see the comments with the analysis. I improved Deptrac's configuration to only scan src files. You can see the performance degradation mentioned in the first comment if you include vendor in the paths analyzed by Deptrac.

Okay, I see. I think we should work on the performance of deptrac in the future. If I got it right, the screenshot @patrickkusebauch shared was based on a deptrac config that parsed the whole vendor dir, right? But the other tools just parsed src?

rubenrubiob commented 1 year ago

> Sorry, I did not see the comments with the analysis. I improved Deptrac's configuration to only scan src files. You can see the performance degradation mentioned in the first comment if you include vendor in the paths analyzed by Deptrac.

> Okay, I see. I think we should work on the performance of deptrac in the future. If I got it right, the screenshot @patrickkusebauch shared was based on a deptrac config that parsed the whole vendor dir, right? But the other tools just parsed src?

Exactly, the first screenshot was taken when Deptrac scanned the vendor folder. And yes, the other tools only scan the src folder. It is as I said in a previous comment:

> On the other hand, I solved my performance problem by not including the vendor folder, like they explain here: #506 (comment) I think it is a common use case, so maybe it would be useful to add a section in the documentation explaining how to correctly include vendor dependencies.