Documentation on usage for large projects

Hvanderwilk commented 3 years ago

I've been trying to configure Psalm in the CI for a large project, around 9M LOC. While I was quite content with the results on even the highest analysis level in terms of issues found, due to performance reasons I might be unable to use it. A full project run takes around 20 minutes, 16G+ of memory and the result cache size is 13G. The bright side is that when the cache was used the analysis was relatively very fast, sometimes around a minute.

Some issues here mention similar sized projects, but not with enough information to get a decent setup myself. Some documentation on how to successfully get this up and running and what to expect in terms of performance would help a lot.

My next attempt would probably be to virtually split the project up into multiple analyses. If I happen to find a working setup, I'm happy to contribute to this documentation. Until then I hope you have some suggestions 🙂

Background on how I'm running it: ./vendor/bin/psalm --memory-limit=20G --long-progress. If I'm correct this means max number of threads is used and caching is on by default.

Configuration (removed some project related names/directories):

<?xml version="1.0"?>
<psalm
        errorLevel="8"
        cacheDirectory="cache/psalm"
        allowStringToStandInForClass="true"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns="https://getpsalm.org/schema/config"
        xsi:schemaLocation="https://getpsalm.org/schema/config vendor/vimeo/psalm/config.xsd"
>
    <projectFiles>
        <directory name="src/*/code" />
        <ignoreFiles>
            <directory name="vendor" />
        </ignoreFiles>
    </projectFiles>

psalm-github-bot[bot] commented 3 years ago

Hey @Hvanderwilk, can you reproduce the issue on https://psalm.dev ?

orklah commented 3 years ago

If you use git, I think the 1 minute long analysis was due to the diff functionnality rather than just the cache.

The diff functionnality checks what changed since last time to analyse only relevant files. It's pretty fast when there is few modifications.

I'd suggest trying without the cache actually. On my project, it uses so much ram and has to write so much I can't even run a full analysis when cache is enabled. Maybe you would have a similar issue.

I'd advise against running only part of your project unless there is no concrete dependancies (i.e. there is an API between the two). Some functionnalities of Psalm will need to have your full project to run correctly (for example, unused code detection)

orklah commented 3 years ago

You can also add reportInfo=false in your psalm config. It's a flag I added to get rid of extra time spent by Psalm to handle issues below your errorLevel (to be able to report them as info if needed). It should be the most efficient on high errorLevel when most of the issues are at lower level (which seems to be your case here)

Please report here if you notice an effect with that flag, it was pretty efficient when doing my tests some time ago but I don't use it often and I lack datas on that.

weirdan commented 3 years ago

If you're using cache make sure you're restoring modification times after git clone in CI. Here's how we do it (Github actions workflow):

      - name: Checkout
        uses: actions/checkout@v2
        with:
          # infinite depth is required for mtime restore to work correctly
          fetch-depth: 0

      # mtime needs to be restored for Psalm cache to work correctly
      - name: Restore mtimes
        uses: weirdan/git-restore-mtime-action@master

      # the way cache keys are set up will always cause a cache miss
      # but will restore the cache generated during the previous run
      # based on partial match
      - name: Retrieve Psalm's cache
        uses: actions/cache@v2
        with:
            # we really need a way for Psalm to tell where cache is located
            path: |
              ~/.cache/psalm
            key: ${{ runner.os }}-psalm-cache-v3-${{ github.sha }}
            restore-keys: |
              ${{ runner.os }}-psalm-cache-v3-

Hvanderwilk commented 3 years ago

The diff functionnality checks what changed since last time to analyse only relevant files. It's pretty fast when there is few modifications.

@orklah I thought this was part of the caching. So this means that running with --no-cache will still use this diff? This still needs to store some information about previous results though I'd imagine. Would be curious on how that works and if this diff information could possibly be shared so it can be used on multiple runners for example.

You can also add reportInfo=false in your psalm config.

@orklah I'll have a try and report back!

If you're using cache make sure you're restoring modification times after git clone in CI. Here's how we do it (Github actions workflow)

@weirdan thanks for the heads up. We use mtime for other caching too and the fetching strategy we use seems to minimize this mostly.

orklah commented 3 years ago

I don't think you can use the diff functionnality without the cache (but you can use cache without the diff functionnality though).

Hvanderwilk commented 3 years ago

More findings:

reportInfo=false speeds it up slightly but not significantly.
I noticed in CI threads default to 1. Increasing threads makes the job hang on scanning files for 15 minutes.

Using cache or sharing psalm cache doesn't seem to be an option for me given both disk size and RAM usage. If the threading would work that would perhaps be a solution.

I'd advise against running only part of your project unless there is no concrete dependancies (i.e. there is an API between the two). Some functionnalities of Psalm will need to have your full project to run correctly (for example, unused code detection)

Coming back to that, I'm curious which other functionalities would suffer. I guess there is some overhead when splitting the analysis up, but that might be worth it if it's just dead code (more interested in a fast integration than that)

orklah commented 3 years ago

Did you try without cache with --no-cache (it disable not only cache reading but also writing so it should be faster than first run but possibly slower for consecutive runs)

As to other functionnalities, I'm thinking about Psalter (but it shouldn't be an issue in CI). There's probably others but none that I can think of

There will be some overhead, for example, Psalm will have to analyse common dependancies twice, things like that.

Here are a few more things you could try:

try running blackfire on the execution to see if there's bottlenecks (in my project we discovered two files with exotic code that were taking minutes each to analyse)
Alternatively, if you can run psalm locally, we could add one or two lines of code in order to output the time psalm takes to analyze each method so you could identify what takes time (and possibly what you could ignore)
in issueHandler in you psalm config, you could add every issue below level 8 with an errorLevel to suppress. That's how psalm did before the integration of levels directly in config. I wonder if explicitly suppressing issues wouldn't be more efficient than reportInfo

Hvanderwilk commented 3 years ago

I've ran it locally with --debug on, doesn't seem to be stuck on any file for a significant amount of time.
Am now indeed using --no-cache speedup is not noticeable.

Will probably experiment with splitting the analysis, trying to enable threading in CI and the other suggestions you made. Thanks for the suggestions thus far! Will report back here with any additional findings.

orklah commented 3 years ago

* Am now indeed using `--no-cache` speedup is not noticeable.

Not even in the first run? That seems weird to me, encoding and writing 13Go in multiple files should be a big part of the execution... removing it should be noticeable.

Hvanderwilk commented 3 years ago

I'll do a proper test when investigating further. Meant more that it didn't seem to halve the time 🙂

muglug commented 3 years ago

Is it 9M lines of code just in your src directory?

michabbb commented 3 years ago

it "can" make a difference, using php8+jit for checking, even if the code is written for php7. in my special case, I was able to cut the time in half. but of course, that depends on the project itself. I just wanted to mention it because in large projects it might have a bigger effect than in a small one.

Hvanderwilk commented 3 years ago

Is it 9M lines of code just in your src directory?

Basically, yes. However, the project depends on generated files. I've limited the actual analysis to human produced code which is around ~1.5M LOC. I've also verified the analysis is limited to these files and the --debug option indicates that the generated files are not deep scanned either

Hvanderwilk commented 3 years ago

Ran some test on a local machine. The results are as reported by time (the first is the actual time in seconds). JIT indeed seems like an improvement!

Configuration	Result	Extra info
`php@7.4 ./vendor/bin/psalm --memory-limit=20G --long-progress --no-cache --threads=1`	`3165.12s user 55.35s system 92% cpu 58:13.48 total`
`php@8.0 ./vendor/bin/psalm --memory-limit=20G --long-progress --no-cache --threads=1`	`2178.50s user 41.64s system 94% cpu 39:00.33 total`	Didn't use JIT because forgot it wasn't on by default
`php@8.0 -dopcache.enable_cli=1 -dopcache.jit_buffer_size=100M -dopcache.jit=1255 ./vendor/bin/psalm --memory-limit=20G --long-progress --no-cache --threads=1`	`1305.79s user 34.35s system 95% cpu 23:28.89 total`
`php@8.0 -dopcache.enable_cli=1 -dopcache.jit_buffer_size=100M -dopcache.jit=1255 ./vendor/bin/psalm --memory-limit=20G --long-progress --threads=1`	`1267.71s user 146.77s system 87% cpu 27:01.90 total`
`php@8.0 -dopcache.enable_cli=1 -dopcache.jit_buffer_size=100M -dopcache.jit=1255 ./vendor/bin/psalm --memory-limit=20G --long-progress --threads=1`	`13.01s user 5.17s system 85% cpu 21.289 total`	Run using cache, no files changed.
`php@8.0 -dopcache.enable_cli=1 -dopcache.jit_buffer_size=100M -dopcache.jit=1255 ./vendor/bin/psalm --memory-limit=20G --long-progress --threads=1`	Fatalled, 20G exhausted* + `61.51s user 29.71s system 66% cpu 2:16.57 total`	~200 file diff
`php@8.0 -dopcache.enable_cli=1 -dopcache.jit_buffer_size=100M -dopcache.jit=1255 ./vendor/bin/psalm --memory-limit=25G --long-progress --threads=1`	`196.10s user 43.97s system 77% cpu 5:10.50 total`	~200 file diff

PHP Fatal error: Allowed memory size of 21474836480 bytes exhausted (tried to allocate 20480 bytes) in ./vendor/vimeo/psalm/src/Psalm/Internal/Codebase/Populator.php on line 1114 Fatal error: Allowed memory size of 21474836480 bytes exhausted (tried to allocate 20480 bytes) in ./vendor/vimeo/psalm/src/Psalm/Internal/Codebase/Populator.php on line 1114 php -dopcache.enable_cli=1 -dopcache.jit_buffer_size=100M -dopcache.jit=1255 61.51s user 29.71s system 66% cpu 2:16.57 total

orklah commented 3 years ago

Cool! Thanks for the detailed report!

Based on your examples, if we exclude the first run, it seems like consecutive runs are pretty fast. What is your acceptable threshold?

It could be interesting to check why the threads option takes so long, it could be a big lever.

muglug commented 3 years ago

All I can really offer is my experience running Psalm at Vimeo, where we have a much smaller codebase – 1M LOC, with an another 2M in the vendor dir (and it takes roughly 95 seconds uncached)

PHP probably isn't the best tool for scanning very large codebases – VK's Noverify, written in Go, is an order of magnitude faster, but it has considerably more bugs and lacks a lot of features.

muglug commented 3 years ago

I've occasionally gone on performance-focussed missions, and I don't think there's a significant amount of extra performance that can be squeezed out of Psalm itself while staying in the PHP runtime.

I'd love to be proven wrong, though!

muglug commented 3 years ago

I noticed in CI threads default to 1. Increasing threads makes the job hang on scanning files for 15 minutes.

What's your CI environment?

Hvanderwilk commented 3 years ago

It could be interesting to check why the threads option takes so long, it could be a big lever.

Will be looking into this! Acceptable would be around 4-5 min runs, so usability would also be influences by how many entire-cache invalidations are going on.

PHP probably isn't the best tool for scanning very large codebases

Agreed, sadly PHP at the same time doesn't punish large code bases with compilation time 🙂. However, like @orklah indicates if it can be parallelized this might not be such a big issue. Furthermore, your codebase is also quite large so there might still be some scaling issues there.

Also I think Psalm has a lot of features I'd be interested in and I don't think the issues are unresolvable, either through computing resources or optimisations.

What's your CI environment?

GitLab, haven't actually tested to what number of threads it defaults, just noticed so in the code.

muglug commented 3 years ago

GitLab, haven't actually tested to what number of threads it defaults, just noticed so in the code.

I think the ultimate solution might be to investigate more beefy environments. I've seen reports that GitHub's default environment is reasonably puny, and GitLab sounds like it has the same issues.

At Vimeo our CI environments have lots of cores, allowing Psalm to run with many threads/processes which makes it much faster (around 5x) than with --threads=1

Hvanderwilk commented 3 years ago

I think the ultimate solution might be to investigate more beefy environments.

I had already been running on pretty good environments (32GB RAM, 16 vCPUs). There's some problem during the scanning files phase when using more threads. Debugging it locally it seems they stay at the Collecting data from forked scanner process, for quite a long time before continuing, even though all forked processes reported that message. I'll try to do some speed tests / figure out if that's actually the cause of the delay.

UPDATE:

Interestingly, running on a single thread takes 40 mins on the CI runner. Not specifying any threads takes 13. I had tried specifying various number of threads but all seemed very slow

Hvanderwilk commented 3 years ago

Final update for now, successfully integrated psalm into our CI. Setup is as follows:

Psalm runs in 16 processes.
Caching is on, runners need to populate cache around once a day which takes around 10 mins.
Subsequent runs with cache take around 3 minutes and up to 8 GB RAM.
The final reason this was possible was because I switched the php version in psalm.xml to 8.0. For some reason this reduced the time from >25 mins to what it currently is, without even running on php 8.0 for analyzing.

Wont have time to contribute to documentation, though I guess this issue is self-documenting.

Thanks for the help!

umulmrum commented 3 years ago

I have a similar problem, I hope it's close enough to justify adding it here.

The code base I'm working in is definitely smaller than OP's (<1M LOC, limited scope to ~500k LOC for investigation) but I'm also experiencing huge memory consumption of >8 GiB with the process getting killed if memory limit is removed. With the help of the comments here I could track the issue down to the problem described here: https://github.com/vimeo/psalm/issues/5327#issuecomment-800053497

To find the issue I followed these steps:

Run Psalm with --debug to get information on which files are analyzed.
Watch the progress to see which file takes a lot of time or which is the last file before the crash.
Look at the file to find suspicious code, comment it out and retry.

While this worked in my case, it feels way too close to trial-and-error. So my question is if there is a better approach. If yes, could you add it to the docs? If no, could you consider adding additional debugging capabilities? Not sure what to suggest here, maybe it's enough to add a timestamp and memory consumption to each debug statement output?

Thank you for your work! :-)

dbalabka commented 2 years ago

Ran some test on a local machine. The results are as reported by time (the first is the actual time in seconds). JIT indeed seems like an improvement!

@Hvanderwilk, I would suggest running Psalm analysis with enabled OpCache and disabled JIT because OpCache itself provides multiple code optimizations. Therefore it is not clear from the results are these performance improvements are provided by OpCache or JIT.

php@8.0 -dopcache.enable_cli=1 -dopcache.jit=0 ./vendor/bin/psalm --memory-limit=20G --long-progress --no-cache --threads=1

faizanakram99 commented 1 year ago

If you're using cache make sure you're restoring modification times after git clone in CI. Here's how we do it (Github actions workflow):

      - name: Checkout
        uses: actions/checkout@v2
        with:
          # infinite depth is required for mtime restore to work correctly
          fetch-depth: 0

      # mtime needs to be restored for Psalm cache to work correctly
      - name: Restore mtimes
        uses: weirdan/git-restore-mtime-action@master

      # the way cache keys are set up will always cause a cache miss
      # but will restore the cache generated during the previous run
      # based on partial match
      - name: Retrieve Psalm's cache
        uses: actions/cache@v2
        with:
            # we really need a way for Psalm to tell where cache is located
            path: |
              ~/.cache/psalm
            key: ${{ runner.os }}-psalm-cache-v3-${{ github.sha }}
            restore-keys: |
              ${{ runner.os }}-psalm-cache-v3-

@weirdan

hey bruce, how much time does it save in total compared to --no-cache (considering fetch depth 0 and settting up actions/cache will also take sometime). Our psalm checks finish within 2 minutes (and 10 seconds), so wondering if setting up cache is worth it.

Also, could u share the workflow for latest psalm and actions/cache versions (if it isn't too much trouble). Thank you

weirdan commented 1 year ago

how much time does it save in total

It depends on the amount (and nature) of changes in the commit being built. Sometimes psalm can finish in a couple of seconds.

Also, could u share the workflow for latest psalm and actions/cache versions

The workflow snippet was taken from the latest version we had at the time. Perhaps we need to update it.

ro0NL commented 1 year ago

all this should be documented, ref https://github.com/vimeo/psalm/discussions/9911

apparently we should "Restore mtimes", but fetch-depth: 0 seems to be a no-go (+1m)

also not sure if https://github.com/chetan/git-restore-mtime-action/compare/master...weirdan:git-restore-mtime-action:master is relevant, i'd prefer upstream

i cant pinpoint any of this if it works properly

BafS commented 10 months ago

I think the ultimate solution might be to investigate more beefy environments.

I had already been running on pretty good environments (32GB RAM, 16 vCPUs). There's some problem during the scanning files phase when using more threads. Debugging it locally it seems they stay at the Collecting data from forked scanner process, for quite a long time before continuing, even though all forked processes reported that message. I'll try to do some speed tests / figure out if that's actually the cause of the delay.

UPDATE:
* Interestingly, running on a single thread takes 40 mins on the CI runner. Not specifying any threads takes 13. I had tried specifying various number of threads but all seemed very slow

We have the same issue when using multiple threads. Using the --debug flag we end up having a Collecting data from forked scanner process and then nothing, it just exit with a failure. The workaround is to not use threads.

vimeo / psalm

Documentation on usage for large projects #4973