squizlabs / PHP_CodeSniffer

PHP_CodeSniffer tokenizes PHP files and detects violations of a defined set of coding standards.
BSD 3-Clause "New" or "Revised" License

Caching between runs for better performance #530

Closed JDGrimes closed 8 years ago

JDGrimes commented 9 years ago

First let me thank you for this great tool. :+1:

I've been using this on my PHP projects, and I've found that it can take a while to sniff the code, especially on larger projects with complex configurations. Performance will naturally be determined largely by how well the sniffs being used are written. However, I think performance could be improved by caching hash signatures of the files being sniffed. Then only the files that have changed since the last run would need to be sniffed (there are some caveats, which I'll get to in a moment). This wouldn't improve the performance of the initial run (and might even degrade it slightly), but it would drastically improve performance for later runs.

As I noted above, there are some caveats:

There are probably other things I haven't thought of, maybe regarding interactive mode, reports, or automatic fixing, all of which I am unfamiliar with. And there would probably need to be an easy way for the user to bypass the cache as needed.

There are probably also other things that could be cached between runs on a project as well.

Exactly how the cache is saved is up to you. I was thinking of a .phpcs-cache file in the root of the project being sniffed that would contain the cache represented as a JSON object.

If this is something that you think could be done, I'd be happy to work up a PR if you'll give me a little guidance on how you'd like this implemented.

gsherwood commented 9 years ago

I am still undecided as to whether PHPCS itself should be doing things like this or whether a wrapper script should be used, like what is done for hooking into a VCS.

Of course, it's possible, and probably fairly easy to get going, but I would not introduce it into the current 2.x versions regardless. I'd start with the 3.0 version, which is still in heavy development here: https://github.com/squizlabs/PHP_CodeSniffer/tree/3.0

There are two options for how it could be done: something in the core plus some command line values, or a new script like the SVN pre-commit hook or phpcbf. Either way, it's going to be easier/cleaner to do in the 3.0 version because of the refactoring I've done there, but I'm not ready to accept any new features on that branch at the moment.

JDGrimes commented 9 years ago

Either way, it's going to be easier/cleaner to do in the 3.0 version because of the refactoring I've done there, but I'm not ready to accept any new features on that branch at the moment.

When you are ready, let me know, and if I can I'll try to make up a PR.

gsherwood commented 9 years ago

When you are ready, let me know, and if I can I'll try to make up a PR.

We can start talking about it now. The most important bit is deciding on an implementation before starting work.

Version 3 has a FileList class that just creates a list of files from a file system location. With some changes, different types of file lists could be provided to the core processing code to generate their file lists from various sources:

  1. The current one, which grabs a list of files from the file system
  2. An extension to the first, which only includes files that have been modified since they were last checked (cache file required)
  3. A VCS-based (or even Git-specific) extension to the first, which only includes files that have been modified locally (a command needs to be run to get the list)

Or, 2 and 3 could be implemented using some sort of filter class, which is passed to the main FileList. I think I prefer this to extending a base class, but I haven't thought about it long enough to be 100% sure.
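
To make this concrete, here is a minimal sketch of the filter approach. The FileFilter interface and GitModifiedFilter class are purely hypothetical names for illustration, not anything that exists in the code yet:

interface FileFilter
{
    // Returns true if the path should be included in the file list.
    public function accept($path);
}

// A filter for option 3: only keep files Git reports as locally modified.
class GitModifiedFilter implements FileFilter
{
    private $modified = array();

    public function __construct($repoRoot)
    {
        // `git status --porcelain` prints one "XY path" line per changed file.
        exec('git -C '.escapeshellarg($repoRoot).' status --porcelain', $lines);
        foreach ($lines as $line) {
            $this->modified[$repoRoot.'/'.substr($line, 3)] = true;
        }
    }

    public function accept($path)
    {
        return isset($this->modified[$path]);
    }
}

FileList would then run each candidate path through the active filter's accept() method before adding it to the queue.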

Interested to hear your thoughts.

gsherwood commented 9 years ago

One more thing to add. I think it would be good if people could specify their own filters (assuming we go that way) on the command line just like they can do for reports. So a command like --filter=/path/to/my/filter.php should be possible, as well as built-in filter types like --filter=modified and --filter=git (whatever the names are).

JDGrimes commented 9 years ago

I think the filters idea is a good one. I don't think it is a complete solution though, because there are probably other things which could be cached as well. For example, it might be beneficial to cache the parsed rulesets.

JDGrimes commented 9 years ago

One possible drawback to filters versus child classes is that the FileList class would still be traversing the whole directory unnecessarily (e.g., when using the git filter). Or were you thinking it would be implemented in a way that would avoid that?

aik099 commented 9 years ago

Maybe the RecursiveDirectoryIterator could be combined with a filter-provided RecursiveFilterIterator subclass to avoid scanning the sub-folders of a particular folder when the filter iterator decides they're not worth scanning.
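
For illustration only, a self-contained sketch of that SPL combination (the VendorSkippingFilter name and the 'vendor' rule are just examples):

class VendorSkippingFilter extends RecursiveFilterIterator
{
    public function accept()
    {
        // Returning false for a directory entry prunes its entire
        // sub-tree, so it is never scanned.
        return $this->current()->getFilename() !== 'vendor';
    }
}

$dirs  = new RecursiveDirectoryIterator('/path/to/project', FilesystemIterator::SKIP_DOTS);
$files = new RecursiveIteratorIterator(new VendorSkippingFilter($dirs));

foreach ($files as $file) {
    if ($file->getExtension() === 'php') {
        echo $file->getPathname(), PHP_EOL;
    }
}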

gsherwood commented 9 years ago

One possible drawback to filters versus child classes is that the FileList class would still be traversing the whole directory unnecessarily (e.g., when using the git filter). Or were you thinking it would be implemented in a way that would avoid that?

I was thinking about file modification times more than anything else, which really needs all the normal recursive directory iterator stuff, with an added check (once a file path is found) to see if the file needs rechecking.

But finding files is very fast, so I don't see why the same process won't work for Git. You could use the output from a git command and parse the paths in the output to support specified files, directories, and the local flag. But it might just be easier to let the recursive scan happen and then match the found file paths against a list of ones found by git.

You don't always want every modified or uncommitted file. Even though there is a cache of the ones that are modified, you may want to limit that by file type, extension, path, etc. I think it would be much easier if FileList was in charge of finding candidate files based on that logic, and a filter kicked in to limit the candidates based on the filter logic.

It could do this after the candidates have all been found, or during the recursive scan. After is easier, but during might be more efficient. If, for example, a directory does not contain any uncommitted files, you could have the FileList skip that dir and continue with the rest. A filter checking the modification times (or hashes) of files couldn't do that, but that's not a big deal.

gsherwood commented 9 years ago

One thing I actually didn't address in my previous comments was the initial request :smile: I got carried away talking about other related feature requests that I've received.

The filters are good for limiting the things that need to be checked, but we do need to inject the error reports for files that are skipped (due to the cache) but that the user still wants included in the generated report.

In this case, a filter might still be useful as it can block the checking of a file and return the report instead, but I think that would take a fair bit of refactoring in the code, and might make things worse.

What would be better is to just implement this functionality in the LocalFile class, which is loaded with a file path and asked by the Runner to process (tokenize and check) itself. Instead of always processing, it could load itself with a set of cached error messages if the file has not changed on disk. The DummyFile class (used mainly for STDIN) would not include this check as it doesn't have a file system location.

So a filter can kick in to limit the files to process, and the file itself could include a hash or modification time check to determine if it needs to be processed again. Two features.

It feels like each checked file should have its own cache so that filters and command line arguments don't get in the way of each other, but that would create a lot of files. Instead, PHPCS might need a Cache handler somewhere where files can register caches with a particular key. Same result, but a single cache file instead of hundreds. The cache handler could be responsible for maintaining the overall state of the run (standard and options used) and so could keep multiple caches if needed.
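
As a rough sketch of that shape (all names here are hypothetical, not committed code):

class Cache
{
    private $data = array();
    private $cacheFile;

    public function __construct($cacheFile)
    {
        $this->cacheFile = $cacheFile;
        if (is_file($cacheFile) === true) {
            $this->data = (array) json_decode(file_get_contents($cacheFile), true);
        }
    }

    public function get($key)
    {
        return isset($this->data[$key]) ? $this->data[$key] : null;
    }

    public function set($key, $value)
    {
        $this->data[$key] = $value;
    }

    public function save()
    {
        file_put_contents($this->cacheFile, json_encode($this->data));
    }
}

// A standalone illustration of the per-file check LocalFile could perform.
function processWithCache(Cache $cache, $path)
{
    $hash  = md5_file($path);
    $entry = $cache->get($path);

    if ($entry !== null && $entry['hash'] === $hash) {
        // Unchanged since the last run; replay the cached messages.
        return $entry['errors'];
    }

    $errors = runSniffs($path); // hypothetical stand-in for tokenize+check
    $cache->set($path, array('hash' => $hash, 'errors' => $errors));
    return $errors;
}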

gsherwood commented 9 years ago

I've put together a quick implementation of a hash-based cache based on my comment above. It's missing a lot of features, but I wanted to see if the implementation would work and what the results would be like. Obviously, a pretty big improvement in performance, for really minimal code.

This is the first run, with no cache file:

$ bin/phpcs --report=summary --standard=PHPCS src -lv
Registering sniffs in the PHP_CodeSniffer standard... DONE (55 sniffs registered)
Creating file list... DONE (5 files in queue)
Changing into directory .../PHP_CodeSniffer/src
Processing Config.php [PHP => 8287 tokens in 1156 lines]... DONE in 367ms (0 errors, 67 warnings)
Processing Fixer.php [PHP => 4533 tokens in 695 lines]... DONE in 185ms (1 errors, 27 warnings)
Processing Reporter.php [PHP => 2564 tokens in 386 lines]... DONE in 101ms (9 errors, 8 warnings)
Processing Ruleset.php [PHP => 7394 tokens in 997 lines]... DONE in 337ms (0 errors, 31 warnings)
Processing Runner.php [PHP => 3708 tokens in 525 lines]... DONE in 161ms (0 errors, 15 warnings)

PHP CODE SNIFFER REPORT SUMMARY
-------------------------------------------------------------------------------------
FILE                                                                 ERRORS  WARNINGS
-------------------------------------------------------------------------------------
.............................../PHP_CodeSniffer/src/Config.php       0       67
.............................../PHP_CodeSniffer/src/Fixer.php        1       27
.............................../PHP_CodeSniffer/src/Reporter.php     9       8
.............................../PHP_CodeSniffer/src/Ruleset.php      0       31
.............................../PHP_CodeSniffer/src/Runner.php       0       15
-------------------------------------------------------------------------------------
A TOTAL OF 10 ERRORS AND 148 WARNINGS WERE FOUND IN 5 FILES
-------------------------------------------------------------------------------------

Time: 1.25 secs; Memory: 20.25Mb

This is the second run, with the cache file in place:

$ bin/phpcs --report=summary --standard=PHPCS src -lv
Registering sniffs in the PHP_CodeSniffer standard... DONE (55 sniffs registered)
Creating file list... DONE (5 files in queue)
Changing into directory .../PHP_CodeSniffer/src
Processing Config.php [loaded from cache]... DONE in 0ms (0 errors, 67 warnings)
Processing Fixer.php [loaded from cache]... DONE in 0ms (1 errors, 27 warnings)
Processing Reporter.php [loaded from cache]... DONE in 0ms (9 errors, 8 warnings)
Processing Ruleset.php [loaded from cache]... DONE in 0ms (0 errors, 31 warnings)
Processing Runner.php [loaded from cache]... DONE in 0ms (0 errors, 15 warnings)

PHP CODE SNIFFER REPORT SUMMARY
-------------------------------------------------------------------------------------
FILE                                                                 ERRORS  WARNINGS
-------------------------------------------------------------------------------------
.............................../PHP_CodeSniffer/src/Config.php       0       67
.............................../PHP_CodeSniffer/src/Fixer.php        1       27
.............................../PHP_CodeSniffer/src/Reporter.php     9       8
.............................../PHP_CodeSniffer/src/Ruleset.php      0       31
.............................../PHP_CodeSniffer/src/Runner.php       0       15
-------------------------------------------------------------------------------------
A TOTAL OF 10 ERRORS AND 148 WARNINGS WERE FOUND IN 5 FILES
-------------------------------------------------------------------------------------

Time: 73ms; Memory: 8.5Mb

Running over the whole PHPCS dir is the difference between 22.13secs and 375ms, so this is good. But the cache file itself (which I'm JSON pretty-printing at the moment) is 7.2M, which isn't that great. If I turn off pretty printing, it comes down to 1.5M, but is then unreadable, so a different format could even be chosen if it ends up being faster. Still, I like JSON.
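
For reference, that size difference comes down to a single json_encode() flag (the variable names here are illustrative):

$readable = json_encode($cacheData, JSON_PRETTY_PRINT); // ~7.2M in the run above
$compact  = json_encode($cacheData);                    // ~1.5M, one long line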

I'll commit what I have after a bit more cleanup and we can take things from there.

gsherwood commented 9 years ago

I've pushed some commits for this. The main one is this https://github.com/squizlabs/PHP_CodeSniffer/commit/e5cc0abe93a111ab745a04d267c99de73292260e

But I forgot about unit testing, so I committed these 2 fixes as well: https://github.com/squizlabs/PHP_CodeSniffer/commit/f558de564610b7c00e22589cbb136113833244b2 https://github.com/squizlabs/PHP_CodeSniffer/commit/1471a7833d0ed6035a89b5fae9e11d4afe148730

If you use the --cache command line argument, PHPCS will write a .phpcs.xxxxxxxxxxxx.cache file into the current directory (where xxxxxxxxxxxx is a hash representing the config of the run), and subsequent runs will use the data within it for any files that have not changed.

If you run over part of the code base in one run, and another part during another run, but use the same config, the same cache file will be used (it will just get bigger). Similarly, if you run over your entire code base and cache everything, you can then do another run limiting the files to check and the cache file will still be used.
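
Purely to illustrate the naming scheme (this is not the committed code; $standard and $cliOptions are stand-in variables), the hash part could be derived from the run's configuration like so:

// Everything that affects the results goes into the hash, so runs with
// different standards or options get different cache files.
$configData = array(
    'standard' => $standard,   // e.g. 'PSR2'
    'options'  => $cliOptions, // the relevant CLI/config arguments
);
$hash      = substr(sha1(json_encode($configData)), 0, 12);
$cacheFile = getcwd().'/.phpcs.'.$hash.'.cache';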

I haven't committed anything to do with filtering of the file list.

This is still pretty dirty, so I'd appreciate any testing that anyone can do, and ideas for how to make things better.

One of the decisions I had to make was where to put the cache files. I decided on the current working directory instead of the temp dir for 2 reasons (1 good reason, 1 stupid reason):

  1. If you are running PHPCS over multiple code bases with the same config, you would end up with a single cache file containing every file ever checked with that config. The file could get massive, and it would be loading file info that you don't need, so you'd blow the memory limit.
  2. If you really want to, you could always commit the cache file to a repo periodically so that all devs can benefit from the current cache.

Number 1 could be worked around by including the current dir in the file hash, but you lose number 2 by doing that. I'm still not sure what the best place for these files is.

aik099 commented 9 years ago

If you are running PHPCS over multiple code bases with the same config, you would end up with a single cache file containing every file ever checked with that config. The file could get massive, and it would be loading file info that you don't need, so you'd blow the memory limit.

Agreed, but the current working directory might not be the best place either, because:

If you really want to, you could always commit the cache file to a repo periodically so that all devs can benefit from the current cache.

Yes, they can, but since the code base changes all the time, developers would need to run phpcs all the time and commit the cache file with every commit. And then we could have huge merge conflicts if 2 developers changed some code, resulting in changes to the non-pretty JSON (one large line), which diff would probably fail to merge correctly.

Making the cache directory configurable (e.g. --cache <cache dir>, with the directory name optional) would solve the problem of phpcs being invoked in different ways, by ensuring it looks at the same cache file no matter how it is invoked. PhpStorm, however, has no idea about this new option, and we'll need to wait for the PhpStorm 10 release (the current release is 8, but 9 is in EAP) for it to adopt that option.

aik099 commented 9 years ago

Also, the xxxxxxxxxxxx part of the cache filename needs to include the used sniff names (not just the standard name specified in the .phpcs file or via the --standard command line option).

Any of these changes should invalidate the cache:

What I believe would be the correct cache key derivation is:

VasekPurchart commented 9 years ago

+1 for configurable cache dir

gsherwood commented 9 years ago

Making the cache directory configurable (e.g. --cache <cache dir>, with the directory name optional) would solve the problem

I'm surprised I didn't include that option in my comment because I had it in my notes. Yes, this is also exactly what I was thinking, and for the exact reasons you've listed.

The real question though is whether the file should be in the system temp dir instead. So my plan was to make the system temp dir the default file location but allow it to be changed using a CLI arg or config var. Sound ok?

Also, the xxxxxxxxxxxx part of the cache filename needs to include the used sniff names (not just the standard name specified in the .phpcs file or via the --standard command line option).

I can't detect that the PHP code inside a sniff file has changed. But I can hash the parsed ruleset object and include that in the main cache hash in case you are tweaking the ruleset.xml file.

I already include all relevant CLI and config arguments in the cache hash, and do the hashing just before the run is about to commence, so I think the only change required is to look at the parsed ruleset.
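
As a sketch of that change (assuming the parsed ruleset object can be serialized; $configHash is a stand-in for the existing hash):

// Mix the parsed ruleset's state into the cache hash, so edits to
// ruleset.xml invalidate the cache even when the standard name is the same.
$rulesetHash = md5(serialize($ruleset));
$cacheHash   = md5($configHash.$rulesetHash);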

If I ever add the ability to change the ruleset used in each directory, life might get hard for the caching system. But I guess I can fix that when it happens.

aik099 commented 9 years ago

The real question though is whether the file should be in the system temp dir instead.

It could create a problem on a developer machine because errors from all projects would end up in the same file (by default), and cache reading time for all projects could increase if a single large project on the machine is cached. But in my particular case I'm giving phpcs an absolute path to scan, and therefore I'll end up with different caches per project, all stored in the temp dir, which is very good.

I can't detect that the PHP code inside a sniff file has changed.

Remembering the file size of the sniff would be enough (faster than doing a CRC on it), since any significant change to the code would result in a file size change.
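
To illustrate the cost difference (variable names are illustrative):

$size = filesize($sniffFile);                 // cheap: a single stat() call
$crc  = crc32(file_get_contents($sniffFile)); // has to read the whole file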

VasekPurchart commented 9 years ago

The real question though is whether the file should be in the system temp dir instead.

I think if the default location is the temp dir, then part of the project or analyzed path(s) should somehow be present in the name too:

1) to avoid "merging" changes of different projects together, for the reasons given above
2) to help with manually deleting the cache (in my experience this is needed from time to time in every system that uses a cache, when all other methods fail).

aik099 commented 9 years ago

1) to avoid "merging" changes of different projects together, for the reasons given above

If only we could easily detect where the project root is. For example, in the cases above, the project folder is obviously (to a human) /Users/alex/Projects/project_a/:

phpcs --standard=/path/to/CustomStandard /Users/alex/Projects/project_a/the_file.php
phpcs --standard=/path/to/CustomStandard /Users/alex/Projects/project_a/sub_folder

But a computer can't really guess that. If we could rely on some kind of marker (e.g. a .phpcs file in the project root), then we could easily split the cache automatically into different files for different projects.

By the way, is the .phpcs file something new (e.g. added in the 2.0 version) or has it been there all along?

gsherwood commented 9 years ago

Knowing the project root is the major problem. You can't just include the path you are checking in the hash or filename, because then checking a sub-dir of the project would force a completely new cache to be used, even though the files themselves have already been checked.

But I really don't know how to determine the project root automatically.

Using the phpcs.xml file is one possible option. If you include that in the root of your project, PHPCS will find it when no standard is given and use it like a ruleset (it sets project defaults and works better in 3.0). The fact that it exists at a particular location means that it is sitting in the project root, or in a sub-project under the main project root (presumably with different rules). This would force the use of a phpcs.xml file for the best possible caching, but we'd still need sensible defaults.

By the way, is the .phpcs file something new (e.g. added in the 2.0 version) or has it been there all along?

I don't know what file you are talking about.

aik099 commented 9 years ago

I don't know what file you are talking about.

The file that can be used as a per-project PHP_CodeSniffer.conf file. I guess it's phpcs.xml then.

JDGrimes commented 9 years ago

If you really want to, you could always commit the cache file to a repo periodically so that all devs can benefit from the current cache.

Yes, I'd like to have that option.

But I really don't know how to determine the project root automatically.

Using the phpcs.xml file is one possible option.

I think that would be a good default, falling back to the temp directory.

I can't detect that the PHP code inside a sniff file has changed.

Remembering the file size of the sniff would be enough (faster than doing a CRC on it), since any significant change to the code would result in a file size change.

I think this would be useful.

I can hash the parsed ruleset object and include that in the main cache hash in case you are tweaking the ruleset.xml file.

Then what happens if you add a rule to the ruleset? I'm guessing that PHPCS will run all of the rules over the files. Would it be possible to detect which rules have been added/changed/removed and only run them?

It might also be nice to have a command that will clean the cache, deleting all cache files that don't match the current configuration. But I'm not sure if that would be possible the way it's working currently. It seems to me that right now it would be subject to lots of cache bloat over time as the ruleset changes.

aik099 commented 9 years ago

It might also be nice to have a command that will clean the cache, deleting all cache files that don't match the current configuration.

Yes, this way we can delete the cache without even knowing where it's located.

gsherwood commented 9 years ago

Then what happens if you add a rule to the ruleset? I'm guessing that PHPCS will run all of the rules over the files.

Yep. It would have to do them again.

Would it be possible to detect which rules have been added/changed/removed and only run them?

That would require a completely different setup for the run, and some sort of merge code for the resulting checks. The same would be true if you ran PHPCS with a single sniff after running an entire standard. All the errors are there, so the file just needs to filter them based on the sniffs you have asked to filter with. It's possible, but much more complex code. I think we need to get the basics right first, but we can come back to this.

We've also spoken a lot about what happens when rulesets are changing and sniff PHP code is changing, but this is not what the vast majority of developers are doing. They are running PHPCS over their changing codebase, not over a changing standard. The standard will get updated from time to time, but I think it is really important not to design a system that is painful and/or slow just because we want to use caching while we are also tweaking standards.

A command to wipe the cache is a given. If a developer updates the coding standard (maybe they pull a new version) or updates PHPCS itself, they will need to clear the cache. It would be nice if they didn't have to remember to do that, but it might be necessary. By looking at everything that gets loaded during the run (the autoloader keeps track of this), we might be able to check whether any piece of code has changed. I'll give it a try.

gsherwood commented 9 years ago

I can't detect that the PHP code inside a sniff file has changed

Apparently I can, and I have committed that change as well. Now if any of the PHPCS core code changes, or if the loaded sniffs change, or if the code in the loaded sniffs changes, the cache is invalidated.
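
In plain-PHP terms, the check amounts to something like this simplified illustration (not the committed code):

// Hash the contents of every loaded PHP file; if core code or any loaded
// sniff changes, the combined hash (and so the cache key) changes too.
$codeHashes = '';
foreach (get_included_files() as $file) {
    $codeHashes .= md5_file($file);
}
$codeHash = md5($codeHashes);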

aik099 commented 9 years ago

That's great news.

JDGrimes commented 9 years ago

Thank you for all your hard work on this @gsherwood!

gsherwood commented 9 years ago

Thank you for all your hard work on this @gsherwood!

Thanks for the idea. Not done yet though.

The things to address still are:

  1. Location of cache files
  2. Using different reports loads different report files, so the file hashing does not match
  3. If reports are excluded from file hashing (which should be fine because they don't affect the results), then warnings need to be stored even when a report says they are not needed, but only when caching is enabled (higher initial memory usage)
  4. AbstractPatternSniff uses the PHP tokenizer during setup, so it loads tokenizer files while specifying individual sniffs from the same standard may not, causing a file hash mismatch and no cache usage
  5. Tokenizers affect the way a file is parsed and the errors found, so they really should be included in the file hash

Possible solutions:

  1. Combination of temp dir, phpcs.xml file location and CLI option
  2. See 5.
  3. Just store them and make sure the reports still filter them out
  4. See 5.
  5. Instead of looking at loaded files, only look at loaded files outside the PHPCS root dir, and any loaded files in the src/Standards dir, to create a dynamic hash. Then create a hash for the contents of all/most (maybe not reports, generators, etc.) core PHPCS files instead of just the ones that have been loaded so far. This lets you know when the core has changed and lets you see if the used sniffs have changed.

JDGrimes commented 9 years ago

  1. Combination of temp dir, phpcs.xml file location and CLI option

I think that this combination of options would be good. I do have one concern though: I sometimes have the phpcs.xml file symlinked from a different directory. In this case, I'd want the cache file to be stored in the directory the symlink is in, not the directory it is being symlinked from. But I guess if it didn't work that way, I could easily use the CLI option to do what I want.

gsherwood commented 9 years ago

I've committed a change that solves issues 2, 3, 4, and 5 above. The last thing I need to sort out is cache file storage and clearing. More info about what I ended up doing is in the commit message.

gsherwood commented 9 years ago

Cache files are now stored in the temp dir. See commit above for info.

I still need to add a new option to allow a directory to be specified instead of the system temp dir. If a directory is specified, I won't bother checking for common paths, or using the common path SHA1 in the cache file name, which makes things a little easier.

gsherwood commented 8 years ago

I think I'm going to leave out the option of setting your own cache directory or cache file location until after this feature gets used a bit. Making it more complex is probably not the right thing to do at this stage.

gsherwood commented 8 years ago

I think I'm going to leave out the option of setting your own cache directory or cache file location

I changed my mind on the cache file bit. You can now pass --cache=/path/to/cacheFile to have PHPCS use a specific file for caching. But if the standard changes, or your CLI options change and cause the cache to be invalidated, the file will be replaced with the new cache data. When just using --cache, you can swap between standards without any data being cleared and without having to specify different cache file locations.

This may become a non-issue if support is added for setting the cache file in a ruleset.xml file using a path relative to the ruleset itself.