Closed JDGrimes closed 8 years ago
I am still undecided as to whether PHPCS itself should be doing things like this or whether a wrapper script should be used, like what is done for hooking into a VCS.
Of course, it's possible, and probably fairly easy to get going, but I would not introduce it into the current 2.x versions regardless. I'd start with the 3.0 version, which is still in heavy development here: https://github.com/squizlabs/PHP_CodeSniffer/tree/3.0
How it is done is 1 of 2 options: something in the core and some command line values, or a new script like the SVN pre-commit hook or phpcbf. Either way, it's going to be easier/cleaner to do in the 3.0 version because of the refactoring I've done there, but I'm not ready to accept any new features on that branch at the moment.
When you are ready, let me know, and if I can I'll try to make up a PR.
We can start talking about it now. The most important bit is deciding on an implementation before starting work.
Version 3 has a FileList class that just creates a list of files from a file system location. With some changes, different types of file lists could be provided to the core processing code to generate their file lists from various sources:
Or, 2 and 3 could be implemented using some sort of filter class, which is passed to the main FileList. I think I prefer this to extending a base class, but I haven't thought about it long enough to be 100% sure.
Interested to hear your thoughts.
One more thing to add. I think it would be good if people could specify their own filters (assuming we go that way) on the command line just like they can do for reports. So a command like --filter=/path/to/my/filter.php should be possible, as well as built-in filter types like --filter=modified and --filter=git (whatever the names are).
I think the filters idea is a good one. I don't think it is a complete solution though, because there are probably other things which could be cached as well. For example, it might be beneficial to cache the parsed rulesets.
One possible drawback to filters versus child classes, is that the FileList class would still be traversing the whole directory unnecessarily (e.g., when using the git filter). Or were you thinking it would be implemented in a way that would avoid that?
Maybe the RecursiveDirectoryIterator can be combined with a filter-provided RecursiveFilterIterator sub-class to avoid scanning the sub-folders of a particular folder when the filter iterator decides that it's not worth scanning.
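The pruning idea can be sketched as follows. This is an illustration in Python, not PHPCS code: `os.walk` allows the directory list to be edited in place, which has the same effect a RecursiveFilterIterator sub-class would have in PHP. The function name and the two callbacks are hypothetical.

```python
import os


def find_files(root, keep_dir, keep_file):
    """Walk root, never descending into directories the filter rejects."""
    found = []
    for current, dirs, files in os.walk(root):
        # Editing dirs in place tells os.walk which sub-folders to skip.
        dirs[:] = sorted(d for d in dirs if keep_dir(os.path.join(current, d)))
        for name in sorted(files):
            path = os.path.join(current, name)
            if keep_file(path):
                found.append(path)
    return found
```

A filter that knows a folder contains nothing of interest returns False from `keep_dir`, and the whole sub-tree is skipped without being read.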
One possible drawback to filters versus child classes, is that the FileList class would still be traversing the whole directory unnecessarily (e.g., when using the git filter). Or were you thinking it would be implemented in a way that would avoid that?
I was thinking about file modification times more than anything else, which really needs all the normal recursive directory iterator stuff, with an added check (once a file path is found) to see if it needs rechecking.
But finding files is very fast, so I don't see why the same process won't work for Git. You could use the output from a git command and parse the paths in the output to support specified files, directories and the local flag. But it might just be easier to let the recursive scan happen and then match the found file paths to a list of ones found by git.
You don't always want every modified or uncommitted file. Even though there is a cache of the ones that are modified, you may want to limit that by file type, extension, path etc. I think it would be much easier if FileList was in charge of finding candidate files based on that logic, and a filter kicked in to limit the candidates based on the filter logic.
It could do this after the candidates have all been found, or during the recursive scan. After is easier, but during might be more efficient. If, for example, a directory does not contain any uncommitted files, you could have the FileList skip that dir and continue with the rest. A filter checking the modification times (or hashes) of files couldn't do that, but that's not a big deal.
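A minimal sketch of the "scan first, then match against git" approach, in Python for illustration only. It assumes the paths come from `git status --porcelain` (version 1 format: two status characters, a space, then the path) and that candidate paths are relative to the repository root. Both function names are hypothetical, and quoted paths and untracked directories are deliberately not handled.

```python
def parse_porcelain(output):
    """Extract paths from `git status --porcelain` (v1) output."""
    paths = set()
    for line in output.splitlines():
        if len(line) < 4:
            continue
        path = line[3:]
        # Renames are reported as "old -> new"; keep the new path.
        if " -> " in path:
            path = path.split(" -> ", 1)[1]
        paths.add(path)
    return paths


def filter_candidates(candidates, changed):
    """Keep only the scanned candidates that git reports as changed."""
    return [c for c in candidates if c in changed]
```

The file list stays in charge of extensions and ignore rules; the filter only narrows the candidates it is handed.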
One thing I actually didn't address in my previous comments was the initial request :smile: I got carried away talking about other related feature requests that I've received.
The filters are good for limiting the things that need to be checked, but we do need to inject the error reports for files that are skipped from checking (due to the cache) but that the user still wants included in the generated report.
In this case, a filter might still be useful as it can block the checking of a file and return the report instead, but I think that would take a fair bit of refactoring in the code, and might make things worse.
What would be better is to just implement this functionality in the LocalFile class, which is loaded with a file path and asked to process (tokenize and check) itself by the Runner. Instead of always choosing to process, it could instead load itself with a set of cached error messages if the file has not changed on disk. The DummyFile class (used mainly for STDIN) would not include this check as it doesn't have a file system location.
So a filter can kick in to limit the files to process, and the file itself could include a hash or modification time check to determine if it needs to be processed again. Two features.
It feels like each checked file should have its own cache so that filters and command line arguments don't get in the way of each other, but that would create a lot of files. Instead, PHPCS might need a Cache handler somewhere where files can register caches with a particular key. Same result, but a single cache file instead of hundreds. The cache handler could be responsible for maintaining the overall state of the run (standard and options used) and so could keep multiple caches if needed.
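The two ideas above (a file that loads cached messages when it has not changed, and one cache handler writing a single file) might look roughly like this. It is a Python sketch under assumptions, not the PHPCS implementation: the class name, the SHA-1 content hash and the JSON layout are all hypothetical.

```python
import hashlib
import json
import os


class ResultCache:
    """One JSON cache file holding per-file results, keyed by path."""

    def __init__(self, cache_path):
        self.cache_path = cache_path
        self.entries = {}
        if os.path.exists(cache_path):
            with open(cache_path) as fh:
                self.entries = json.load(fh)

    @staticmethod
    def file_hash(path):
        with open(path, "rb") as fh:
            return hashlib.sha1(fh.read()).hexdigest()

    def get(self, path):
        """Return cached messages, or None if the file is new or changed."""
        entry = self.entries.get(path)
        if entry and entry["hash"] == self.file_hash(path):
            return entry["messages"]
        return None

    def set(self, path, messages):
        self.entries[path] = {"hash": self.file_hash(path), "messages": messages}

    def save(self):
        with open(self.cache_path, "w") as fh:
            json.dump(self.entries, fh)


def process(path, cache, check):
    """Return (messages, from_cache); only run check() when needed."""
    cached = cache.get(path)
    if cached is not None:
        return cached, True
    messages = check(path)
    cache.set(path, messages)
    return messages, False
```

Here `check` stands in for the expensive tokenize-and-sniff step; a file read from STDIN would simply bypass `process` because it has no path to key on.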
I've put together a quick implementation of a hash-based cache based on my comment above. It's missing a lot of features, but I wanted to see if the implementation would work and what the results would be like. Obviously, a pretty big improvement in performance, for really minimal code.
This is the first run, with no cache file:
$ bin/phpcs --report=summary --standard=PHPCS src -lv
Registering sniffs in the PHP_CodeSniffer standard... DONE (55 sniffs registered)
Creating file list... DONE (5 files in queue)
Changing into directory .../PHP_CodeSniffer/src
Processing Config.php [PHP => 8287 tokens in 1156 lines]... DONE in 367ms (0 errors, 67 warnings)
Processing Fixer.php [PHP => 4533 tokens in 695 lines]... DONE in 185ms (1 errors, 27 warnings)
Processing Reporter.php [PHP => 2564 tokens in 386 lines]... DONE in 101ms (9 errors, 8 warnings)
Processing Ruleset.php [PHP => 7394 tokens in 997 lines]... DONE in 337ms (0 errors, 31 warnings)
Processing Runner.php [PHP => 3708 tokens in 525 lines]... DONE in 161ms (0 errors, 15 warnings)
PHP CODE SNIFFER REPORT SUMMARY
-------------------------------------------------------------------------------------
FILE ERRORS WARNINGS
-------------------------------------------------------------------------------------
.............................../PHP_CodeSniffer/src/Config.php 0 67
.............................../PHP_CodeSniffer/src/Fixer.php 1 27
.............................../PHP_CodeSniffer/src/Reporter.php 9 8
.............................../PHP_CodeSniffer/src/Ruleset.php 0 31
.............................../PHP_CodeSniffer/src/Runner.php 0 15
-------------------------------------------------------------------------------------
A TOTAL OF 10 ERRORS AND 148 WARNINGS WERE FOUND IN 5 FILES
-------------------------------------------------------------------------------------
Time: 1.25 secs; Memory: 20.25Mb
This is the second run, with the cache file in place:
$ bin/phpcs --report=summary --standard=PHPCS src -lv
Registering sniffs in the PHP_CodeSniffer standard... DONE (55 sniffs registered)
Creating file list... DONE (5 files in queue)
Changing into directory .../PHP_CodeSniffer/src
Processing Config.php [loaded from cache]... DONE in 0ms (0 errors, 67 warnings)
Processing Fixer.php [loaded from cache]... DONE in 0ms (1 errors, 27 warnings)
Processing Reporter.php [loaded from cache]... DONE in 0ms (9 errors, 8 warnings)
Processing Ruleset.php [loaded from cache]... DONE in 0ms (0 errors, 31 warnings)
Processing Runner.php [loaded from cache]... DONE in 0ms (0 errors, 15 warnings)
PHP CODE SNIFFER REPORT SUMMARY
-------------------------------------------------------------------------------------
FILE ERRORS WARNINGS
-------------------------------------------------------------------------------------
.............................../PHP_CodeSniffer/src/Config.php 0 67
.............................../PHP_CodeSniffer/src/Fixer.php 1 27
.............................../PHP_CodeSniffer/src/Reporter.php 9 8
.............................../PHP_CodeSniffer/src/Ruleset.php 0 31
.............................../PHP_CodeSniffer/src/Runner.php 0 15
-------------------------------------------------------------------------------------
A TOTAL OF 10 ERRORS AND 148 WARNINGS WERE FOUND IN 5 FILES
-------------------------------------------------------------------------------------
Time: 73ms; Memory: 8.5Mb
Running over the whole PHPCS dir is a difference between 22.13secs and 375ms, so this is good. But the cache file itself (which I'm json pretty printing at the moment) is 7.2M, which isn't that great. If I turn off pretty printing, it comes down to 1.5M, but is now unreadable, so a different format could even be chosen if it ends up being faster. Still, I like JSON.
I'll commit what I have after a bit more cleanup and we can take things from there.
I've pushed some commits for this. The main one is this https://github.com/squizlabs/PHP_CodeSniffer/commit/e5cc0abe93a111ab745a04d267c99de73292260e
But forgot unit testing, so committed these 2 fixes as well: https://github.com/squizlabs/PHP_CodeSniffer/commit/f558de564610b7c00e22589cbb136113833244b2 https://github.com/squizlabs/PHP_CodeSniffer/commit/1471a7833d0ed6035a89b5fae9e11d4afe148730
If you use the --cache command line argument, PHPCS will write a .phpcs.xxxxxxxxxxxx.cache file into the current directory (where xxxxxxxxxxxx is a hash representing the config of the run) and subsequent runs will use the data within if the file has not changed.
If you run over part of the code base in one run, and another part during another run, but use the same config, the same cache file will be used (it will just get bigger). Similarly, if you run over your entire code base and cache everything, you can then do another run limiting the files to check and the cache file will still be used.
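To illustrate why the paths being checked do not affect which cache file is used, here is a hedged Python sketch of deriving the file name from the run configuration alone. Which settings go into the hash, the use of SHA-1 and the 12-character truncation are assumptions made for the example; only the .phpcs.xxxxxxxxxxxx.cache pattern comes from the description above.

```python
import hashlib
import json


def cache_file_name(config):
    """Build a cache file name from the run config, ignoring checked paths."""
    # Sorted keys make equal configs serialise (and hash) identically.
    blob = json.dumps(config, sort_keys=True).encode("utf-8")
    digest = hashlib.sha1(blob).hexdigest()[:12]
    return ".phpcs.%s.cache" % digest
```

Two runs with the same standard and options map to the same file regardless of which files they cover, while changing the standard yields a different file.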
I haven't committed anything to do with filtering of the file list.
This is still pretty dirty, so I'd appreciate any testing that anyone can do, and ideas for how to make things better.
One of the decisions I had to make was where to put the cache files. I decided on the current working directory instead of the temp dir for 2 reasons (1 good reason, 1 stupid reason):
Number 1 could be worked around by including the current dir in the file hash, but you lose number 2 by doing that. I'm still not sure what the best place for these files is.
If you are running PHPCS over multiple code bases with the same config, it would end up with a massive cache file containing every file ever checked with that config. The file could get massive, and it will be loading file info that you don't need, so you'll blow the memory limit.
Agreed, but the current working directory might not be a better place, because:
- if php is run using an absolute path from the / folder, then / is where the cache will be saved
- if phpcs is run from a build server (e.g. Jenkins), then it sets the current directory to the project directory, which might not be writable at all
- if phpcs is executed by PhpStorm (on a per-file basis), it probably uses the temp folder as the current directory, but I haven't checked
If you really want to, you could always commit the cache file to a repo periodically so that all devs can benefit from the current cache.
Yes, they can, but since the code base changes all the time, the developers would need to run phpcs all the time and commit the cache file with every commit. And then we can have huge merge conflicts if 2 developers changed some code, which resulted in a non-pretty JSON change (one large line) that diff would probably fail to merge correctly.
Making the cache directory configurable (e.g. --cache <cache dir>, where the directory name is optional) would solve the problem of phpcs being invoked in different ways, by ensuring that it looks at the same cache file no matter how it is invoked. PhpStorm, however, has no idea about this new option, and we'll need to wait for the PhpStorm 10 release (the current release is 8, but 9 is in EAP state) for it to adopt that option.
Also the xxxxxxxxxxxx part of the cache filename needs to include the used sniff names (not just the standard name specified in the .phpcs file or via the --standard command line option).
Any of these changes should invalidate cache:
What I believe would be correct cache key detection is:
+1 for configurable cache dir
Making cache directory configurable (e.g. --cache <cache dir>, directory name is optional) would solve problem
I'm surprised I didn't include that option in my comment because I had it in my notes. Yes, this is also exactly what I was thinking, and for the exact reasons you've listed.
The real question though is if the file should be in the system temp dir instead. So my plan was to make the system temp dir the default file location but allow it to be changed using a CLI arg or config var. Sound ok?
Also the xxxxxxxxxxxx part of cache filename needs to include used sniff names (not just standard name specified in .phpcs file or via --standard command line option).
I can't detect that the PHP code inside a sniff file has changed. But I can hash the parsed ruleset object and include that in the main cache hash in case you are tweaking the ruleset.xml file.
I already include all relevant CLI and config arguments in the cache hash, and do the hashing just before the run is about to commence, so I think the only change required is to look at the parsed ruleset.
If I ever add the ability to change the ruleset used in each directory, life might get hard for the caching system. But I guess I can fix that when it happens.
The real question though is if the file should be in the system temp dir instead.
It could create a problem on a developer machine, because errors from all projects would end up in the same file (by default), and cache reading time for all projects could increase once a single large project on that machine is cached. But in my particular case I'm passing an absolute path to be scanned to phpcs, and therefore I'll end up with different caches per project, all stored in the temp dir, which is very good.
I can't detect that the PHP code inside a sniff file has changed.
Remembering the filesize of the sniff would be enough (faster than doing a CRC on it), since any significant change to the code would result in a file size change.
The real question though is if the file should be in the system temp dir instead.
I think if the default location is the temp dir, then part of the project or analyzed path(s) should be somehow present in the name too:
1) to avoid "merging" change together of different projects for the reasons given above
2) to help with manually deleting the cache (from my experience this is needed from time to time in every system using cache, when all other methods fail).
1) to avoid "merging" change together of different projects for the reasons given above
If only we could easily detect where the project root is. For example, in these cases the project folder obviously (to a human) is /Users/alex/Projects/project_a/:
phpcs --standard=/path/to/CustomStandard /Users/alex/Projects/project_a/the_file.php
phpcs --standard=/path/to/CustomStandard /Users/alex/Projects/project_a/sub_folder
But a computer can't really guess that. If only we could ensure some kind of marker (e.g. a .phpcs file in the project root), then we could easily do automatic cache splitting into different files for different projects.
By the way, is the .phpcs file something new (e.g. added in the 2.0 version) or was it there all the time?
Knowing the project root is the major problem. You can't just include the path you are checking in the hash or filename because then checking a sub-dir of the project will force a completely new cache to be used, even though the files themselves have already been checked.
But I really don't know how to determine the project root automatically.
Using the phpcs.xml file is one possible option. If you include that in the root of your project, PHPCS will find it when no standard is given and use it like a ruleset (it sets project defaults and works better in 3.0). The fact that it exists at a particular location means that it is sitting in the project root, or in a sub-project under the main project root (presumably with different rules). This would force the use of a phpcs.xml file for the best possible caching, but we'd still need sensible defaults.
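Treating the phpcs.xml file as a project-root marker amounts to walking up from the checked path until the file is found. The following Python sketch only illustrates that lookup; the function name is hypothetical and it says nothing about how PHPCS itself locates the file.

```python
import os


def find_project_root(start_dir, marker="phpcs.xml"):
    """Return the nearest ancestor of start_dir containing marker, else None."""
    current = os.path.abspath(start_dir)
    while True:
        if os.path.isfile(os.path.join(current, marker)):
            return current
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            return None
        current = parent
```

Because the nearest marker wins, a sub-project with its own phpcs.xml gets its own root, and a None result is the case where a sensible default (such as the temp dir) is still needed.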
By the way is the .phpcs file something new (e.g. added in 2.0 version) or it was there all the time?
I don't know what file you are talking about.
The file that can be used as a per-project PHP_CodeSniffer.conf file. I guess it's phpcs.xml then.
If you really want to, you could always commit the cache file to a repo periodically so that all devs can benefit from the current cache.
Yes, I'd like to have that option.
But I really don't know how to determine the project root automatically.
Using the phpcs.xml file is one possible option.
I think that would be a good default, falling back to the temp directory.
I can't detect that the PHP code inside a sniff file has changed.
Remembering filesize of the sniff would be enough (faster than doing crc on it), since any significant change to code would result in file size change.
I think this would be useful.
I can hash the parsed ruleset object and include that in the main cache hash in case you are tweaking the ruleset.xml file.
Then what happens if you add a rule to the ruleset? I'm guessing that PHPCS will run all of the rules over the files. Would it be possible to detect which rules have been added/changed/removed and only run them?
It might also be nice to have a command that will clean the cache, deleting all cache files that don't match the current configuration. But I'm not sure if that would be possible the way it's working currently. It seems to me like right now it will be subject to lots of cache bloat over time as the ruleset changes.
It might also be nice to have a command that will clean the cache, deleting all cache files that don't match the current configuration.
Yes, this way we can delete the cache without even knowing where it's located.
Then what happens if you add a rule to the ruleset? I'm guessing that PHPCS will run all of the rules over the files.
Yep. It would have to do them again.
Would it be possible to detect which rules have been added/changed/removed and only run them?
That would require a completely different setup for the run and some sort of merge code for the resulting checks. The same would be true if you ran PHPCS with a single sniff after running an entire standard. All the errors are there, so the file just needs to filter them based on the sniffs you have asked to filter with. It's possible, but much more complex code. I think we need to get the basics right first, but can then come back to this.
We've also spoken a lot about what happens when rulesets are changing, and sniff PHP code is changing, but this is not what the vast majority of developers are doing. They are running PHPCS over their changing codebase and not over a changing standard. The standard will get updated from time to time, but I think it is really important to not design a system that is painful and/or slow because we want to use caching while we are also tweaking standards.
A command to wipe the cache is a given. If a developer updates the coding standard (maybe they pull a new version) or if they update PHPCS itself, they will need to clear the cache. It would be nice if they didn't have to remember to do that, but it might be necessary. By looking at everything that gets loaded during the run (the autoloader keeps track of this) then we might be able to check if any piece of code has changed. I'll give it a try.
I can't detect that the PHP code inside a sniff file has changed
Apparently I can, and have committed that change as well. Now if any of the PHPCS core code changes, or if the loaded sniffs change, or if the code in the loaded sniffs changes, the cache is invalidated.
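How the commit detects code changes is not described in this thread. One plausible approach, sketched here in Python purely as an assumption, is to hash the contents of every code file loaded during the run and fold that digest into the cache key. The function name is hypothetical.

```python
import hashlib


def code_hash(loaded_files):
    """Digest of the paths and contents of all code files used by a run."""
    digest = hashlib.sha1()
    # Sort so the result does not depend on load order.
    for path in sorted(loaded_files):
        digest.update(path.encode("utf-8"))
        with open(path, "rb") as fh:
            digest.update(fh.read())
    return digest.hexdigest()
```

If the digest stored with the cache differs from the one computed at the start of a run, the cached results are discarded.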
That's great news.
Thank you for all your hard work on this @gsherwood!
Thanks for the idea. Not done yet though.
The things to address still are:
Possible solutions:
- Combination of temp dir, phpcs.xml file location and CLI option
I think that this combination of options would be good. I do have one concern, and that is, I sometimes have the phpcs.xml file symlinked from a different directory. In this case, I'd want the cache file to be stored in the directory the symlink is in, not the directory that it is being symlinked from. But I guess if it didn't work that way I could easily use the CLI option to do what I want.
I've committed a change that solves issues 2, 3, 4 and 5 above. The last thing I need to sort out is cache file storage and clearing. More info about what I ended up doing is in the commit message.
Cache files are now stored in the temp dir. See commit above for info.
I still need to add a new option to allow a directory to be specified instead of the system temp dir. If a directory is specified, I won't bother checking for common paths, or using the common path SHA1 in the cache file name, which makes things a little easier.
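The "common path SHA1 in the cache file name" idea could work along these lines. This Python sketch is an assumption about the general shape only: the function name and the file name layout are hypothetical, and the real details are in the commit message, not in this thread.

```python
import hashlib
import os


def cache_file_for(paths, config_hash, temp_dir):
    """Name a temp-dir cache file after the common path of the checked files."""
    common = os.path.commonpath([os.path.abspath(p) for p in paths])
    path_hash = hashlib.sha1(common.encode("utf-8")).hexdigest()
    return os.path.join(temp_dir, "phpcs.%s.%s.cache" % (path_hash, config_hash))
```

This keeps different projects in different files, but on its own it still has the problem raised earlier: checking only a sub-directory produces a different common path and therefore a different cache file.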
I think I'm going to leave out the option of setting your own cache directory or cache file location until after this feature gets used a bit. Making it more complex is probably not the right thing to do at this stage.
I think I'm going to leave out the option of setting your own cache directory or cache file location
I changed my mind on the cache file bit. You can now pass --cache=/path/to/cacheFile to have PHPCS use a specific file for caching. But if the standard changes, or your CLI options change and cause the cache to be invalid, the file will be replaced with the new cache data. When just using --cache you can swap between standards without any data being cleared and without having to specify different cache file locations.
This may become a non-issue if support is added for setting the cache file in a ruleset.xml file using a path relative to the ruleset itself.
First let me thank you for this great tool. :+1:
I've been using this on my PHP projects, and I've found that it can take a while to sniff the code, especially on larger projects with complex configurations. Performance will naturally be determined largely by how well the sniffs used are written. However, I think that performance could be increased by caching the hash signatures of the files being sniffed. Then only those files which have changed since the last sniff was conducted would need to be sniffed (there are some caveats which I'll get to in a moment). This wouldn't improve the performance of the initial sniffing (and might even degrade it slightly), but would drastically improve performance for later sniffings.
As I noted above, there are some caveats:
There are probably other things I haven't thought of, maybe regarding interactive mode, reports, or automatic fixing, all of which I am unfamiliar with. And there would probably need to be an easy way for the user to bypass the cache as needed.
There are probably also other things that could be cached between runs on a project as well.
Exactly how the cache is saved is up to you. I was thinking of a .phpcs-cache file in the root of the project being sniffed that would contain the cache represented as a JSON object.
If this is something that you think could be done, I'd be happy to work up a PR if you'll give me a little guidance on how you'd like this implemented.