pylint-dev / pylint

It's not just a linter that annoys you!
https://pylint.readthedocs.io/en/latest/

Checker plugin parallelization breaks algorithm #7263

Open smac89 opened 2 years ago

smac89 commented 2 years ago

Question

I am building a checker plugin which is supposed to use data from all the checked files to compose a report. However, I am seeing that whenever I run pylint, my plugin is run multiple times, and each time it uses incomplete data to do its checks.

My plugin already makes use of get_map_data and reduce_map_data to merge data between processes. Now I want to know how to avoid doing the actual checks (which are done in the close method) until the last minute. How do I save all data between plugin runs until the last one, so that only the last thread/run of the checker actually does the work required? Is this even possible? Is there a way to tell pylint that a checker should not be run in parallel? If not, what do you advise?

Documentation for future user

https://pylint.pycqa.org/en/latest/development_guide/how_tos/plugins.html

Additional context

No response

smac89 commented 2 years ago

My bad. I think I see my mistake now. The checks should be done in the reduce_map_data.

So that brings me to my next question:

I have now moved the algorithm that does the checks into a separate method called _run, which is called by reduce_map_data. In the close method, I check whether the linter was run with just one job (self.linter.config.jobs == 1), and if so I call _run from there.

Is this the correct way of doing that?
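
For reference, a minimal sketch of the pattern described above (class, message and attribute names are made up, and depending on the pylint version the checker may also need to inherit pylint.checkers.MapReduceMixin):

# Hypothetical checker: the report is produced in reduce_map_data(),
# with a fallback in close() for single-job runs.
from pylint.checkers import BaseChecker


class CrossFileChecker(BaseChecker):
    name = "cross-file-checker"
    msgs = {
        "W9901": ("Cross-file issue: %s", "cross-file-issue", "Illustrative message."),
    }

    def __init__(self, linter):
        super().__init__(linter)
        self._data = []  # per-process data accumulated while visiting files

    def visit_module(self, node):
        # Record whatever per-file facts the final report needs.
        self._data.append(node.name)

    def get_map_data(self):
        # Called on each worker: hand the per-process data back to the parent.
        return self._data

    def reduce_map_data(self, linter, data):
        # Called once in the parent with every worker's data: do the real check here.
        self._run([item for chunk in data for item in chunk])

    def close(self):
        # With a single job reduce_map_data() is never called, so run the check here.
        if self.linter.config.jobs == 1:
            self._run(self._data)

    def _run(self, data):
        ...  # the actual cross-file analysis, emitting messages via self.add_message()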

Pierre-Sassoulas commented 2 years ago

Thank you for opening the issue. As far as I know you're navigating uncharted territory here. Some pylint checkers should use this but currently don't, or have issues (see the multiprocessing label)! So I welcome your insight on a more intuitive design. It seems that the close function should handle the map/reduce generically, if I understand right? The base class could be extended if required.

(Pinging @doublethefish as they designed and implemented this).

doublethefish commented 2 years ago

The map/reduce mixin (pylint.checkers.mapreduce_checker.MapReduceMixin) itself isn't designed to distribute computation (pylint already tries to do that via multiprocessing in lint.parallel.check_parallel); it just helps collate data about those runs. Really it's a workaround for some deeper underlying issues in pylint's multi-job processing.

The multi-job aspect of pylint itself isn't really fit for purpose, mainly because it doesn't make runs faster (!). IIRC this is because astroid computes a full ast-graph on each thread before doing the work, so the actual compute-work by the checkers gets only a marginal gain in a multithreaded runtime (with caveats). Also, the whole parallelized code path is so different from the single-threaded code path that we get differences in the reported stats. As such, I strongly recommend that no one uses -j N or bothers with the map-reduce functionality until the deeper issues are solved - with all that said, I haven't profiled it this year and things may have changed.

One of the problems with the map/reduce functionality is that it also relies on the checker collating its stats into the calling linter rather than relying on the normal APIs for doing so.

The map/reduce functionality is designed to work with the current parallelized system, that is:

  1. N instances of a checker are run (by lint.parallel.check_parallel).
  2. Each checker gets tasked with looking at num-files/N files.
  3. Each instance then records data about its run (map).
  4. The data is collated post-run (reduce).
  5. An attempt is made to collate stats post-run (sketched below).
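
As a rough illustration of that flow (heavily simplified: the names below are made up, and the real dispatch lives in pylint.lint.parallel.check_parallel and uses a multiprocessing pool rather than this loop):

# Simplified sketch of the flow described above -- not the actual check_parallel() code;
# make_checker() and checker.process() are placeholders.
def run_map_reduce(linter, files, n_jobs, make_checker):
    chunks = [files[i::n_jobs] for i in range(n_jobs)]   # step 2: num-files/N per worker
    map_data = []
    for chunk in chunks:                                 # steps 1-3: really worker processes
        checker = make_checker()                         # one checker instance per worker
        for path in chunk:
            checker.process(path)                        # per-file work (placeholder)
        map_data.append(checker.get_map_data())          # record data about the run (map)
    make_checker().reduce_map_data(linter, map_data)     # step 4: collate post-run (reduce)
    # step 5: stats collation would also happen here, in the parent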

hth

Pierre-Sassoulas commented 2 years ago

because astroid computes a full ast-graph on each thread before doing the work, so the actual compute-work by the checkers gets only a marginal gain in a multi-threaded runtime (with caveats).

That's very insightful, thank you. Somehow I did not realize we had so much to do before this comment, even though I realized a few days ago in #7220 that the file filtering was not done efficiently. So thanks, and to sum up, before having "real" multiprocessing we need to:

  1. Filter the files we're going to run on (https://github.com/PyCQA/pylint/pull/7220#discussion_r928974356)
  2. Compute the AST first
  3. Actually parallelize the AST creation, because this is most likely one of the bottlenecks (!!)

I agree with @doublethefish that we should fix at least the first two points before trying anything else (the third point would happen in astroid, and it's not going to be a small task), and that multiprocessing is unusable, or at least not very useful, until then.

doublethefish commented 2 years ago

My view remains that lint.parallel.check_parallel should be fully incorporated into pylint.lint.pylinter.PyLinter.check and not just invoked by it. A better candidate may be pylint.lint.pylinter.PyLinter._check_files because that is where we know what work needs to be done.

That is, the for loops in those functions should make adding a basic multiprocess divide-and-conquer pretty painless, as long as the stats and data generated can be gathered at the end.

The major benefit of this is that you have one way to do all the work, instead of two that drift - and therefore have a more stable product.
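
A very rough sketch of the kind of divide-and-conquer being suggested (illustrative only: lint_one_file and merge_results are placeholders, not pylint's actual API):

# Illustrative divide-and-conquer over files, gathering everything at the end.
# lint_one_file() and merge_results() are placeholders, not real pylint functions.
import multiprocessing


def lint_one_file(path):
    # Would run all checkers on a single file and return (messages, stats, map_data).
    ...


def merge_results(results):
    # Would merge per-file messages/stats back into the parent linter's reports.
    ...


def check_files(paths, jobs):
    if jobs <= 1:
        results = [lint_one_file(p) for p in paths]
    else:
        with multiprocessing.Pool(jobs) as pool:
            results = list(pool.imap_unordered(lint_one_file, paths))
    return merge_results(results)  # gather stats/messages in the parent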

smac89 commented 2 years ago

The multi-job aspect of pylint itself isn't really fit for purpose, mainly because it doesn't make runs faster (!). IIRC this is because astroid computes a full ast-graph on each thread before doing the work, so the actual compute-work by the checkers gets only a marginal gain in a multithreaded runtime (with caveats). Also, the whole parallelized code path is so different from the single-threaded code path that we get differences in the reported stats. As such, I strongly recommend that no one uses -j N or bothers with the map-reduce functionality until the deeper issues are solved - with all that said, I haven't profiled it this year and things may have changed.

I feel like this should be emphasized a lot more. Pylint has had the -j option for a while now, but the docs do not do a good job of mentioning all the cons of using it, leaving users to assume it is all pros with some minor cons. Specifically, this part of the documentation needs to be updated to mention more of what @doublethefish said in their comment.

Speaking of documentation, regarding the part where it mentions that the --load-plugins option does not work when running pylint with multiple jobs: I find this to be untrue with pylint 2.14+.

There are some limitations in running checks in parallel in the current implementation. It is not possible to use custom plugins (i.e. --load-plugins option), nor it is not possible to use initialization hooks (i.e. the --init-hook option).

I'm able to run pylint with jobs=0 while still being able to test a custom plugin loaded via --load-plugins. In fact, this is the only way I was able to discover the multiprocessing behaviour that I asked about.

doublethefish commented 2 years ago

the documentation needs to be updated

Maybe. Although I'm not sure the work has been done to demonstrate/understand the ins and outs here to document it properly. For example, I am not sure when the performance of -j N decreases but I am moderately confident that it is correlated to things like code-base size and available memory. That said, I also suspect that the difference between -j 1 and -j 32 is negligible in all cases.

That is why I've mentioned before that -j should probably be disabled entirely, not removed, making -j 2+ always follow the -j 1 code path. At least until the problem is fully understood and described. In this case, it could just be removed from the documentation.

Personally, I stopped using -j <num-cores> when I introduced profiling into CI/CD steps, as I realised it was costing me too much money+time, and I have fantasized about implementing a better multi-proc solution since, but I don't understand how the stats are collated nor how astroid works in a pylint env.

DanielNoord commented 2 years ago

@doublethefish Your previous comments triggered some thinking about this issue for me.

Can we split the ast creation and ast checking into separate functions and only serialise the second part in pylint? The biggest performance benefits would definitely be in parallelising ast creation, but that should be handled by astroid instead of the current (dysfunctional) pylint approach.

This would greatly reduce the performance benefit the jobs option can currently provide, but it does allow fixing all issues related to it in the package that we actually control in this repository.

We should then create an astroid issue for a "parallelised ast creation API".
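
A sketch of the split being described, with placeholder names (build_ast and run_checkers are not existing pylint functions):

def build_ast(path):
    ...  # placeholder: would return the astroid module node for *path*


def run_checkers(path, tree):
    ...  # placeholder: would run all enabled checkers over *tree*


def check(paths):
    # Phase 1: build every AST up front (the part astroid could eventually parallelise).
    asts = {path: build_ast(path) for path in paths}
    # Phase 2: run the checkers over the prebuilt ASTs, inside pylint.
    for path, tree in asts.items():
        run_checkers(path, tree)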

doublethefish commented 2 years ago

I imagine there would be several stages of attack, in order:

  1. gen the ast on main and share with sub-procs.
  2. multi-proc/thread the ast-generation.
  3. cache the ast (ala mypy & pytest).
  4. incrementally update the ast cache.

1 has the fastest route-to-success.

2-4 could be transparent to pylint itself. If we're not careful the gains here would be minimal, especially in CI/CD situations with a cold cache.

1 and 2 could be worked on asynchronously.

Without wanting to sound too grandiose: with all four, given the widespread use of pylint, there would be a significant impact on the energy used by global Python CI/CD processes.

smac89 commented 2 years ago

I should also add that, during my attempts to make multiprocessing work, any attempt to return an astroid node from get_map_data fails (during the pickling stage), because pylint tries to pickle the data returned by each process, and I don't think astroid nodes are picklable.

So if you plan on sharing the ast with the various processes, it has to be in a form that makes sharing possible. This is what I had to do.
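
The original snippet is not shown above; purely as an illustration, flattening nodes into plain picklable values inside get_map_data could look something like this (the attributes kept here are arbitrary examples):

# Illustrative only: return plain strings/ints from get_map_data(), never astroid nodes.
# self._collected_nodes is a hypothetical attribute holding nodes gathered during visits.
def get_map_data(self):
    return [
        (node.root().file, node.lineno, node.as_string())
        for node in self._collected_nodes
    ]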

Pierre-Sassoulas commented 2 years ago

-j should probably be disabled entirely, not removed, making -j 2+ always follow the -j 1 code path.

Agreed - we'd remove a whole class of errors until pylint is actually multi-process capable, and this is the current situation anyway (#2525). Maybe we could add a user warning to make it obvious.

Without wanting to sound too grandiose: with all four, given the widespread use of pylint, there would be a significant impact on the energy used by global Python CI/CD processes.

This cannot be overstated, making pylint more efficient is going to yield huge power savings worldwide.

Also, we have a lot more documented tooling around performance now than when you started working on this, Franck - see https://pylint.pycqa.org/en/latest/development_guide/contributor_guide/profiling.html

due to the fact that pylint tries to pickle the data that is returned by each process, but I don't think astroid nodes are picklable.

We have a dependency on dill in pylint to serialize data; we could add it to astroid at almost no cost... Or we could think about another solution and remove dill from pylint, because as far as I know it's only used by the current multiprocessing code to serialize the configuration - @DanielNoord? More dependencies mean more problems: a new version of dill is required to support Python 3.11 and we've been waiting for a long time. We also had a massive problem because colorama and dill were not compatible for an uncomfortably long time.

DanielNoord commented 2 years ago

The dependency on dill is needed because we (currently) need to serialise the PyLinter object, which isn't serialisable by default by pickle.

I agree that the waiting has been quite long, but if we really need to I can create a fork and make a release with the fix included. Due to the summer period I haven't really had the time to work on pylint and 3.11 anyway.

  1. gen the ast on main and share with sub-procs.
  2. multi-proc/thread the ast-generation.
  3. cache the ast (ala mypy & pytest).
  4. incrementally update the ast cache.

I would prefer if we could focus on 1. The other three options should be handled by astroid: that's the package which creates the ast and should therefore also be in charge of (incrementally) updating it and (in)validating caches. pylint should only concern itself with parallelising the "linting", in my opinion.

DanielNoord commented 2 years ago

I have done some digging into how this is currently set up and compared it against flake8's implementation. However, after thinking about this some more I think the current design of pylint makes it very difficult to improve upon this.

Since we are reusing the PyLinter class everywhere, it is almost unavoidable that we're using the same memory somewhere. I think that is the most plausible explanation as to why we see an improvement for -j2 but a decrease for -j4, -j5, etc. flake8 has a pretty effective approach to multiprocessing: it builds an iterable of FileCheckers, which are self-contained instances of a class that can completely lint a file and then return a result to imap_unordered.

I thought about doing some refactoring to mimic this, but I don't think that is really feasible. Even if we created a FileLinter class that is responsible for storing messages and checking whether they are enabled, we would still rely on linter for all checkers. Their __init__ signature is Callable[[PyLinter], None]. Refactoring this would be a major breaking change and would likely break any current plugin. Basically, for multiprocessing to work you need a self-contained iterable of data to which you can apply a function. Due to pylint's design and its reuse of the PyLinter class, I don't think we can create a self-contained data type without any references to some shared state or data.
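
For comparison, the flake8-style shape being described is roughly the following (the names are illustrative, not flake8's actual internals): each worker receives a self-contained, picklable object and returns plain results, with no shared linter.

# Rough sketch of the self-contained-per-file pattern described above.
# FileLinter is hypothetical; pylint's checkers cannot currently be built this way
# because their __init__ signature is Callable[[PyLinter], None].
import multiprocessing


class FileLinter:
    def __init__(self, path, config):
        self.path = path
        self.config = config  # plain, picklable configuration only

    def run(self):
        # Lint self.path in isolation and return plain, picklable messages.
        return []


def _run_one(file_linter):
    return file_linter.run()


def run_parallel(paths, config, jobs):
    linters = [FileLinter(p, config) for p in paths]
    with multiprocessing.Pool(jobs) as pool:
        results = list(pool.imap_unordered(_run_one, linters))
    return [msg for messages in results for msg in messages]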

doublethefish commented 2 years ago

Thanks for sharing the fascinating differences between flake8 and pylint. Really interesting.

we still rely on linter for all checkers

Yes, I noticed. I don't quite understand why that is yet, but I would love to (?).

Basically for multiprocessing to work you need a self-contained iterable of data to which you can apply a function. Due to pylint's design and reusing of the PyLinter class I don't think we can create a self-contained data type without any references to some shared states or data.

Agreed.

The map-reduce design I put in place is an optional mixin that gives us the "self-contained iterable of data to which you can apply a function". So that's a partial solve, on part of the data. IIRC the mixin is at the Checker level and the missing piece is handling the messages and the stats at the PyLinter level.

I don't think we can create a self-contained data type without any references to some shared states or data [in the shared linter]

The remaining problems stem from this I feel. In previous posts I've intimated that this might be an API problem in pylint, but that can also be viewed as a data-access problem (as you say), same thing, different perspective.

There are several ways to approach this:

  1. Keep the linter object as is but change the underlying data so that it too can somehow be returned by imap_unordered and accessed/recombined/reduced/mapped (this might already be possible?).
  2. Change the Callable[[PyLinter], None] type to something like Callable[[LinterDataAPI], None], where LinterDataAPI just contains the minimal API to support all existing Checkers, and then make LinterDataAPI the multithreaded/multiprocess/asyncio object that is responsible for collating checker results (stats/messages). If pylint were fully mypy-checked, this would be trivial, but time-consuming, work. The type could be something as simple as a PyLinter shim wrapping inter-proc/thread comms (see the sketch after this list).
  3. Consider that multiprocessing is possibly the wrong approach? Given that pylint is currently IO-bound, it is likely - and I don't know for sure because I haven't yet looked at the shared-data access points - that moving to a threaded approach would reap huge rewards at the cost of a few thread-sync points in the code. This would probably be harder if astroid isn't thread-safe. However, identifying the areas of code that would need these thread-sync primitives then gives you the API surface area, and therefore you are doing a variant of option 2, so perhaps option 2 is less risky/faster overall.
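
A sketch of what option 2 could look like (LinterDataAPI, PyLinterShim and their methods are hypothetical, chosen only to show the shape of a minimal shim, with simplified signatures):

# Hypothetical minimal API surface for checkers, replacing the full PyLinter.
# None of these names exist in pylint today; the signatures are simplified.
from typing import Optional, Protocol


class LinterDataAPI(Protocol):
    def add_message(self, msgid: str, line: int, args: object = None) -> None: ...
    def is_message_enabled(self, msgid: str, line: Optional[int] = None) -> bool: ...
    # ...only what existing checkers actually need, nothing more.


class PyLinterShim:
    """Wraps inter-process/thread communication behind the same small API."""

    def __init__(self, queue):
        self._queue = queue  # e.g. a multiprocessing.Queue back to the parent

    def add_message(self, msgid, line, args=None):
        self._queue.put(("message", msgid, line, args))

    def is_message_enabled(self, msgid, line=None):
        return True  # a real shim would consult the shared configuration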

DanielNoord commented 2 years ago

Yes, I noticed. I don't understand quite why that is yet, but I would love to (?).

See below.

The map-reduce design I put in place is an optional mixin that gives us the "self-contained iterable of data to which you can apply a function". So that's a partial solve, on part of the data. IIRC the mixin is at the Checker level and the missing piece is handling the messages and the stats at the PyLinter level.

Yeah, the only issue is that the current design makes the Checker level not really sufficient for splitting the work into processes 😅

The remaining problems stem from this I feel. In previous posts I've intimated that this might be an API problem in pylint, but that can also be viewed as a data-access problem (as you say), same thing, different perspective.

Yeah, I think this is ultimately an API problem.

  3. Consider that multiprocessing is possibly the wrong approach? Given that pylint is currently IO-bound, it is likely - and I don't know for sure because I haven't yet looked at the shared-data access points - that moving to a threaded approach would reap huge rewards at the cost of a few thread-sync points in the code. This would probably be harder if astroid isn't thread-safe. However, identifying the areas of code that would need these thread-sync primitives then gives you the API surface area, and therefore you are doing a variant of option 2, so perhaps option 2 is less risky/faster overall.

I have no experience here, but I am indeed not sure how feasible this is.

  2. Change the Callable[[PyLinter], None] type to something like Callable[[LinterDataAPI], None], where LinterDataAPI just contains the minimal API to support all existing Checkers, and then make LinterDataAPI the multithreaded/multiprocess/asyncio object that is responsible for collating checker results (stats/messages). If pylint were fully mypy-checked, this would be trivial, but time-consuming, work. The type could be something as simple as a PyLinter shim wrapping inter-proc/thread comms.

This would probably be the best solution, but I don't think we can do it in a backwards compatible way. PyLinter is currently responsible for:

At least the first 5 need to be accessed by the checkers as well. This is why PyLinter is in their __init__. I thought about passing the config object to them, but there are still many other things we need to extract from PyLinter in order for a checker to be a standalone class without any direct reference to the PyLinter class. Many of the methods we would need to extract are also considered public so we can't actually remove them without first deprecating them, which requires us to maintain two systems at the same time.

doublethefish commented 2 years ago

Yeah, the only issue is that the current design makes the Checker level not really sufficient for splitting the work into processes 😅

Going back a few steps, this is why I asked if perhaps splitting the work into processes after the linter has been set up and configured (in check_files) might be better, instead of splitting the work before the linter has been constructed (that is IIRC - please forgive my poor memory here, I don't have time to open the code right now).

It all sounds quite complex when put in your list. I remember that Pierre did a pretty good job of splitting the code into core responsibilities (because I had to resolve a nasty merge conflict in the map-reduce stuff 😭) and it seems some more of that would be helpful. It also seems like there are several major areas of responsibility at the linter level - start-up, run, report - but they're a bit intermingled at the moment. To map it to something completely different, game engines: the Unreal Engine divides its work into several discrete areas (game-logic, render-logic, shader-logic, physics & animation) - it works so well because it divides and conquers those concerns, with neat channels of work hand-off. Unity is more of a component/task system and sounds more akin to what flake8 does, in that you have many, many more discrete tasks that can sort of work in isolation. It seems that pylint needs to be a bit more like Unreal, at least in the first instance.

Many of the methods we would need to extract are also considered public so we can't actually remove them without first deprecating them, which requires us to maintain two systems at the same time.

We already have two systems. Making that explicit would help clarify the system and likely reduce bugs.

It is likely that there is a nice and tidy migration path to the next Major version.

DanielNoord commented 2 years ago

Going back a few steps, this is why I asked if perhaps splitting the work into processes after the linter has been set up and configured (in check_files) might be better, instead of splitting the work before the linter has been constructed (that is IIRC - please forgive my poor memory here, I don't have time to open the code right now).

Yeah, that would be best. However, the way checkers are currently designed, we need access to the linter while running any check. So we construct a linter and then copy it for the number of processes we spin up. I feel like this is the core of our problem: we are passing around a pretty large class which has references to almost all other classes pylint instantiates, through the linter attribute of those classes. (For example, the configuration parser also has a linter attribute.)

Ideally, a checker would have an __init__ of Callable[[ConfigObject, DisableChecker, GeneralMessages, GeneralConfig, ...], None]. So rather than having to look stuff up dynamically on the linter object, we would just pass new objects without any reference to their source. However, that would be a major breaking change.
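
In code, the kind of constructor being described might look like this (all of these classes are hypothetical, shown only to contrast with the current Callable[[PyLinter], None] signature):

# Hypothetical decoupled checker constructor - none of these classes exist in pylint.
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneralConfig:
    jobs: int = 1
    max_line_length: int = 100


@dataclass(frozen=True)
class GeneralMessages:
    enabled_ids: tuple = ()


class SomeChecker:
    def __init__(self, config: GeneralConfig, messages: GeneralMessages) -> None:
        # No reference back to PyLinter: everything needed is passed in by value,
        # so the instance could be pickled and shipped to a worker process.
        self._config = config
        self._messages = messages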

It all sounds quite complex when put in your list. I remember that Pierre did a pretty good job of splitting the code into core responsibilities (because I had to resolve a nasty merge conflict in the map-reduce stuff 😭) and it seems some more of that would be helpful. It also seems like there are several major areas of responsibility at the linter level - start-up, run, report - but they're a bit intermingled at the moment. To map it to something completely different, game engines: the Unreal Engine divides its work into several discrete areas (game-logic, render-logic, shader-logic, physics & animation) - it works so well because it divides and conquers those concerns, with neat channels of work hand-off. Unity is more of a component/task system and sounds more akin to what flake8 does, in that you have many, many more discrete tasks that can sort of work in isolation. It seems that pylint needs to be a bit more like Unreal, at least in the first instance.

I have actually been working on this as well, by creating a lot more base classes for PyLinter. For example, it is now an _ArgumentsManager and a _MessageStateHandler. The issue, though, is that many of the methods defined on those base classes are used throughout pylint in many different places. For example, we currently can't pass Callable[[_MessageStateHandler, _DisableStateHandler], None] as we would miss out on crucial other methods of PyLinter.

What flake8 seems to do is: let the checker emit the message and then let the Manager handle actually showing the messages, based on the configuration, in the clean-up phase. In pylint it is not the Manager or PyLinter that is responsible, but the checker. However, the checker can only determine this by accessing methods and data that are actually stored on the PyLinter instance. So rather than returning a sequence of messages to potentially emit and then letting a clean-up phase deal with actually emitting them, in pylint we need constant access to all global state to determine whether messages need to be emitted or not. This makes it very difficult to decouple PyLinter from the checker classes.

Furthermore, consider a checker that emits test-message from a very performance-intensive check. Currently we expose methods on checker.linter, such as is_message_enabled, to check whether the message is active. This can be used to skip a code path to improve performance (but it comes with a lot of issues around useless-suppression... 😅). However, it is also another dependency on linter. We could obviously pass the data that is_message_enabled uses in the __init__ as well, but that would require all plugin writers to change their code.
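
For context, the skip pattern being referred to looks roughly like this inside a checker (is_message_enabled is the real PyLinter method in question; test-message and the expensive analysis are made up):

# The expensive-check skip pattern described above; the analysis itself is hypothetical.
def visit_functiondef(self, node):
    if not self.linter.is_message_enabled("test-message"):
        return  # skip the expensive work entirely - yet another dependency on linter
    if self._very_expensive_analysis(node):
        self.add_message("test-message", node=node)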

Lastly, we even support changing checker.linter.config and having any changes there also affect other checkers, as they all use the same config object. If we were to pass a ConfigObject to each checker and then split them across processes, we would no longer be able to do this.

Pierre-Sassoulas commented 2 years ago

I agree with @doublethefish that we need to make PyLinter more modular. I think we can keep the interface of PyLinter by simply delegating to the more modular classes; inside pylint we would then use the underlying modular class that is lighter and faster to construct. For example, a first actionable step would be to create a MainChecker that is really only a checker, separated from the PyLinter, so we can add messages by providing a checker with a light constructor instead of the PyLinter. (In fact, each bullet point of @DanielNoord's list could be a refactor in itself, but imo the MainChecker is an important one.)

DanielNoord commented 2 years ago

I think the separation of MainChecker is the only thing that might be possible. All other potential modular classes would include public functions that should still remain on PyLinter in order not to break compatibility, which defeats the point of the modularity in terms of making PyLinter more lightweight.

Pierre-Sassoulas commented 2 years ago

All other potential modular classes would include public functions that should still remain on PyLinter in order not to break compatibility, which defeats the point of the modularity in terms of making PyLinter more lightweight.

But can't we use the modular classes in threads instead of the PyLinter, and then keep the public API as it is right now in PyLinter by using composition and calling the modular class directly? I.e.


import warnings


class PyLinter:
    def __init__(self, main_checker: MainChecker):
        self._main_checker = main_checker

    def add_message(self, *args, **kwargs):
        warnings.warn("in pylint 3.0...")
        return self._main_checker.add_message(*args, **kwargs)

DanielNoord commented 2 years ago

That wouldn't have any effect on the size of the object to be pickled - if anything it could make things worse, due to differences in how attributes are pickled compared to parent classes. As long as there is a reference to some instantiated class on PyLinter, it will be pickled.

Pierre-Sassoulas commented 2 years ago

But isn't the point of this also to make it possible for the modular classes to work in isolation for multithreading? (i.e. the size of the pickled data isn't the only concern)

DanielNoord commented 2 years ago

But isn't the point of this also to make it possible for the modular classes to work in isolation for multithreading? (i.e. the size of the pickled data isn't the only concern)

It's not so much about the size of the data as about the amount of shared memory, which I think will be an issue for multithreading as well. As long as we are able to do everything at the same time (checking a module, processing disable pragmas, emitting messages, defining new messages, etc.) we need a lot of memory to be available to all processes. I believe that is currently the major blocker for any successful parallelisation attempt. The requirement of shared memory translates into a larger picklable object, but it also affects other approaches, I think. I don't think the size of the picklable object is the only problem for multiprocessing currently.

Pierre-Sassoulas commented 2 years ago

The way I understand it, what needs to be done is:

What do you think?

DanielNoord commented 2 years ago

I think -j 2 should still be allowed. Even -j 3 shows benefit on my own machine. 4 seems to be when the issues really start.

In terms of refactoring I'm not sure. I have tried numerous things today and none of them seemed to work. I'm certainly no expert on multiprocessing and it doesn't help that many profiling tools struggle with it, but we would need a definitive answer about what potentially works and what is the main problem right now.