
Concurrency turning out useless on codebase & machine #2525

xmo-odoo opened this issue 5 years ago (status: Open)

xmo-odoo commented 5 years ago

This is on a codebase with 260 kLOC across ~3600 files (Python-only, according to tokei), on a 2010 MBP (2 cores, 2 HT) running OS X 10.11, under Python 3.6.6 from MacPorts.

Using -j with a number of jobs other than 1 significantly increases CPU consumption (~90%/core), but yields no improvement in wall-clock time:

> pylint -j1 *
pylint -v -j1 * 1144.10s user 44.51s system 96% cpu 20:36.81 total
> pylint -j2 *
pylint -j2 * 2386.66s user 117.09s system 184% cpu 22:37.15 total
> pylint -j4 *
pylint -j4 * 3897.49s user 161.62s system 340% cpu 19:50.96 total
> pylint -j0 *
pylint -j * 3850.79s user 155.45s system 341% cpu 19:31.81 total

Not sure what other information to provide.

PCManticore commented 5 years ago

Wow, that's incredible, thanks for reporting an issue. I wonder if the overhead of pickling the results back from the workers is too big at this point; we'll have to switch our approach for the parallel runner if that's the case.
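
To make that suspicion concrete, here is a minimal hypothetical sketch (not pylint's actual parallel runner; every name in it is made up for illustration): when each worker returns a large per-file result, the pickle round trip across the process boundary can eat much of the parallel gain.

import multiprocessing
import pickle
import time

def lint_one(path):
    # Stand-in for checking one file: pretend it yields many messages.
    return [(path, lineno, "C0301", "line-too-long", "some message text")
            for lineno in range(2000)]

if __name__ == "__main__":
    files = [f"file_{i}.py" for i in range(200)]

    start = time.perf_counter()
    with multiprocessing.Pool(4) as pool:
        results = pool.map(lint_one, files)
    print(f"pool.map, 4 workers: {time.perf_counter() - start:.2f}s")

    # Rough measure of the serialization cost alone: every result above
    # crossed the process boundary as a pickle.
    start = time.perf_counter()
    for result in results:
        pickle.loads(pickle.dumps(result))
    print(f"pickle round trip of results: {time.perf_counter() - start:.2f}s")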

belm0 commented 5 years ago

I'm observing this too (OS X, i7 processor). --jobs merely multiplies the CPU time, with negligible effect on wall time.


$ pylint --version
pylint 2.1.1

$ time pylint my_package/
real    1m27.865s
user    1m25.645s
sys 0m1.996s

$ time pylint --jobs 4 my_package/
real    1m17.986s
user    4m14.076s
sys 0m12.917s
Tenzer commented 4 years ago

I found Pylint got a lot faster by turning off concurrency (jobs=1) compared to making it as concurrent as possible (jobs=0). Execution time sped up by 2.5-3 times across a number of projects of different code sizes.

A concrete project has 134k LOC across 1662 Python files, and a Pylint run across all the files dropped from 3m 33s to 1m 30s on average on a dual-core MBP (with HT). CPU utilisation also dropped to less than half, according to CPU time.

I wonder if there are any cases where running Pylint concurrently actually helps, or if it would be better to disable the feature for now?

owillebo commented 3 years ago

Some results on Windows 10 2004 (Intel i7-9850H, 6 cores/12 threads, 32-bit Python), pylinting matplotlib. Interestingly, roughly half the available threads gives the fastest result. Results are wall-clock duration in seconds.

pylint --version

pylint 2.6.0
astroid 2.4.2
Python 3.7.9 (tags/v3.7.9:13c94747c7, Aug 17 2020, 18:01:55) [MSC v.1900 32 bit (Intel)]

cloc matplotlib

 360 text files.
 352 unique files.
 154 files ignored.

github.com/AlDanial/cloc v 1.86  T=1.09 s (226.7 files/s, 144433.3 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                         221          25052          39919          85792

Running pylint with the default configuration:

pylint -j12 matplotlib  1>NUL
71.7007919
pylint -j11 matplotlib  1>NUL
71.2478186
pylint -j10 matplotlib  1>NUL
69.9589435
pylint -j9 matplotlib  1>NUL
69.8973282
pylint -j8 matplotlib  1>NUL
66.9836301
pylint -j7 matplotlib  1>NUL
67.7956229
pylint -j6 matplotlib  1>NUL
65.0402625
pylint -j5 matplotlib  1>NUL
67.2663403
pylint -j4 matplotlib  1>NUL
73.0464569
pylint -j3 matplotlib  1>NUL
88.9432869
pylint -j2 matplotlib  1>NUL
120.4550162
pylint -j1 matplotlib  1>NUL
238.5004808
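
For anyone who wants to repeat this kind of sweep, a small Python sketch (assuming, as above, that pylint is on PATH and matplotlib is the directory being linted):

import subprocess
import time

for jobs in range(12, 0, -1):
    start = time.perf_counter()
    subprocess.run(["pylint", f"-j{jobs}", "matplotlib"],
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    print(f"-j{jobs}: {time.perf_counter() - start:.1f} s wall clock")
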
Pierre-Sassoulas commented 3 years ago

@owillebo thanks for the data, I think intuitively it makes sense that the optimal is 6 threads on a 6-core machine. Apparently this bug is not affecting you.

owillebo commented 3 years ago

Thanks,

I think utilizing all 12 available threads and thereby halving Pylint's run time would be a good thing; burning threads for no speed-up is a waste of time and resources. And I think this bug is affecting more people than just me (which is indeed of less importance).


xmo-odoo commented 3 years ago

> @owillebo thanks for the data, I think intuitively it makes sense that the optimal is 6 threads on a 6-core machine. Apparently this bug is not affecting you.

Indeed, hyperthreading can lead to better use of the underlying hardware, but if there are no significant stalls (or both hyperthreads are stalled in similar ways) and all the threads are competing for the same underlying units, the hyperthreads are just going to use the same resources sequentially.

And that "can" is so conditional that, given the security issues of their implementation, Intel is actually moving away from HT: the 9th gen only uses HT at the very high (i9) and very low (Celeron) ends; none of the 9th-gen i3, i5, and i7 parts support hyperthreading.

owillebo commented 3 years ago

If I run two pylint sessions concurrently, each with 6 jobs and half of the matplotlib files, the wall-clock duration drops from 65 seconds (for all files in one session) down to 60 seconds. So indeed my extra threads don't bring much.
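
A rough sketch of that manual two-session split (hypothetical paths; assumes pylint is on PATH):

import glob
import subprocess

# Shard the file list in two and lint the halves side by side, 6 jobs each.
files = sorted(glob.glob("matplotlib/**/*.py", recursive=True))
procs = [subprocess.Popen(["pylint", "-j6", *shard],
                          stdout=subprocess.DEVNULL)
         for shard in (files[::2], files[1::2])]
for proc in procs:
    proc.wait()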


DanielNoord commented 2 years ago

https://github.com/PyCQA/pylint/issues/6978#issuecomment-1159559260 is quite an interesting result.

xmo-odoo commented 2 years ago

> #6978 (comment) is quite an interesting result.

That is true, but I think it's a different issue: in the original post the CPU% does grow pretty linearly with the number of workers, which indicates that the core issue isn't a startup stall (#6978 shows clear dips in CPU usage).

xmo-odoo commented 2 years ago

Also FWIW I've re-run pylint on the original project, though only on a subset (as I think the old run was on 1.x, and pylint has slowed a fair bit in the meantime, plus the project has grown).

This is on a 4-core/8-thread Linux machine (not macOS this time), with Python 3.8.12 and pylint 2.14.5.

The subsection I linted is 71 kLOC in 400 files. The results are as follows:

-j0 pylint -j$i * > /dev/null  206.82s user 1.05s system 99% cpu 3:27.90 total
-j1 pylint -j$i * > /dev/null  205.74s user 1.08s system 99% cpu 3:26.85 total
-j2 pylint -j$i * > /dev/null  163.57s user 1.59s system 199% cpu 1:22.77 total
-j3 pylint -j$i * > /dev/null  198.93s user 2.15s system 298% cpu 1:07.29 total
-j4 pylint -j$i * > /dev/null  238.08s user 2.52s system 384% cpu 1:02.55 total
-j5 pylint -j$i * > /dev/null  304.31s user 3.00s system 450% cpu 1:08.26 total
-j6 pylint -j$i * > /dev/null  374.35s user 3.96s system 551% cpu 1:08.61 total
-j7 pylint -j$i * > /dev/null  462.39s user 4.68s system 639% cpu 1:13.04 total
-j8 pylint -j$i * > /dev/null  487.39s user 5.20s system 688% cpu 1:11.56 total
olivierlefloch commented 1 year ago

On M1 Macs, on large codebases, -j0 is equivalent to -j10 and seems (unsurprisingly, given the split between high-performance and efficiency cores) to perform worse than -j6. This makes it difficult to specify a single value in the shared pylintrc config file for a repository shared between developers using a broad variety of machines, and likely makes -j0 undesirable on recent Apple machines.

xmo-odoo commented 1 year ago

@olivierlefloch I don't think that can be solved by pylint (or any other auto-worker program): a while back, prompted by the comment preceding yours, I tried to see if the stdlib had a way to know real cores (as opposed to vcores / hyperthreads) and didn't find one. I don't remember seeing anything for efficiency/performance cores either.

I think the best solution would be to run under a bespoke hardware layout (make it so pylint can only see an enumerated set of cores of your choice), but I don't know if macOS supports this locally (I don't remember anything similar to Linux's taskset). There is a program called CPUSetter which allows disabling cores globally, but...

Also it doesn't seem like e.g. multiprocessing.cpu_count() is aware of CPU affinity; however, pylint already uses os.sched_getaffinity, so it should work properly on Linux.
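
A small stdlib sketch of what is and isn't available (the sched_* calls below exist only on Linux, and psutil is a third-party assumption, not part of the stdlib):

import os

print(os.cpu_count())  # logical CPUs, hyperthreads included

if hasattr(os, "sched_getaffinity"):  # Linux only
    # CPUs this process is allowed to run on -- the call mentioned above.
    print(len(os.sched_getaffinity(0)))
    # Pinning to a chosen set of cores, similar to `taskset -c 0-3`:
    os.sched_setaffinity(0, {0, 1, 2, 3})

# The stdlib has no notion of physical vs. logical cores, nor of
# performance vs. efficiency cores; third-party psutil can at least
# count physical cores: psutil.cpu_count(logical=False)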