mesonbuild / meson-python

Meson PEP 517 Python build backend
https://mesonbuild.com/meson-python/
MIT License

Dynamic control over ninja parallelism, or generic dynamic features #416

Open rgommers opened 1 year ago

rgommers commented 1 year ago

First the specific feature we need in SciPy: set the number of parallel compile jobs for ninja to equal the number of available physical cores (n_phys). We have a number of reports about the ninja default setting, which is 2*n_phys + 2, either giving out-of-memory errors or even crashing the build machine outright:

For local development this was not difficult to address (see https://github.com/scipy/scipy/pull/18451), because SciPy has a developer interface where we can run custom code before invoking Meson. However, for pip install & co, there is currently no way to do that. So a user who may be installing some random package that happens to depend on SciPy may get a crash or hang that we can't easily prevent. Hardcoding, say, -j4 is not great - there is no single setting that works on low-end machines without throwing away lots of performance on high-end machines. The optimal setting seems to always be close to the number of physical cores.

So, ideally there would be a hook that meson-python exposes to package authors which can run arbitrary Python code to then set compile-args with.

I also see a more general pattern here, with gh-159 being another example of the need to execute code first and then set some build-related property (the package version in that case). It's a very similar case, and rather than a specific solution for each of these two problems, we may consider a general mechanism to deal with this kind of thing. Maybe a single Python function to execute, which must return a dict containing a pre-determined set of keys, including version, compile-args and the other *-args knobs.
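To make the idea concrete, here is a rough sketch of what such a hook could look like. The module name, function name, and returned keys are all hypothetical - none of this is an existing meson-python API, and the values are placeholders:

```python
# Hypothetical hook module (e.g. _build_hooks.py) that meson-python would import
# and call at build time.  Neither the function name nor the returned keys are an
# existing meson-python API; this only illustrates the "one function returning a
# dict of settings" idea, and all values are placeholders.
import os


def meson_python_settings():
    n_jobs = max((os.cpu_count() or 2) // 2, 1)  # crude stand-in for physical cores
    return {
        'version': '1.2.3',                # could instead be derived from git, etc.
        'compile-args': [f'-j{n_jobs}'],   # forwarded to ninja
    }
```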

Thoughts?

eli-schwartz commented 1 year ago

I'm curious why this would be considered a property of the source package, rather than a property of the build machine?

If it's just about the build machine, maybe the build backend itself should make the judgment call to calculate a different default instead of the default Ninja's own code uses, and pass that to ninja unless compile-args is used by the person running pip/build?
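A minimal sketch of that idea, assuming a hypothetical helper inside the build backend (the function name and the chosen default are made up for illustration):

```python
# Minimal sketch: the backend picks a saner -j default unless the person running
# pip/build already passed one via compile-args.  The helper name is made up.
import os


def with_default_jobs(compile_args):
    if any(arg.startswith('-j') for arg in compile_args):
        return compile_args  # respect the explicit user choice
    jobs = max((os.cpu_count() or 2) // 2, 1)  # conservative default
    return [*compile_args, f'-j{jobs}']
```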

rgommers commented 1 year ago

It is indeed mostly dependent on the build machine. But I would say that the ninja default seems to be fine for smaller packages, but doesn't work for heavier builds, so in that sense it's also somewhat package-dependent.

meson-python changing the default to the number of physical cores would clearly be an improvement though, so that is also an option here.

baybal commented 1 year ago

I would like the build server to be able to limit parallelism to avoid OOM. I'm repeatedly having failed webkit builds until I run with something like -j3 or -j4. Some large C++ projects will fail to compile even with 8 free gigs of RAM.

And another thing is to avoid swapping. A build which hits swapping can run many times longer than one which keeps everything in RAM.

eli-schwartz commented 1 year ago

A slightly more relevant ninja issue is https://github.com/ninja-build/ninja/issues/1482#issuecomment-433861702

Ideally this could be controlled via $NINJAFLAGS but the ninja developers seem to be resistant to the idea...

baybal commented 1 year ago

A better rephrasing of the issue I have is: "you will run out of RAM much faster than you run out of CPU with many C++ programs". When compiling webkit, setting -l0 makes you run out of RAM within minutes.

baybal commented 1 year ago

And a niche case is distcc/icecream/sccache, where you want to completely disable parallelism management beyond the thread limit.

ax3l commented 1 year ago

Adding another use case: in HPC, we often compile on shared resources where we cannot run with the default -j: the system resource watch scripts will abort ninja builds due to excessive load. https://github.com/scipy/scipy/issues/18766

I saw in other issues that there might be some resistance to adding environment variable control in Ninja, but as a report from the trenches, I find CMake's CMAKE_BUILD_PARALLEL_LEVEL environment variable extremely useful for controlling parallelism in complex builds. Maybe meson has a similar env variable or could consider adding one?

rgommers commented 1 year ago

I'd like to revisit this. Something has to be done, because it is not okay that a default pip install scipy can crash the build or even hang a machine. Other heavy packages with C++ code are likely to have similar experiences in the future. The ninja devs are clearly not going to fix this, so here are the options we have:

  1. If compile-args does not contain a -j argument, have meson-python set -j to the number of physical cores (taking into account CPU affinity etc. - I have well-tested code for this if we go that way; a rough sketch is included at the end of this comment)

    • Pros: no configuration needed, and it's a better default on average (from GHA, where -j2 tends to be optimal, to large build servers where 2*n_cores + 2 tends to be too high)
    • Cons: it's a change in behavior, passing an extra flag is non-ideal, and it may interfere with custom workarounds people already have
  2. Adding a setting like use-physical-cores = false in pyproject.toml, which, when set to true, does the same as (1).

    • Pros: fixes the problem too, typically packages will figure this out after the first or second bug report
    • Cons: it's an extra config knob, which is extra overhead
  3. Some kind of hook that allows packages to run custom code and then pass extra compile-args.

    • Pros: at least allows fixing the problem, and perhaps it will turn out to be useful for other pre-build actions
    • Cons: it's pretty complicated both to implement and to use, and I'm not sure we want to take on the extra maintenance burden

I'd be inclined to go with (2), for safety / backwards compatibility.
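As mentioned in option (1), here is a simplified sketch of physical-core detection that respects CPU affinity. It is not the well-tested code referred to above, and psutil is only one possible way to detect physical cores:

```python
# Simplified sketch of option (1): derive -j from the number of physical cores,
# capped by the CPU affinity mask.  Not the well-tested code mentioned above;
# psutil is only one possible way to detect physical cores.
import os


def default_compile_jobs() -> int:
    try:
        import psutil
        physical = psutil.cpu_count(logical=False) or 1
    except ImportError:
        # Rough fallback: assume two logical CPUs per physical core.
        physical = max((os.cpu_count() or 2) // 2, 1)
    try:
        # Respect CPU affinity (containers, taskset, HPC schedulers), if exposed.
        affinity = len(os.sched_getaffinity(0))
    except AttributeError:
        # os.sched_getaffinity() is not available on Windows or macOS.
        affinity = os.cpu_count() or physical
    return max(min(physical, affinity), 1)


# meson-python could then pass ['-j', str(default_compile_jobs())] to ninja
# unless the user already supplied a -j value via compile-args.
```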

dnicolodi commented 1 year ago

This problem arises from the Python packaging ecosystem's decision to treat binary and source installs as equivalent and to transparently fall back to compiling from source when a binary package is not available. With this premise, it is always possible to find an environment constrained enough that building a given package fails, whatever heuristic is used to determine the parallelization settings.

I'm surprised that the ninja defaults are not adequate for large packages. I believe ninja was developed to compile Chrome, which is a huge C++ application and thus should be close to the worst case in terms of resource usage during compilation.

Any heuristic based on the number of processors does not look like a promising solution to me. I haven't looked at the bug reports in detail, but doesn't ninja's -l option help in this case? Is it available at all on Windows?

rgommers commented 1 year ago

ninja -l uses a measure of CPU utilization. I'm not sure whether it can help with @ax3l's HPC use case; it may.

It doesn't look like it will do anything for the pip install heavy_package case in general, because it doesn't look at memory usage and that is the problem here. You want 100% CPU utilization on something like GitHub Actions, so defaulting -l to 50% may end up using only a single core when you have two vCPUs.

Also, that option doesn't seem to work on Windows, is undocumented, and even seems broken on Linux for me - I can only make it switch between 1 process and 24 on my 12-core machine.

dnicolodi commented 1 year ago

it doesn't look at memory usage and that is the problem here

This is exactly the reason why I don't think a heuristic based on the number of CPUs is beneficial in most cases. There is only so much correlation between the number of CPUs and the amount of memory available.

dnicolodi commented 1 year ago

Another possible solution is to allow variable substitutions in the arguments passed to meson. This way it would be possible to write something like:

[tool.meson-python.args]
install = ['-j', '%(ncores)s']

or something like this. We could even think about allowing simple arithmetic expressions, like %(ncores * 2 / 3)s, although this would require some more work to implement. This would be flexible and extensible, but I don't see much use for this mechanism outside this use case.

rgommers commented 1 year ago

There is only so much correlation between the number of CPUs and the amount of memory available.

True - but the correlation is significant enough that implementing this would help a lot. The problem here seems to be that the typical consumer laptop comes with a CPU that has <= 4 physical cores and 8 GB of RAM. And then 1 GB of memory usage per process causes hangs/crashes on a heavy project like SciPy, while something near 0.5 GB seems safe. It looks like shifting a "long tail" of problematic cases by 2x would reduce the number of problems by much more than 2x.

Another possible solution is to allow variable substitutions in the arguments passed to meson.

Interesting idea. I think I like it. Maybe it's simple enough if we only allow *, /, +, - and Python scalars (or integers)?

dnicolodi commented 1 year ago

Interesting idea. I think I like it. Maybe it's simple enough if we only allow *, /, +, - and Python scalars (or integers)?

This is another demonstration that all configuration formats need to be lisp, and when they are not, they painfully converge to a badly designed and poorly implemented version of lisp 😄

dnicolodi commented 1 year ago

Interesting idea. I think I like it. Maybe it's simple enough if we only allow *, /, +, - and Python scalars (or integers)?

I've implemented the parsing and evaluation of the expressions. Using the ast module it can be done in surprisingly few lines of code. Which format do you like best for the placeholders?

I think %(ncores)d and the extended version %(ncores * 2 / 3)d look familiar (at least to Python programmers who grew up in the days of Python 2 and the % operator).

Using the format string syntax has the advantage of a ready-made parser in string.Formatter, but has the problem that {ncores * 2 / 3 :d} is not valid, as it tries to format a float as an integer.
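For illustration, a rough sketch of how such %(ncores)d-style placeholders could be parsed and evaluated with the ast module. This is not the actual implementation; the function names, the supported operators, and the assumed ncores variable are illustrative choices:

```python
# Rough sketch of evaluating "%(ncores * 2 / 3)d"-style placeholders with the
# ast module.  Only the four arithmetic operators and an assumed 'ncores'
# variable are supported here.
import ast
import operator
import os
import re

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _eval(node, variables):
    # Recursively evaluate a small whitelisted subset of Python expressions.
    if isinstance(node, ast.Expression):
        return _eval(node.body, variables)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left, variables), _eval(node.right, variables))
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.Name) and node.id in variables:
        return variables[node.id]
    raise ValueError(f'unsupported expression: {ast.dump(node)}')


def expand(arg, variables=None):
    # Replace every %(expr)d placeholder in 'arg' with the evaluated integer.
    if variables is None:
        variables = {'ncores': os.cpu_count() or 1}

    def repl(match):
        tree = ast.parse(match.group(1), mode='eval')
        return str(int(_eval(tree, variables)))

    return re.sub(r'%\((.+?)\)d', repl, arg)


# Example: expand('-j%(ncores * 2 / 3)d', {'ncores': 12}) returns '-j8'
```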

dnicolodi commented 1 year ago

I actually just realized that supporting the %(ncores)d format is even easier. I'm opening a PR with a draft implementation.