linuxboot / heads

A minimal Linux that runs as a coreboot or LinuxBoot ROM payload to provide a secure, flexible boot environment for laptops, workstations and servers.
https://osresearch.net/
GNU General Public License v2.0

Parallel make doesn't work on a clean checkout #935

Closed osresearch closed 2 years ago

osresearch commented 3 years ago

There are some components that do not correctly depend on the cross compiler, so a parallel make on a clean checkout fails. The easy fixes are the various tools in $(COREBOOT_UTIL_DIR) like cbmem, superiotool and inteltool, which need to wait until the cross compiler is available.

bwachter commented 3 years ago

I'm currently running into this trying to build one of the x230 pull requests. -j isn't propagated to the subdirectories, and setting MAKE_JOBS breaks bootstrapping of the toolchain. Without anything set everything runs as -j1, though, which takes ages.

At least some of that seems to have been introduced in the past 6 months. I remember having some initial issues getting it to use parallel builds back in summer, but once I had that working a full build took about 5 minutes (including bootstrapping the toolchain). This time I've been sitting on it for about 1 hour and still haven't got past the toolchain bootstrapping.

tlaurion commented 3 years ago

@bwachter documentation has been updated here, rendered here, to state that CPUS=YY should be specified on the make BOARD=XXX Heads board build statement.

I'm still a bit confused here, since the main Makefile of Heads takes the output of nproc to populate the CPUS variable when CPUS is not specified on the initial make call, and that value is then passed along.

That hack was implemented because some modules don't like having -j forced, whereas others play along. The idea was to pass CPUS down to the other makefiles (modules/*) where they play fair, and to go single-threaded for modules that don't play well.
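The idea can be sketched in shell. This is only an illustration under stated assumptions: the real logic lives in the Heads Makefile and modules/*, and the names SERIAL_MODULES and jobs_for_module are hypothetical. newt is listed only because it is reported further down this thread as racing under parallel builds.

```shell
#!/bin/sh
# Default CPUS to the nproc output unless the caller already set it
# (falls back to 1 if nproc is unavailable).
CPUS="${CPUS:-$(nproc 2>/dev/null || echo 1)}"

# Hypothetical list of modules that misbehave when -j is forced.
SERIAL_MODULES="newt"

# Print the job count to use for one module:
# 1 for modules on the serial list, CPUS for everything else.
jobs_for_module() {
    for m in $SERIAL_MODULES; do
        if [ "$m" = "$1" ]; then
            echo 1
            return
        fi
    done
    echo "$CPUS"
}

echo "coreboot: -j$(jobs_for_module coreboot)"
echo "newt:     -j$(jobs_for_module newt)"
```

Running it as CPUS=8 sh sketch.sh would pick 8 jobs for coreboot and 1 for newt. Note that this only parallelises inside each module; it says nothing about how many modules are built at once.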

@osresearch?

osresearch commented 3 years ago

The issue with CPUS=xxx on the make command line is that it does not spawn multiple top-level jobs. So while the individual components might be built in parallel, there is no parallelism across the modules. On a rebuild after a make real.clean (so the cross compilers are intact) only one module is built at a time. On my build machine this makes the difference between a 90 second rebuild and a much longer process.

make real.clean && time make V=1 CPUS=128
...
real    30m2.657s
user    25m18.422s
sys     7m0.333s

versus

make real.clean && time make V=1 -j128
...
real    1m27.378s
user    30m26.307s
sys     6m12.245s
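One way to read these two timings is to divide CPU time (user + sys) by wall-clock time (real), which gives the average number of cores kept busy. The helper names to_seconds and busy_cores below are made up for this sketch; the numbers are the ones quoted above.

```shell
#!/bin/sh
# Force the C locale so awk parses and prints "." as the decimal separator.
export LC_ALL=C

# Convert a `time` value such as 30m2.657s into seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f", $1 * 60 + $2 }'
}

# Arguments: real user sys. Prints (user + sys) / real,
# i.e. the average number of cores that were busy.
busy_cores() {
    awk -v r="$(to_seconds "$1")" \
        -v u="$(to_seconds "$2")" \
        -v s="$(to_seconds "$3")" \
        'BEGIN { printf "%.1f", (u + s) / r }'
}

echo "CPUS=128: $(busy_cores 30m2.657s 25m18.422s 7m0.333s) cores busy on average"
echo "-j128:    $(busy_cores 1m27.378s 30m26.307s 6m12.245s) cores busy on average"
```

For the runs above this comes out at about 1.1 cores for CPUS=128 and about 25.2 cores for -j128, which matches the observation that only one module is built at a time without top-level -j.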

tlaurion commented 3 years ago

@osresearch The problem with -j24 is that there is no locking between interdependent tasks.

Here, I launched the same build on CircleCI with CPUS=24 (functional) and with -j24 (failing badly with missing files).

The line rm -rf build/x230-hotp-maximized/* build/log/* && make -j24 V=1 BOARD=x230-hotp-maximized || touch /tmp/failed_build in CircleCI lets all the created logs be output in the next CI task. They would otherwise not be readable, since -j24 interleaves the output.

So the next task picks up the created logs and outputs them on CI with delimiters: if [[ -f /tmp/failed_build ]]; then find ./build/ -name "*.log" -type f -mmin -1|while read log; do echo ""; echo '==>' "$log" '<=='; echo ""; cat $log;done; exit 1;else echo "Not failing. Continuing..."; fi

The resulting logs are concatenated here with ==> and <== separators.
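For readability, the same log-dumping step can be written as two small functions. This is a sketch rather than the actual CI configuration: the function names dump_recent_logs and report_build are hypothetical, and the demo at the end uses a temporary directory instead of ./build and /tmp/failed_build.

```shell
#!/bin/sh
# Print every *.log under a directory that was modified within the last
# N minutes (default 1), each preceded by a "==> path <==" header.
dump_recent_logs() {
    dir="$1"
    minutes="${2:-1}"
    find "$dir" -name '*.log' -type f -mmin "-$minutes" |
    while IFS= read -r log; do
        echo ""
        echo "==> $log <=="
        echo ""
        cat "$log"
    done
}

# If the failure marker exists, dump the recent logs and return 1;
# otherwise report that the build is fine.
report_build() {
    marker="$1"
    logdir="$2"
    if [ -f "$marker" ]; then
        dump_recent_logs "$logdir"
        return 1
    fi
    echo "Not failing. Continuing..."
}

# Demo on a throwaway directory with one fake log and a failure marker.
demo=$(mktemp -d)
echo "error: missing file" > "$demo/newt.log"
touch "$demo/failed_build"
report_build "$demo/failed_build" "$demo" || echo "build failed"
rm -rf "$demo"
```

In the CI step this would be called as report_build /tmp/failed_build ./build || exit 1. Reading with IFS= read -r and quoting "$log" keeps paths with spaces or backslashes intact.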

tlaurion commented 3 years ago

@osresearch: I'm redoing a build without decompressing your cache file. It failed on CircleCI for your latest commit because the config.status it fed in pointed to host binaries (sed, gawk and others) that were not found at the provided paths.

A CI build is happening here

tlaurion commented 2 years ago

@osresearch seems like the missing piece was commenting out MAKE_JOBS in the global Makefile on PR #984 (and deleting the configure cache, which doesn't work under CircleCI).

After that, a weird race condition happens only in the newt module build, which can be hacked around by forcing that module to be built with a single job. I will do a replacement PR once I have worked around the new CircleCI limitations.

tlaurion commented 2 years ago

@osresearch @bwachter: I managed to partly fix this issue in this commit: https://github.com/osresearch/heads/pull/1015/commits/5e4309c1e19b58cb6c2ca9cddd11893c83afaa71

There, MAKE_JOBS was commented out in the main Makefile. Improvements welcome.

tlaurion commented 2 years ago

Tag me to reopen. CircleCI now builds in parallel on 36 cores in under 45 minutes from a clean checkout.