Eclips4 opened 1 year ago
I'd suggest implementing a `-j THREADS` option rather than running on all available cores. I'm using pyperformance for gathering profiling data in my own weird Python PGO+LTO build.
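Until such an option exists, one possible workaround is to pin the whole run to a fixed set of cores from the outside. A minimal sketch, assuming a Linux host with util-linux's `taskset` (the pyperformance invocation itself is illustrative):

```sh
# Pin the benchmark run to cores 0-3 instead of letting it touch every
# available core; adjust the core list to taste.
taskset -c 0-3 python3 -m pyperformance run -o results.json
```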
@rockdrilla

> I'm using pyperformance for gathering profiling data in my own weird Python PGO+LTO build.

If you don't mind, could you describe your use case? These days I'm interested in increasing the coverage of profiling for PGO.
@corona10 it's a pretty weird solution. :smile:
In short: build Python with a shared library, install pyperformance "somewhere" using that "shared" Python, then reconfigure Python for a static binary and build it; the build will run pyperformance while gathering the PGO data.
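A rough shell sketch of that two-pass flow (the prefix, paths, and the `PROFILE_TASK` override are illustrative assumptions, not taken from the actual build script):

```sh
# Pass 1: build a shared-library Python and use it to install pyperformance.
./configure --enable-shared --prefix=/opt/py-shared   # illustrative prefix
make -j"$(nproc)" && make install
/opt/py-shared/bin/python3 -m pip install pyperformance

# Pass 2: reconfigure for a static, PGO-enabled binary and point CPython's
# profile task at pyperformance instead of the default test-suite run.
# (The instrumented binary must be able to import pyperformance, e.g. via
# PYTHONPATH; the override below is a sketch, not the script's actual value.)
make distclean
./configure --enable-optimizations
make -j"$(nproc)" PROFILE_TASK='-m pyperformance run'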
applied patches (related to this case):
build script: `debian/rules` from the package template
upd: benchmark results.
Wow supercool!
I am still conservative about directly supporting a profiling workload based on the pyperformance suite, but I am open to improving the current PGO and LTO for better performance. (For example, ThinLTO is fast, but full LTO with GCC is slow if you don't pass a core count or the auto flag.) Alternatively, we could add a new configure option for designating a pre-generated profile directory, which could be used for external profiling data.
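For that last idea, something along these lines might work (the option name is purely hypothetical; CPython has no such configure flag today):

```sh
# Hypothetical configure option for consuming externally gathered profiles;
# CPython does not currently provide this flag.
./configure --enable-optimizations --with-external-profile-dir=/path/to/pgo-data
```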
I'd suggest using a fixed core count rather than "auto" in `-flto=X`, because I've often seen (GNU) make launch up to N jobs running gcc (in the case of `make -j N`), with each (!) gcc spawning up to N processes of its own. That's confusing enough by itself, but the container runtime confuses the whole build process even more when running in a container with a limited core count.
upd: you may use the `-fprofile-dir=PATH` flag with gcc in order to keep the PGO data separate from the build directory.
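A minimal sketch of the two-step PGO cycle with the profile data kept outside the build tree (the path and file names are illustrative):

```sh
# Step 1: instrumented build; .gcda profile files land under /var/tmp/py-pgo.
gcc -O2 -fprofile-generate -fprofile-dir=/var/tmp/py-pgo -o app main.c
./app                       # run the training workload
# Step 2: optimized rebuild consuming the same profile directory.
gcc -O2 -fprofile-use -fprofile-dir=/var/tmp/py-pgo -o app main.c
```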
> I'd suggest using a fixed core count rather than "auto" in `-flto=X`, because I've often seen (GNU) make launch up to N jobs running gcc (in the case of `make -j N`), with each (!) gcc spawning up to N processes of its own.
Okay, I agree with you. Let's file the issue on CPython and discuss the best way to solve it. I'd prefer that we support it in a seamless way.
At the moment, pyperformance loads a single core. Is there any reason why this is so?