Closed ae6878ec-d873-4297-8b64-d05d471e780d closed 8 years ago
Hi All,
This is Alecsandru from Server Scripting Languages Optimization team at Intel Corporation.
I would like to submit a request to turn-on Profile Guided Optimization or PGO as the default build option for Python (both 2.7 and 3.6), given its performance benefits on a wide variety of workloads and hardware. For instance, as shown from attached sample performance results from the Grand Unified Python Benchmark, >20% speed up was observed. In addition, we are seeing 2-9% performance boost from OpenStack/Swift where more than 60% of the codes are in Python 2.7. Our analysis indicates the performance gain was mainly due to reduction of icache misses and CPU front-end stalls.
Attached is the Makefile patches that modify the all build target and adds a new one called "disable-profile-opt". We built and tested this patch for Python 2.7 and 3.6 on our Linux machines (CentOS 7/Ubuntu Server 14.04, Intel Xeon Haswell/Broadwell with 18/8 cores). We use "regrtest" suite for training as it provides the best performance improvement. Some of the test programs in the suite may fail which leads to build fail. One solution is to disable the specific failed test using the "-x " flag (as shown in the patch)
Steps to apply the patch:
To disable PGO 7b. make disable-profile-opt
In the following, please find our sample performance results from latest XEON machine, XEON Broadwell EP.
Hardware (HW): Intel XEON (Broadwell) 8 Cores
BIOS settings: Intel Turbo Boost Technology: false Hyper-Threading: false
Operating System: Ubuntu 14.04.3 LTS trusty
OS configuration: CPU freq set at fixed: 2.6GHz by echo 2600000 > /sys/devices/system/cpu/cpu/cpufreq/scaling_min_freq echo 2600000 > /sys/devices/system/cpu/cpu/cpufreq/scaling_max_freq Address Space Layout Randomization (ASLR) disabled (to reduce run to run variation) by echo 0 > /proc/sys/kernel/randomize_va_space
GCC version: gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1\~14.04)
Benchmark: Grand Unified Python Benchmark (GUPB) GUPB Source: https://hg.python.org/benchmarks/
Python2.7 results: Python source: hg clone https://hg.python.org/cpython cpython Python Source: hg update 2.7 hg id: 0511b1165bb6 (2.7) hg id -r 'ancestors(.) and tag()': 15c95b7d81dc (2.7) v2.7.10 hg --debug id -i: 0511b1165bb6cf40ada0768a7efc7ba89316f6a5
Benchmarks Speedup(%)
simple_logging 20
raytrace 20
silent_logging 19
richards 19
chaos 16
formatted_logging 16
json_dump 15
hexiom2 13
pidigits 12
slowunpickle 12
django_v2 12
unpack_sequence 11
float 11
mako 11
slowpickle 11
fastpickle 11
django 11
go 10
json_dump_v2 10
pathlib 10
regex_compile 10
pybench 9.9
etree_process 9
regex_v8 8
bzr_startup 8
2to3 8
slowspitfire 8
telco 8
pickle_list 8
fannkuch 8
etree_iterparse 8
nqueens 8
mako_v2 8
etree_generate 8
call_method_slots 7
html5lib_warmup 7
html5lib 7
nbody 7
spectral_norm 7
spambayes 7
fastunpickle 6
meteor_contest 6
chameleon 6
rietveld 6
tornado_http 5
unpickle_list 5
pickle_dict 4
regex_effbot 3
normal_startup 3
startup_nosite 3
etree_parse 2
call_method_unknown 2
call_simple 1
json_load 1
call_method 1
Python3.6 results
Python source: hg clone https://hg.python.org/cpython cpython
hg id: 96d016f78726 tip
hg id -r 'ancestors(.) and tag()': 1a58b1227501 (3.5) v3.5.0rc1
hg --debug id -i: 96d016f78726afbf66d396f084b291ea43792af1
Benchmark Speedup(%)
fastunpickle 22.94
fastpickle 21.67
json_load 17.64
simple_logging 17.49
meteor_contest 16.67
formatted_logging 15.33
etree_process 14.61
raytrace 13.57
etree_generate 13.56
chaos 12.09
hexiom2 12
nbody 11.88
json_dump_v2 11.24
richards 11.02
nqueens 10.96
fannkuch 10.79
go 10.77
float 10.26
regex_compile 9.8
silent_logging 9.63
pidigits 9.58
etree_iterparse 9.48
2to3 8.44
regex_v8 8.09
regex_effbot 7.88
call_simple 7.63
tornado_http 7.38
etree_parse 4.92
spectral_norm 4.72
normal_startup 4.39
telco 3.88
startup_nosite 3.7
call_method 3.63
unpack_sequence 3.6
call_method_slots 2.91
call_method_unknown 2.59
iterative_count 0.45
threaded_count -2.79
Thank you, Alecsandru
Please upload your patches as separate, uncompressed files for review.
PGO was already proposed here, but nothing came out of it:
https://bugs.python.org/issue17781
I suggest rejecting that old ticket and sticking with this one since it has an actual patch.
I added the patches as individual files and removed the zip file.
Is this supposed to work on Macs using Apple's version of gcc? I've got the latest version of Yosemite and XCode, and am getting these warnings when trying to build 2.7:
clang: warning: argument unused during compilation: '-fprofile-generate'
Should this be enabled using a configure check? Perhaps gcc/clang supports this but spells the feature differently. gcc --help tells me:
% gcc --help | egrep profile -fprofile-instr-generate -fprofile-instr-use=\<value> Use instrumentation data for profile-guided optimization -fprofile-sample-use=\<value> Enable sample-based profile guided optimizations
The patches are tested on Linux machines, with GNU GCC >4.8.3. From your output I see that you are using the CLANG compiler. CLANG uses a different set of flags for PGO that are not compatible with GCC's, therefore the compilation will fail. Can you please use the GNU GCC compiler to test the patches?
It is executed using the gcc command:
% gcc -c -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fprofile-generate -I. -IInclude -I./Include -DPy_BUILD_CORE -o Modules/gcmodule.o Modules/gcmodule.c clang: warning: argument unused during compilation: '-fprofile-generate' % type gcc gcc is /usr/bin/gcc
I have no idea if you can even use something other than clang on Macs now. In any case, the default compiler should work to build Python out of the box, if necessary, by checking things during configure.
I did an initial code review on the 3.6 patch.
What would it take to add clang support for PGO? Is it simply using different flags that configure can set in the generated Makefile? Or is it more involved and would require maintaining two separate compile lines in the Makefile?
I received the review and will post new patch versions as soon as I update them.
Regarding PGO on clang, I will need a bit more time to edit the Makefile and will post it just for clang, to be easier for us to see the differences.
I modified the patches after the review made by Brett (python2.7-pgo-v02.patch and python3.6-pgo-v02.patch):
I also added the patches for Mac exclusively (python2.7-pgo-v02-mac.patch and python3.6-pgo-v02-mac.patch). You must have llvm-profdata installed and in your path (in /Library/Developer/CommandLineTools/usr/bin/) to use it. I tested on Yosemite and it compiles fine with clang. I am working on a generic version (configure and Makefile patches) to merge all these platforms and will post them as soon as it is done.
My initial reaction is that the default should be optimized for short build times. I would not want to type "disable-profile-opt" every time I'm running the tests.
I see that your latest patch leaves PGO as an option, so please ignore my previous comment.
the v02 patches LGTM. I'm fine with seeing those committed as-is knowing Alecsandru is actively working towards Clang support.
i'm updating the title to be more accurate.
turning it on by default is likely not desirable as the makefile is primarily used by developers who are iterating on changes.
but having it use a good workload (regrtest) and work with llvm and os x are good. :)
I modified the patches to be compatible with both environments. The new versions modify the configure.ac file also, therefore you will need to run "autoconf" by hand. Also, in case of MaOS you will need to have llvm-profdata installed and in your path.
I kept the expanded form of regrtest (/Lib/test/regrtest.py) because this way it is clearer to the user what is the main file that runs the training workload.
Also, the "|| true" is necessary also, due to the nature of regrtest. This test suite is designed to return a fail code if a test is not ok, even for tests that do not comply with certain dependencies (meaning users that didn't installed any other libraries).
Any specific reason the v3 patch, Alecsandru, is listed as against 3.5 in the filename? Or is that just a typo?
P.S.: I did another review asking about explicit Clang support and also supporting Greg's request to use -m test
instead of the explicit file execution.
Sorry, it was a typo. I made a correction to it. I will also modify to -m flag, instead of the explicit file execution.
Regarding the clang/gcc support, in v03 version of patches, GCC is supported. On Linux is straightforward. On Mac I see that the default development environment also has the "gcc" command, but it is a binary stub that calls clang in backend, so the flags are adjusted for clang-in-gcc-clothing. You say to support clang explicitly as a compiler in 2.7 and 3.6?
I'm asking if that's possible. For instance I set $CC to clang explicitly on OS X as I install the latest version of LLVM through Homebrew to get better compiler warnings for Python. It would be great if we could avoid leaving all clang users out unless they happen to use the stock install on OS X (e.g. cover Clang users on Linux).
Basically it would be nice if this is not exclusive to gcc if Clang also supports PGO.
Thank you for the clarifications! Your point make sense, we don't want to exclude clang environments. I will analyze this and post some patches once I'm done with it.
I modified the patches with clang support.
Also, I added an important check for the architecture on which PGO is running. Our proposal targets x86 platforms, since our measurements are made only on x86 hardware.
I fixed the files after the review. Regarding the PROFILE_OPT_OLD line, I think that it is better to keep also the old task used for PGO, until clear evidence and measurements that regrtest is performing better on other architectures exists.
Can you explain what the profile-merging thing is achieving?
The profile merging is necessary in case you want to use a pure clang compiler or you use GCC in OSX. For example, a general profiling action using clang will result in at least one binary profile. For our case, when using regrtest, we will have multiple profiles as the test is a multi-process one. The application llvm-profdata has the ability to merge the information collected from multiple processes, thus having a more precise map of what is executed from the profiled application.
This step is mandatory even if we train on a single threaded or single process workload and have just one profile. More information about the entire process can be found here: http://clang.llvm.org/docs/UsersManual.html#profiling-with-instrumentation
I did another round of review. I noticed that the configure part of the patch is missing and that .hgignore and .gitignore should get updated to ignore the profile files. Otherwise the only other comment was making an echoed comment a bit clearer.
And in case anyone else is on OS X Yosemite and gets an error about llvm-profdata missing, make sure that /Library/Developer/CommandLineTools/usr/bin is on your $PATH.
I've updated the patches after review and implemented the checkup for llvm-profdata for both Linux and OSX.
The latest patch worked fine for me (Mac OS X Yosemite). I've only tried with 2.7 so far. The only thing that was a bit mystifying were the errors during the initial profile run. There is so much that floats by in the terminal window that I completely missed the warnings about errors during the test run not being anything to worry about. I only noticed the messages when I took a look at the patch more closely.
Perhaps it would be worthwhile to add a short bit about the profile-opt target and its requirements to the README file.
Not knowing a darn thing about this, I went ahead and made a provisional change to the README file.
That's a good point Skip. I added another set of patches, just for the README files, explaining the entire procedure, so now anyone reading it will see that PGO is available, what are the steps involved and a brief comment about the warning.
The only thing that was a bit mystifying were the errors during the initial profile run. There is so much that floats by in the terminal window that I completely missed the warnings about errors during the test run not being anything to worry about.
Then wouldn't it be better to suppress (or at least reduce) the output of the test runs in this case?
I guess the test output -- both stdout and stderr -- could be redirected to /dev/null as simply using -q with regrtest will still lead to failures being emitted and random output which no one cares about except people inspecting the test output. Just need to make sure to mention that all output is suppressed so people don't think the process is hanging.
I've updated the patches with redirect to /dev/null, as is it is more clearer to the user what is our intent, without having him to necessarily read the regrtest documentation. I've also added a warning message regarding the output and ported all these lines to 3.6 and to the README files also.
Please don't call it "PROFILE_TASK_X86" - the architecture should have nothing to do with it. Actually, there shouldn't be any architecture-specific check at all.
As for the dual 2.7/3.6 aspect: I don't really understand it. If this is committed to 2.7 it should also be committed to 3.5. It doesn't threaten the stability of the interpreter in any way, given it does not affect the default build path. There's no reason why packagers of Python 3.5 should have to separately maintain a patch to have access to this improvement.
I named this task PROFILE_TASK_X86 because it is rigorously tested and we have proven that regrtest performs better on this architecture. Until any other clear evidence and solid measurements that regrtest is performing better on other architectures exists, I'd keep it this way.
Even though this does not threaten the stability of the interpreter in any way, the dual aspect you mentioned appears because CPython 2 and 3 have slightly different makefile rule format. To create a common patch working cross-versions will create a very tangled Makefile. If you all agree that having an unified patch for both versions is acceptable, I will work on that.
I don't think we should provide any performance guarantees in the Makefile. +1 for not special-casing x86 (does it include amd64?).
As I understood, Antoine was not talking about a unified patch but about applying the 3.6 patch to 3.5 right away.
If you are talking just about the 3.6 patch, it is called this way to emphasize the fact that it is intended for the development branch. It is perfectly compatible with 3.5, therefore it is not needed for packagers to maintain two distinct versions. I've tested with: hg update 3.5 ; hg import --no-commit python3.6-pgo-v07.patch ; ./configure ; make profile-opt
I also renamed the profile task makefile name.
Just (hopefully) for extra clarity: As you mentioned, the 3.6 patch is perfect for 3.5, too. The reason why 3.5 was brought up is to ask Larry, our release manager, to allow it already for 3.5.
Technically it's an enhancement/new feature, but practically it is zero risk and for PR reassons we should probably not "make 2.7 faster" before 3.5.
Attached is what I plan to commit to Python 2.7 assuming everyone is happy with the outcome. I tweaked the echoed messages from Alecsandru's patch, pulled in the README changes, and dropped the x86 checks as Antoine and Stefan requested.
Assuming people are happy with the patch I will also apply it to Python 3.5 with the appropriate tweaks.
New changeset 0f4e6c303531 by Brett Cannon in branch '2.7': Issue bpo-24915: Make PGO builds support Clang and use the test suite for https://hg.python.org/cpython/rev/0f4e6c303531
New changeset f211c8f554f9 by Brett Cannon in branch '2.7': Give proper credit for issue bpo-24915 https://hg.python.org/cpython/rev/f211c8f554f9
New changeset 7fcff838d09e by Brett Cannon in branch '3.5': Issue bpo-24915: Add Clang support to PGO builds and use the test suite https://hg.python.org/cpython/rev/7fcff838d09e
New changeset 7749fc0a5ea6 by Brett Cannon in branch 'default': Merge for issue bpo-24915 https://hg.python.org/cpython/rev/7749fc0a5ea6
Thanks to Alecsandru and Intel for the patches!
Thank you Brett for committing this.
Hum, the change 7fcff838d09e broke the buildbot "AMD64 Debian PGO 3.5". It would nice to add Clang support without loosing GCC support :-D
http://buildbot.python.org/all/builders/AMD64%20Debian%20PGO%203.5/builds/274
It didn't break gcc, the buildbot simply wasn't patient enough for the PGO run of the test suite to complete: http://buildbot.python.org/all/builders/AMD64%20Debian%20PGO%203.5/builds/274/steps/compile/logs/stdio . It takes a good amount of time to run the test suite serially with an instrumented interpreter and 20 minutes is not enough time. And I don't want to add output back simply to appease the buildbot as the output means nothing to a user who is doing the build themselves.
So either that buildbot needs to allow for a longer time without output, someone needs to come up with a way to simply emit some output that simply shows stuff is running (but without letting error condition stuff show up), or the buildbot just won't work with PGO.
Le 19/09/2015 20:18, Brett Cannon a écrit :
And I don't want to add output back simply to appease the buildbot as the output means nothing to a user who is doing the build themselves.
The output is actually a good indication of progress, so I don't think it's not as silly to add it back as you seem to think it is :-)
The problem with the output is that error cases are unimportant and yet it fooled Skip into temporarily caring until he finally noticed the warning message. So my worry is that someone doesn't notice the "NOTE: ignore errors as they don't affect anything" and then glances at the output to notice an error and then worries that their PGO run failed.
It people really want to add output back in, though, they will need to patch both the Makefile to have a big NOTE in it as well as the README to say that any errors during the test suite run are unimportant and do not affect the outcome of the profile-guided optimizations.
Would it be possible to grep out the warning messages, but let everything else through? On Sep 19, 2015 1:34 PM, "Brett Cannon" \report@bugs.python.org\ wrote:
Brett Cannon added the comment:
The problem with the output is that error cases are unimportant and yet it fooled Skip into temporarily caring until he finally noticed the warning message. So my worry is that someone doesn't notice the "NOTE: ignore errors as they don't affect anything" and then glances at the output to notice an error and then worries that their PGO run failed.
It people really want to add output back in, though, they will need to patch both the Makefile to have a big NOTE in it as well as the README to say that any errors during the test suite run are unimportant and do not affect the outcome of the profile-guided optimizations.
----------
Python tracker \report@bugs.python.org\ \http://bugs.python.org/issue24915\
Thank you for upstreaming this in both branches of Python! Do you think that a different version of regrtest.py, that will be used only for PGO training, should be better in this case? I mean, by implementing a custom version, I think we can control better the output and errors shown on screen.
I gave the custom test runner a try using unittest's discovery facility, but it started to execute the whole test suite again, so it's a bit more complicated than you might think (I guess it imported regrtest or something?).
Instead of writing a custom test runner from scratch, I would suggest adding a hidden --option to regrtest that would disable reporting errors.
I can work on modifying the existing regrtest and adding a distinct flag, --pgo for example, as Antoine suggested. Indeed, it will not be trivial as regrtest has a dual approach (single process and multi process), but I will give it a try and post a patch as soon as possible.
I also suggest that I open a new issue for this case as it is somehow a distinct implementation than pure PGO and definitively will be some iterations on regrtest.py for both versions of Python until we reach a common ground. It is ok for everyone?
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields: ```python assignee = 'https://github.com/brettcannon' closed_at =
created_at =
labels = ['build', 'performance']
title = 'Profile Guided Optimization improvements (better training, llvm support, etc)'
updated_at =
user = 'https://bugs.python.org/alecsandrupatrascu'
```
bugs.python.org fields:
```python
activity =
actor = 'brett.cannon'
assignee = 'brett.cannon'
closed = True
closed_date =
closer = 'brett.cannon'
components = ['Build']
creation =
creator = 'alecsandru.patrascu'
dependencies = ['25188']
files = ['40231', '40232', '40236', '40237', '40238', '40239', '40254', '40255', '40267', '40268', '40273', '40274', '40288', '40289', '40297', '40299', '40300', '40389', '40390', '40391', '40392', '40441', '40522', '41916', '41917']
hgrepos = []
issue_num = 24915
keywords = ['patch']
message_count = 63.0
messages = ['248988', '248992', '249002', '249004', '249006', '249008', '249013', '249014', '249052', '249053', '249055', '249061', '249071', '249128', '249131', '249143', '249146', '249155', '249200', '249227', '249246', '249256', '249286', '249315', '249333', '249336', '249355', '250008', '250011', '250093', '250095', '250096', '250099', '250100', '250102', '250105', '250489', '251033', '251034', '251035', '251036', '251065', '251090', '251091', '251096', '251110', '251112', '251125', '251126', '251127', '251128', '251129', '252179', '252182', '252184', '259840', '259868', '259870', '259878', '259881', '259883', '260269', '260573']
nosy_count = 11.0
nosy_names = ['skip.montanaro', 'brett.cannon', 'gregory.p.smith', 'tzot', 'pitrou', 'scoder', 'vstinner', 'larry', 'skrah', 'python-dev', 'alecsandru.patrascu']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue24915'
versions = ['Python 2.7', 'Python 3.5', 'Python 3.6']
```