Open uttampawar opened 4 years ago
Hi Uttam,
The main.cc/callee.cc example was to demonstrate how Propeller does basic block reordering. I would be really surprised if this improved performance noticeably. This is a simple "Hello world" where you can verify via nm that Propeller is indeed doing what you would expect, with the expected basic blocks being reordered.
Propeller is really effective on programs which are front end bound like the clang benchmark.
Thanks Sri
On Thu, Oct 24, 2019 at 3:54 PM Uttam Pawar notifications@github.com wrote:
Hi, I'm not able to observe the performance benefit due to propeller toolchain for the included test program (main.cc, callee.cc). Followed the steps given in Propeller_RFC.pdf.
High level observations:
- Elapsed time doesn't show any improvement.
- cycles, instructions, and branch mispredicts are almost the same
- overall cache-misses are lower but L1-icache-load-misses are similar
$ time ./a.out.orig.labels 1000000000 2 >& /dev/null
real    0m21.094s
user    0m20.489s
sys     0m0.604s
$ time ./a.out.labels 1000000000 2 >& /dev/null
real    0m20.357s
user    0m19.908s
sys     0m0.448s
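For reference, the delta between those two wall-clock runs works out as follows (using the `real` times from the runs above):

```shell
# Relative improvement of a.out.labels over a.out.orig.labels,
# computed from the 'real' times of the two runs.
awk -v orig=21.094 -v opt=20.357 \
    'BEGIN { printf "%.1f%%\n", (orig - opt) / orig * 100 }'
# prints 3.5%
```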
Elapsed time varies from 1 to 5%.
Perf data:
$ perf stat -e cycles,instructions,cache-misses,L1-icache-load-misses,br_misp_retired.all_branches,br_inst_retired.all_branches,icache_64b.iftag_stall ./a.out.orig.labels 1000000000 1> /dev/null
Performance counter stats for './a.out.orig.labels 1000000000':
80,231,347,233 cycles (66.67%)
243,314,361,618 instructions # 3.03 insn per cycle (83.33%)
22,522 cache-misses (83.33%)
2,644,077 L1-icache-load-misses (83.33%)
20,400,061 br_misp_retired.all_branches (83.33%)
53,442,616,374 br_inst_retired.all_branches (83.34%)
68,554,744 icache_64b.iftag_stall (57.14%)
21.191516400 seconds time elapsed
Optimized binary
$ perf stat -e cycles,instructions,cache-misses,L1-icache-load-misses,br_misp_retired.all_branches,br_inst_retired.all_branches,icache_64b.iftag_stall ./a.out.labels 1000000000 1> /dev/null
Performance counter stats for './a.out.labels 1000000000':
81,446,698,907 cycles (66.66%)
243,218,220,681 instructions # 2.99 insn per cycle (83.33%)
14,907 cache-misses (83.34%)
2,533,002 L1-icache-load-misses (83.34%)
20,571,010 br_misp_retired.all_branches (83.34%)
53,455,580,211 br_inst_retired.all_branches (83.33%)
68,847,492 icache_64b.iftag_stall (57.14%)
21.512644234 seconds time elapsed
The referenced paper doesn't mention the benefit for the included test program. What is the expected improvement for the included test?
Please see more details (build, runtime steps, etc.) in following gist. https://gist.github.com/uttampawar/5407f998bc3f02f58c4b83b0b4dc20fe
Any hint is appreciated.
@tmsri Thanks, this makes sense. I'm going to apply this methodology to a couple of runtimes (PHP and Node.js/V8) where I've seen > 35% CPU front-end bound bottleneck.
Here is a specific example: running Ghost.js (a Node.js workload), I see large front-end bound stalls.
TMAM_Frontend_Bound(%) 37.11%
TMAM_ITLB_Misses(%) 6.81%
TMAM_Bad_Speculation(%) 10.58%
These numbers are derived using the TMAM Methodology
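For context, the level-1 TMAM frontend-bound metric is the fraction of issue slots in which the front end failed to deliver uops. A hedged sketch of the arithmetic, with illustrative counter values (not the measurements from this thread):

```shell
# TMAM level-1 frontend-bound: undelivered front-end uop slots divided
# by total issue slots (4 per cycle on this class of core).
# Counter values below are illustrative only.
awk 'BEGIN {
  idq_uops_not_delivered = 60e9   # IDQ_UOPS_NOT_DELIVERED.CORE (illustrative)
  cycles                 = 40e9   # CPU_CLK_UNHALTED.THREAD (illustrative)
  slots = 4 * cycles
  printf "Frontend_Bound = %.1f%%\n", 100 * idq_uops_not_delivered / slots
}'
# prints Frontend_Bound = 37.5%
```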
We have fixed, or at least reduced, the ITLB_Misses by 3% by using large pages (a Node.js optimization), which also reduced the frontend stalls by the same amount.
I want to apply the propeller toolchain to see if it can help reduce these stalls with an optimal code layout.
Sounds good, please let us know how this goes. We are happy to help in case you run into issues.
Sri
Note: I tried to use the propeller optimization technique on two workloads but don't see any benefit.
Here is a new data point for the 'node' binary which includes 'd8' JavaScript engine which is built with llvm-propeller and a "webtooling" workload (https://github.com/v8/web-tooling-benchmark).
Performance data using "perf" is in the gist at, https://gist.github.com/uttampawar/759078fa749170b6b75815874a81162a
On a side note, can someone provide instructions on how to use the 'propeller' steps to build the clang compiler and verify the benefit with 'clang' as described in the paper? I appreciate the help. TIA.
Hi Uttam,
On Mon, Nov 25, 2019 at 6:04 PM Uttam Pawar notifications@github.com wrote:
Note: I tried to use propeller optimization technique on two workloads but don't see any benefit.
Here is a new data point for the 'node' binary which includes 'd8' JavaScript engine which is built with llvm-propeller and a "webtooling" workload (https://github.com/v8/web-tooling-benchmark).
Performance data using "perf" is in the gist at, https://gist.github.com/uttampawar/759078fa749170b6b75815874a81162a
We will try this out too and let you know. Do you know how front-end bound this benchmark is?
On the side note, can someone provide me instructions on how to use 'propeller' steps to build clang compiler and verify the benefit with 'clang' compiler as described in the paper?
There is a "plo" directory that is checked in as part of google/llvm-propeller. Do the following:
$ cd plo
$ make check-performance
This will build clang with propeller and you should be able to verify the performance.
Thanks Sri
Sri, Thanks for your response. I'll try your suggestion to verify clang performance.
Following are the details about frontend-bound stalls (calculated using the TMAM methodology while the workload is in steady state).
Webtooling:
metric_TMAM_Frontend_Bound(%) 23%
metricTMAM....ICache_Misses(%) 4%
metricTMAM....ITLB_Misses(%) 3%
metric_TMAM_Bad_Speculation(%) 11%
metric_cycles per txn 37,910,952.8403 (normalized per transaction)
INST_RETIRED.ANY 833,337,440 (normalized per transaction)
And Ghost.js (original):
metric_TMAM_Frontend_Bound(%) 35%
metricTMAM....ICache_Misses(%) 9%
metricTMAM....ITLB_Misses(%) 6%
metric_TMAM_Bad_Speculation(%) 10%
metric_cycles per txn 480,563,097 (normalized per transaction)
INST_RETIRED.ANY 54,570,282 (normalized per request or transaction)
@tmsri make check-performance failed.
cherEmitter3runERN4llvm11raw_ostreamE.$45fc0800c800f0679e2102212a27a313.llvm.6216348528576891080)
referenced 378 more times
... ld.lld: error: undefined symbol: llvm::Record::getValueAsInt(llvm::StringRef) const
referenced by AsmMatcherEmitter.cpp /mnt/sdb1/upawar/llvm-propeller/pl .. ... dcd29.tmp.o:((anonymous namespace)::CallingConvEmitter::EmitAction(llvm::Record*, unsigned int, llvm::raw_ostream&) (.$4422eb9397b83d587c5eecd7b780bf63)) referenced 97 more times
ld.lld: error: too many errors emitted, stopping now (use -error-limit=0 to see all errors)
clang-10: error: linker command failed with exit code 1 (use -v to see invocation)
ninja: build stopped: subcommand failed.
Makefile:189: recipe for target 'pgo/build/bin/clang-10' failed
make: *** [pgo/build/bin/clang-10] Error 1
This looks like an error during the LTO build.
BTW, is there a way to test/verify the benefit just between default clang and clang with propeller (without PGO or LTO)?
On Tue, Nov 26, 2019 at 10:43 AM Uttam Pawar notifications@github.com wrote:
@tmsri https://github.com/tmsri make check-performance failed.
Sorry about that, I think we synced yesterday and this caused a break. We will fix this asap.
@uttampawar I think the error should go away if you do a fresh clone of the repository. (We no longer have pgo as a build directory but we have pgo-vanilla, pgo-labels, and pgo-relocs).
Cool. I'll give it a try. Thanks.
@rlavaee The performance test with "make check-performance" progressed but didn't succeed completely.
It looks like the pgo-labels build failed. Looking into plo/pgo-labels, I found:
$ cd plo
$ ls -l pgo-labels/build/bin/
total 16
-rw-rw-r-- 1 upawar upawar    0 Nov 26 15:31 clang-10
-rw-rw-r-- 1 upawar upawar 8596 Nov 26 15:30 gen_ast_dump_json_test.py
-rwxrwxr-x 1 upawar upawar 2247 Nov 26 15:30 llvm-lit

Other binaries I found in "plo":

$ ls -l stage1/build/bin/clang-10
-rwxrwxr-x 1 upawar upawar 76390272 Nov 26 15:18 stage1/build/bin/clang-10

$ ls -l stage-pgo-labels/build/bin/clang-10
-rwxrwxr-x 1 upawar upawar 138720464 Nov 26 15:29 stage-pgo-labels/build/bin/clang-10

$ ls -l pgo-vanilla/build/bin/clang-10 (0 byte file)
-rw-rw-r-- 1 upawar upawar 0 Nov 26 15:26 pgo-vanilla/build/bin/clang-10

$ ls -l stage-pgo-vanilla/build/bin/clang-10
-rwxrwxr-x 1 upawar upawar 138057128 Nov 26 15:24 stage-pgo-vanilla/build/bin/clang-10
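Incidentally, truncated outputs like the 0-byte clang-10 files above are easy to scan for with find. A small self-contained demonstration (the scratch/ path is made up for illustration; point it at the real plo/ tree in practice):

```shell
# Flag zero-byte files under a build's bin/ directory.
# A scratch directory stands in for the real build tree here.
mkdir -p scratch/bin
: > scratch/bin/clang-10            # empty file, like the failed build output
find scratch/bin -type f -size 0    # prints scratch/bin/clang-10
```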
My environment:
gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
LLVM-propeller commit: c2e699365540111e0e2a7187deda45e4b89333a0
Changes in the paths.mk file:
LLVM_DIR=/mnt/sdb1/upawar/propeller-work
BUILD_DIR=/mnt/sdb1/upawar/propeller-work/llvm-propeller/build
RELEASE_LLVM_BIN=/mnt/sdb1/upawar/propeller-work/llvm-propeller/build/bin
CREATE_LLVM_PROF_DIR=..
$ cd plo; make check-performance
Am I missing something, or is this an environment issue? Any help is appreciated. TIA.
@tmsri @rlavaee See complete log in a gist at, https://gist.github.com/uttampawar/8f6d1ec0c9627d50066cb2e0ca35859a
Hi Uttam, sorry for the late reply; I just got back from Thanksgiving.
I see lots of gold linker plugin errors in building pgo-labels. In theory, the gold linker should never be involved in the whole process. Let me dig a little to see how this could happen.
Thanks, Han
-- Han Shen | Software Engineer | shenhan@google.com | +1-650-440-3330
@shenhanc78 Okay. Thanks for the followup.
@shenhanc78 Any update? TIA.
Hi Uttam, yup. I've just pushed a new version which forces lld to be used across all builds. You may pull/sync and have another try.
I'm happy to help if you run into any further problems.
Thanks! -Han
That's great. I'll give it a try. Thanks.
Hi @uttampawar, can you share how you built the node binary for your optimization experiment?