Open uttampawar opened 4 years ago
Hi Uttam,
The main.cc/callee.cc example was to demonstrate how Propeller does basic block reordering. I would be really surprised if this improved performance noticeably. This is a simple "Hello world" where you can verify via nm that Propeller is indeed doing what you would expect, with the expected basic blocks being reordered.
Propeller is really effective on programs which are front end bound like the clang benchmark.
Thanks Sri
On Thu, Oct 24, 2019 at 3:54 PM Uttam Pawar notifications@github.com wrote:
Hi, I'm not able to observe the performance benefit due to propeller toolchain for the included test program (main.cc, callee.cc). Followed the steps given in Propeller_RFC.pdf.
High level observations:
- Elapsed time doesn't show any improvement.
- cycles, instructions, and branch mispredicts are almost the same
- overall cache-misses are lower but L1-icache-load-misses are similar
$ time ./a.out.orig.labels 1000000000 2 >& /dev/null
real    0m21.094s
user    0m20.489s
sys     0m0.604s
$ time ./a.out.labels 1000000000 2 >& /dev/null
real    0m20.357s
user    0m19.908s
sys     0m0.448s
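For reference, the delta between those two wall-clock runs works out as follows (using the `real` times from the runs above):

```shell
# Relative improvement of a.out.labels over a.out.orig.labels,
# computed from the 'real' times of the two runs.
awk -v orig=21.094 -v opt=20.357 \
    'BEGIN { printf "%.1f%%\n", (orig - opt) / orig * 100 }'
# prints 3.5%
```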
Elapsed time varies from 1 to 5%.
Perf data:
$ perf stat -e cycles,instructions,cache-misses,L1-icache-load-misses,br_misp_retired.all_branches,br_inst_retired.all_branches,icache_64b.iftag_stall ./a.out.orig.labels 1000000000 1> /dev/null
Performance counter stats for './a.out.orig.labels 1000000000':
80,231,347,233 cycles (66.67%)
243,314,361,618 instructions # 3.03 insn per cycle (83.33%)
22,522 cache-misses (83.33%)
2,644,077 L1-icache-load-misses (83.33%)
20,400,061 br_misp_retired.all_branches (83.33%)
53,442,616,374 br_inst_retired.all_branches (83.34%)
68,554,744 icache_64b.iftag_stall (57.14%)
21.191516400 seconds time elapsed
Optimized binary
$ perf stat -e cycles,instructions,cache-misses,L1-icache-load-misses,br_misp_retired.all_branches,br_inst_retired.all_branches,icache_64b.iftag_stall ./a.out.labels 1000000000 1> /dev/null
Performance counter stats for './a.out.labels 1000000000':
81,446,698,907 cycles (66.66%)
243,218,220,681 instructions # 2.99 insn per cycle (83.33%)
14,907 cache-misses (83.34%)
2,533,002 L1-icache-load-misses (83.34%)
20,571,010 br_misp_retired.all_branches (83.34%)
53,455,580,211 br_inst_retired.all_branches (83.33%)
68,847,492 icache_64b.iftag_stall (57.14%)
21.512644234 seconds time elapsed
The referenced paper doesn't mention the benefit for the included test program. What is the expected improvement for the included test?
Please see more details (build, runtime steps, etc.) in following gist. https://gist.github.com/uttampawar/5407f998bc3f02f58c4b83b0b4dc20fe
Any hint is appreciated.
@tmsri Thanks, this makes sense. I'm going to apply this methodology to a couple of runtimes (PHP and Node.js/V8) where I've seen > 35% CPU front-end bound bottleneck.
Here is a specific example: running Ghost.js (a Node.js workload), I see large front-end bound stalls.
TMAM_Frontend_Bound(%) 37.11%
TMAM_ITLB_Misses(%) 6.81%
TMAM_Bad_Speculation(%) 10.58%
These numbers are derived using the TMAM Methodology
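For context, the level-1 TMAM frontend-bound metric is the fraction of issue slots in which the front end failed to deliver uops. A hedged sketch of the arithmetic, with illustrative counter values (not the measurements from this thread):

```shell
# TMAM level-1 frontend-bound: undelivered front-end uop slots divided
# by total issue slots (4 per cycle on this class of core).
# Counter values below are illustrative only.
awk 'BEGIN {
  idq_uops_not_delivered = 60e9   # IDQ_UOPS_NOT_DELIVERED.CORE (illustrative)
  cycles                 = 40e9   # CPU_CLK_UNHALTED.THREAD (illustrative)
  slots = 4 * cycles
  printf "Frontend_Bound = %.1f%%\n", 100 * idq_uops_not_delivered / slots
}'
# prints Frontend_Bound = 37.5%
```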
We have fixed, or at least reduced, the ITLB_Misses by 3% by using large pages (a Node.js optimization), which also reduced the frontend stalls by the same amount.
I want to apply the propeller toolchain to see if it can help reduce these stalls with an optimal code layout.
Sounds good, please let us know how this goes. We are happy to help in case you run into issues.
Sri
Note: I tried to use the propeller optimization technique on two workloads but don't see any benefit.
Here is a new data point for the 'node' binary which includes 'd8' JavaScript engine which is built with llvm-propeller and a "webtooling" workload (https://github.com/v8/web-tooling-benchmark).
Performance data using "perf" is in the gist at, https://gist.github.com/uttampawar/759078fa749170b6b75815874a81162a
On a side note, can someone provide instructions on how to use the 'propeller' steps to build the clang compiler and verify the benefit with 'clang' as described in the paper? I appreciate the help. TIA.
Hi Uttam,
On Mon, Nov 25, 2019 at 6:04 PM Uttam Pawar notifications@github.com wrote:
Note: I tried to use propeller optimization technique on two workloads but don't see any benefit.
Here is a new data point for the 'node' binary which includes 'd8' JavaScript engine which is built with llvm-propeller and a "webtooling" workload (https://github.com/v8/web-tooling-benchmark).
Performance data using "perf" is in the gist at, https://gist.github.com/uttampawar/759078fa749170b6b75815874a81162a
We will try this out too and let you know. Do you know how front-end bound this benchmark is?
On the side note, can someone provide me instructions on how to use 'propeller' steps to build clang compiler and verify the benefit with 'clang' compiler as described in the paper?
There is a "plo" directory that is checked in as part of google/llvm-propeller. Do the following:
$ cd plo
$ make check-performance
This will build clang with propeller and you should be able to verify the performance.
Thanks Sri
Sri, Thanks for your response. I'll try your suggestion to verify clang performance.
Following are the details about frontend-bound stalls (calculated using the TMAM methodology while the workload is in steady state).
Webtooling:
metric_TMAM_Frontend_Bound(%) 23%
metricTMAM....ICache_Misses(%) 4%
metricTMAM....ITLB_Misses(%) 3%
metric_TMAM_Bad_Speculation(%) 11%
metric_cycles per txn 37,910,952.8403 (normalized per transaction)
INST_RETIRED.ANY 833,337,440 (normalized per transaction)
And Ghost.js (original):
metric_TMAM_Frontend_Bound(%) 35%
metricTMAM....ICache_Misses(%) 9%
metricTMAM....ITLB_Misses(%) 6%
metric_TMAM_Bad_Speculation(%) 10%
metric_cycles per txn 480,563,097 (normalized per transaction)
INST_RETIRED.ANY 54,570,282 (normalized per request or transaction)
@tmsri make check-performance failed.
cherEmitter3runERN4llvm11raw_ostreamE.$45fc0800c800f0679e2102212a27a313.llvm.6216348528576891080)
referenced 378 more times
... ld.lld: error: undefined symbol: llvm::Record::getValueAsInt(llvm::StringRef) const
referenced by AsmMatcherEmitter.cpp /mnt/sdb1/upawar/llvm-propeller/pl .. ... dcd29.tmp.o:((anonymous namespace)::CallingConvEmitter::EmitAction(llvm::Record*, unsigned int, llvm::raw_ostream&) (.$4422eb9397b83d587c5eecd7b780bf63)) referenced 97 more times
ld.lld: error: too many errors emitted, stopping now (use -error-limit=0 to see all errors)
clang-10: error: linker command failed with exit code 1 (use -v to see invocation)
ninja: build stopped: subcommand failed.
Makefile:189: recipe for target 'pgo/build/bin/clang-10' failed
make: *** [pgo/build/bin/clang-10] Error 1
This looks like an error during the LTO build.
BTW, is there a way to test/verify the benefit just between default clang and clang with propeller (without PGO or LTO)?
On Tue, Nov 26, 2019 at 10:43 AM Uttam Pawar notifications@github.com wrote:
@tmsri https://github.com/tmsri make check-performance failed.
Sorry about that, I think we synced yesterday and this caused a break. We will fix this asap.
@uttampawar I think the error should go away if you do a fresh clone of the repository. (We no longer have pgo as a build directory but we have pgo-vanilla, pgo-labels, and pgo-relocs).
Cool. I'll give it a try. Thanks.
@rlavaee The performance test with "make check-performance" progressed but didn't succeed completely.
It looks like the pgo-labels build failed. Looking into plo/pgo-labels, I found:
$ cd plo
$ ls -l pgo-labels/build/bin/
total 16
-rw-rw-r-- 1 upawar upawar    0 Nov 26 15:31 clang-10
-rw-rw-r-- 1 upawar upawar 8596 Nov 26 15:30 gen_ast_dump_json_test.py
-rwxrwxr-x 1 upawar upawar 2247 Nov 26 15:30 llvm-lit

Other binaries I found in "plo":

$ ls -l stage1/build/bin/clang-10
-rwxrwxr-x 1 upawar upawar 76390272 Nov 26 15:18 stage1/build/bin/clang-10

$ ls -l stage-pgo-labels/build/bin/clang-10
-rwxrwxr-x 1 upawar upawar 138720464 Nov 26 15:29 stage-pgo-labels/build/bin/clang-10

$ ls -l pgo-vanilla/build/bin/clang-10 (0 byte file)
-rw-rw-r-- 1 upawar upawar 0 Nov 26 15:26 pgo-vanilla/build/bin/clang-10

$ ls -l stage-pgo-vanilla/build/bin/clang-10
-rwxrwxr-x 1 upawar upawar 138057128 Nov 26 15:24 stage-pgo-vanilla/build/bin/clang-10
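Incidentally, truncated outputs like the 0-byte clang-10 files above are easy to scan for with find. A small self-contained demonstration (the scratch/ path is made up for illustration; point it at the real plo/ tree in practice):

```shell
# Flag zero-byte files under a build's bin/ directory.
# A scratch directory stands in for the real build tree here.
mkdir -p scratch/bin
: > scratch/bin/clang-10            # empty file, like the failed build output
find scratch/bin -type f -size 0    # prints scratch/bin/clang-10
```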
My environment:
gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
LLVM-propeller commit: c2e699365540111e0e2a7187deda45e4b89333a0
Changes in the paths.mk file:
LLVM_DIR=/mnt/sdb1/upawar/propeller-work
BUILD_DIR=/mnt/sdb1/upawar/propeller-work/llvm-propeller/build
RELEASE_LLVM_BIN=/mnt/sdb1/upawar/propeller-work/llvm-propeller/build/bin
CREATE_LLVM_PROF_DIR=..
$ cd plo; make check-performance
Am I missing something, or is this an environment issue? Any help is appreciated. TIA.
@tmsri @rlavaee See complete log in a gist at, https://gist.github.com/uttampawar/8f6d1ec0c9627d50066cb2e0ca35859a
Hi Uttam, sorry for the late reply; I just got back from Thanksgiving.
I see lots of gold linker plugin errors in building pgo-labels. In theory, the gold linker should never be involved in the whole process. Let me dig a little to see how this could happen.
Thanks, Han
-- Han Shen | Software Engineer | shenhan@google.com | +1-650-440-3330
@shenhanc78 Okay. Thanks for the followup.
@shenhanc78 Any update? TIA.
Hi Uttam, yup. I've just pushed a new version which forces lld to be used across all builds. You may pull/sync and have another try.
I'm happy to help if you run into any further problems.
Thanks! -Han
That's great. I'll give it a try. Thanks.
Hi @uttampawar, can you share how you built the node binary for your optimization experiment?