joinr / paralleltest

Exploration into actual parallelism and the things that prevent us from using cores in Clojure
Eclipse Public License 2.0

profiling discussion #1

Open · bsless opened this issue 3 years ago

bsless commented 3 years ago

In trying to understand the differences in results, I tried to narrow in on one specific case: numeric work, 1 task vs. 8 tasks, profiled using perf:

1 task:

         13,420.83 msec task-clock                #    1.148 CPUs utilized          
             3,597      context-switches          #  268.016 /sec                   
                41      cpu-migrations            #    3.055 /sec                   
            70,980      page-faults               #    5.289 K/sec                  
    59,213,265,355      cycles                    #    4.412 GHz                      (75.11%)
        79,933,962      stalled-cycles-frontend   #    0.13% frontend cycles idle     (74.94%)
        42,064,297      stalled-cycles-backend    #    0.07% backend cycles idle      (75.03%)
   215,059,220,826      instructions              #    3.63  insn per cycle         
                                                  #    0.00  stalled cycles per insn  (74.90%)
    52,812,484,326      branches                  #    3.935 G/sec                    (74.74%)
        87,093,019      branch-misses             #    0.16% of all branches          (74.89%)
    55,639,651,014      L1-dcache-loads           #    4.146 G/sec                    (75.45%)
       278,077,033      L1-dcache-load-misses     #    0.50% of all L1-dcache accesses  (75.43%)
   <not supported>      LLC-loads                                                   
   <not supported>      LLC-load-misses                                             

      11.686483240 seconds time elapsed

      13.271356000 seconds user
       0.160257000 seconds sys

8 tasks:

        124,407.34 msec task-clock                #    6.216 CPUs utilized          
             6,711      context-switches          #   53.944 /sec                   
               785      cpu-migrations            #    6.310 /sec                   
            68,990      page-faults               #  554.549 /sec                   
   571,763,633,739      cycles                    #    4.596 GHz                      (75.06%)
       230,862,488      stalled-cycles-frontend   #    0.04% frontend cycles idle     (75.02%)
       182,163,236      stalled-cycles-backend    #    0.03% backend cycles idle      (74.97%)
 1,622,077,543,360      instructions              #    2.84  insn per cycle         
                                                  #    0.00  stalled cycles per insn  (74.99%)
   402,594,512,871      branches                  #    3.236 G/sec                    (75.02%)
        94,137,299      branch-misses             #    0.02% of all branches          (74.97%)
   406,620,206,004      L1-dcache-loads           #    3.268 G/sec                    (74.97%)
       322,458,701      L1-dcache-load-misses     #    0.08% of all L1-dcache accesses  (75.05%)
   <not supported>      LLC-loads                                                   
   <not supported>      LLC-load-misses                                             

      20.015600697 seconds time elapsed

     124.211218000 seconds user
       0.206246000 seconds sys

We can easily see lower throughput (2.84 vs. 3.63 instructions per cycle), many more cache loads, and more stalled cycles. I had hoped cache misses might explain the discrepancy, but that doesn't seem to be the case. Do you see anything here that might explain the results?
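For reference, the shape of the comparison above looks roughly like this in Clojure. This is a minimal sketch only: `numeric-work`, the iteration count, and `run-tasks` are hypothetical placeholders, not the repo's actual benchmark code.

```clojure
;; Hypothetical sketch: numeric-work and run-tasks are placeholders,
;; not the repo's actual benchmark.
(defn numeric-work [n]
  (loop [i 0 acc 0.0]
    (if (< i n)
      (recur (inc i) (+ acc (Math/sqrt (double i))))
      acc)))

(defn run-tasks [n-tasks iters]
  ;; mapv eagerly launches one future per task, then derefs them all.
  (->> (range n-tasks)
       (mapv (fn [_] (future (numeric-work iters))))
       (mapv deref)))

(comment
  (time (run-tasks 1 100000000))   ;; the "1 task" case
  (time (run-tasks 8 100000000))) ;; the "8 tasks" case
```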

joinr commented 3 years ago

Thanks, interesting data. In retrospect, I am looking at alternative approaches for the numeric work example. Curious if there's a way to peg a thread to a CPU via affinity from the JVM. Also curious if invoking mapv is causing anything funny to happen.
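On the affinity question: the JDK itself exposes no portable way to pin a thread to a core, so it would have to go through a third-party library. A rough interop sketch, assuming the OpenHFT Java-Thread-Affinity library (net.openhft/affinity) is on the classpath; exact API details may vary by version.

```clojure
;; Assumes the third-party OpenHFT Java-Thread-Affinity library
;; (net.openhft/affinity) is available; API details may vary by version.
(import '(net.openhft.affinity AffinityLock))

(defn with-pinned-core
  "Runs f on the calling thread while it holds an affinity lock, pinning the
  thread to a single core for the duration of the call."
  [f]
  (let [lock (AffinityLock/acquireLock)]
    (try
      (f)
      (finally
        (.release lock)))))

(comment
  ;; e.g. pin each worker thread before it starts its numeric work
  (future (with-pinned-core #(reduce + (range 100000000)))))
```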

bsless commented 3 years ago

> Also curious if invoking mapv is causing anything funny to happen

Wondering about that as well. The next item on the list to try is creating a CompletableFuture per task and joining on all of them at the end. The idea is to avoid a situation where an object is shared between multiple threads. Also, all the extra cache misses are from thread parking.
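A rough sketch of that idea, assuming plain CompletableFuture interop from Clojure; each task keeps its own result and nothing is shared across threads until the final joins. The task function and counts are placeholders.

```clojure
(import '(java.util.concurrent CompletableFuture)
        '(java.util.function Supplier))

(defn supply-async
  "Wraps a Clojure fn as a java.util.function.Supplier and runs it via
  CompletableFuture/supplyAsync on the common ForkJoinPool."
  ^CompletableFuture [f]
  (CompletableFuture/supplyAsync
   (reify Supplier
     (get [_] (f)))))

(defn run-tasks-cf
  "Launches n-tasks independent CompletableFutures and joins them at the end;
  each task keeps its own result object until the final join."
  [n-tasks task-fn]
  (let [cfs (mapv (fn [_] (supply-async task-fn)) (range n-tasks))]
    (mapv (fn [^CompletableFuture cf] (.join cf)) cfs)))

(comment
  (time (run-tasks-cf 8 #(reduce + (range 100000000)))))
```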

joinr commented 3 years ago

Yeah, during early discussions I went down the rabbit hole of thread-local state (and thread-local allocation buffers, TLABs), etc. Could be another thing. I am not well versed enough in JVM internals to control the low-level thread-locality details, but I am definitely interested.

bsless commented 3 years ago

Not necessarily related to TLABs; I was using this link as a reference: https://www.baeldung.com/java-false-sharing-contended
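To make the false-sharing effect that article describes concrete, here is a small illustration (not code from this repo): worker threads increment their own slot of a shared AtomicLongArray, either in adjacent slots (which land on the same cache line) or in slots padded 8 longs (64 bytes) apart. The 64-byte cache-line size is an assumption.

```clojure
;; Illustration of false sharing only, not code from this repo.
;; Assumes 64-byte cache lines.
(import '(java.util.concurrent.atomic AtomicLongArray))

(defn hammer-slots
  "Each of n-threads repeatedly increments its own slot of a shared
  AtomicLongArray. With stride 1 the slots share cache lines (false sharing);
  with stride 8 each slot gets its own 64-byte line."
  [n-threads iters stride]
  (let [arr (AtomicLongArray. (int (* n-threads stride)))
        workers (mapv (fn [t]
                        (future
                          (dotimes [_ iters]
                            (.incrementAndGet arr (int (* t stride))))))
                      (range n-threads))]
    (run! deref workers)
    (mapv (fn [t] (.get arr (int (* t stride)))) (range n-threads))))

(comment
  (time (hammer-slots 8 10000000 1))   ;; adjacent slots: heavy false sharing
  (time (hammer-slots 8 10000000 8))) ;; padded slots: one cache line per thread
```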