gunnarmorling / 1brc

1️⃣🐝🏎️ The One Billion Row Challenge -- A fun exploration of how quickly 1B rows from a text file can be aggregated with Java
https://www.morling.dev/blog/one-billion-row-challenge/
Apache License 2.0

jerrinot's improvement: fast path for name len <= 16 #531

Closed jerrinot closed 9 months ago

jerrinot commented 9 months ago

Check List:

Before:

 Performance counter stats for './calculate_average_jerrinot.sh':

         19,145.65 msec task-clock                       #    5.900 CPUs utilized             
             5,793      context-switches                 #  302.575 /sec                      
                81      cpu-migrations                   #    4.231 /sec                      
           232,953      page-faults                      #   12.167 K/sec                     
    54,015,223,236      cycles                           #    2.821 GHz                         (40.76%)
       176,037,764      stalled-cycles-frontend          #    0.33% frontend cycles idle        (41.02%)
     1,826,777,715      stalled-cycles-backend           #    3.38% backend cycles idle         (40.84%)
   100,057,332,275      instructions                     #    1.85  insn per cycle            
                                                  #    0.02  stalled cycles per insn     (40.57%)
    11,213,254,503      branches                         #  585.681 M/sec                       (40.30%)
       603,050,595      branch-misses                    #    5.38% of all branches             (40.13%)
    35,585,906,589      L1-dcache-loads                  #    1.859 G/sec                       (39.89%)
     1,467,102,078      L1-dcache-load-misses            #    4.12% of all L1-dcache accesses   (39.80%)
   <not supported>      LLC-loads                                                             
   <not supported>      LLC-load-misses                                                       
     2,483,522,264      L1-icache-loads                  #  129.717 M/sec                       (39.81%)
         2,332,788      L1-icache-load-misses            #    0.09% of all L1-icache accesses   (39.92%)
       653,639,965      dTLB-loads                       #   34.140 M/sec                       (40.09%)
         5,544,287      dTLB-load-misses                 #    0.85% of all dTLB cache accesses  (40.13%)
         3,356,879      iTLB-loads                       #  175.334 K/sec                       (40.30%)
           545,322      iTLB-load-misses                 #   16.24% of all iTLB cache accesses  (40.34%)
       528,177,172      L1-dcache-prefetches             #   27.587 M/sec                       (40.43%)
   <not supported>      L1-dcache-prefetch-misses                                             

       3.244999882 seconds time elapsed

      16.739830000 seconds user
       2.159510000 seconds sys

After:

 Performance counter stats for './calculate_average_jerrinot.sh':

         17,000.16 msec task-clock                       #    5.761 CPUs utilized             
             4,984      context-switches                 #  293.174 /sec                      
                93      cpu-migrations                   #    5.471 /sec                      
           230,437      page-faults                      #   13.555 K/sec                     
    48,216,840,801      cycles                           #    2.836 GHz                         (40.12%)
       155,327,318      stalled-cycles-frontend          #    0.32% frontend cycles idle        (40.28%)
     1,892,974,253      stalled-cycles-backend           #    3.93% backend cycles idle         (40.42%)
   107,098,386,256      instructions                     #    2.22  insn per cycle            
                                                  #    0.02  stalled cycles per insn     (40.43%)
     9,103,770,509      branches                         #  535.511 M/sec                       (40.47%)
       187,970,886      branch-misses                    #    2.06% of all branches             (40.51%)
    32,904,170,346      L1-dcache-loads                  #    1.936 G/sec                       (40.50%)
     1,346,035,661      L1-dcache-load-misses            #    4.09% of all L1-dcache accesses   (40.59%)
   <not supported>      LLC-loads                                                             
   <not supported>      LLC-load-misses                                                       
     2,333,650,681      L1-icache-loads                  #  137.272 M/sec                       (40.53%)
         2,545,110      L1-icache-load-misses            #    0.11% of all L1-icache accesses   (40.47%)
       634,253,403      dTLB-loads                       #   37.309 M/sec                       (40.40%)
         4,885,566      dTLB-load-misses                 #    0.77% of all dTLB cache accesses  (40.25%)
         4,291,122      iTLB-loads                       #  252.417 K/sec                       (40.16%)
           443,230      iTLB-load-misses                 #   10.33% of all iTLB cache accesses  (40.03%)
       495,150,401      L1-dcache-prefetches             #   29.126 M/sec                       (39.90%)
   <not supported>      L1-dcache-prefetch-misses                                             

       2.951148393 seconds time elapsed

      14.540940000 seconds user
       2.258833000 seconds sys
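
To put the two runs side by side, the relative changes can be worked out directly from the counters above. This is a minimal sketch: the class name `PerfDelta` and its two helper methods are made up for this illustration and are not part of the PR. Most counters were multiplexed (sampled roughly 40% of the time), so the resulting percentages are estimates rather than exact measurements.

```java
public class PerfDelta {

    /** Relative change from before to after, in percent (negative means a reduction). */
    static double percentChange(double before, double after) {
        return (after - before) / before * 100.0;
    }

    /** Instructions retired per CPU cycle. */
    static double ipc(long instructions, long cycles) {
        return (double) instructions / cycles;
    }

    public static void main(String[] args) {
        // Values copied from the two perf stat runs above (before, after).
        System.out.printf("elapsed time:  %+.2f%%%n", percentChange(3.244999882, 2.951148393));
        System.out.printf("cycles:        %+.2f%%%n", percentChange(54_015_223_236L, 48_216_840_801L));
        System.out.printf("instructions:  %+.2f%%%n", percentChange(100_057_332_275L, 107_098_386_256L));
        System.out.printf("branches:      %+.2f%%%n", percentChange(11_213_254_503L, 9_103_770_509L));
        System.out.printf("branch-misses: %+.2f%%%n", percentChange(603_050_595L, 187_970_886L));

        // IPC recomputed from the raw counters; perf reports the same ratio as "insn per cycle".
        System.out.printf("IPC before:    %.2f%n", ipc(100_057_332_275L, 54_015_223_236L));
        System.out.printf("IPC after:     %.2f%n", ipc(107_098_386_256L, 48_216_840_801L));
    }
}
```

With these figures, the run after the change takes about 9% less wall-clock time and about 11% fewer cycles, even though it retires about 7% more instructions. Branch misses drop by about 69%, and IPC rises from about 1.85 to about 2.22.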