dvoidus opened this issue 1 year ago
1c2513 is fine and keeps generation at 34 t/s; on f97561 I already see a drop to 30 t/s.
I did some more really heavy tuning for the 4090 and 3090, so it's not too surprising if it's less ideal for the H100. I'm in the process of adding tuning parameters so I can try it on a range of cloud GPUs and get a better sense of what's what.
In the meantime you could try the latest commit (small bugfix) and run with -fmt 0 to disable the fused MLP.
There are also some tunable parameters in exllama/cuda_func/q4_matmul.cu that aren't exposed to the CLI yet. On lines 136 and 137 you can try different block sizes for the matmul. They don't have to be multiples of 128, but it generally works best when they're multiples of the model's groupsize. I'm still figuring out how this relates to performance, the L2 cache, and CUDA spawning thousands of threads all over the place.
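To illustrate the "multiples of the groupsize" rule of thumb, here is a small sketch; `candidate_block_sizes` is a hypothetical helper, not part of exllama, and the extra "divides the weight dimension evenly" constraint is an assumption for the example:

```python
# Enumerate matmul block sizes that are multiples of the quantization
# groupsize and also divide the weight dimension evenly (hypothetical
# helper, not part of exllama).
def candidate_block_sizes(groupsize, dim, max_size=512):
    return [b for b in range(groupsize, max_size + 1, groupsize) if dim % b == 0]

# e.g. groupsize 128 with a 4096-wide weight matrix:
print(candidate_block_sizes(128, 4096))  # [128, 256, 512]
```

Note that 384 is a multiple of 128 but is filtered out because it does not divide 4096 evenly.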
config.matmul_recons_thd = 8
config.fused_mlp_thd = 0
config.sdp_thd = 8
Still runs at 25 t/s on the latest commit.
Experimenting with block_size_z doesn't really make any difference (I tried increasing it all the way up to 512).
I also checked on my 4090; I'm getting a stable 38 t/s on the latest commit.
By the way, isn't this a typo in model.py?
# Tuning
self.matmul_recons_thd = 8
self.fused_mlp_thd = 2
self.stp_thd = 8
It should be self.sdp_thd.
I am getting +10 tokens/sec for 7B and 13B models on a 4090 and an A6000 (Ampere), and about the same speed as before for 33B/65B.
Wow, nice work @turboderp! Seeing a +60% perf improvement on a 3060 Ti (55 t/s sustained). I will test a few more GPUs later today.
Typo is fixed. Thanks. But attention probably isn't the issue anyway. I guess I'll have to add a profiling mode to time the CUDA kernel launches, since the performance profiles are so different across architectures. I really have no idea right now why it's slower on Hopper than on Ada, or why the recent version is slower than the older one.
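The profiling mode mentioned here could start as little more than a synchronized timer around each extension call. A minimal sketch, assuming a `synchronize` callable is injected (on a real GPU this would be something like `torch.cuda.synchronize`):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(name, synchronize, log):
    synchronize()                        # drain any pending kernel launches first
    start = time.perf_counter()
    yield
    synchronize()                        # wait for the timed kernels to finish
    log.append((name, time.perf_counter() - start))

# Usage with a no-op synchronize and a stand-in workload:
log = []
with timed("q4_matmul", synchronize=lambda: None, log=log):
    _ = sum(i * i for i in range(100_000))
name, seconds = log[0]
print(name, seconds >= 0.0)
```

Without the synchronize calls the timer would only measure launch overhead, since CUDA kernels run asynchronously with respect to the host.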
I didn't actually change much in the CUDA code; I just moved more stuff from Python to C++, and mostly trivial stuff too. E.g. instead of passing five separate PyTorch tensors to every C++ function, it now passes a pointer to a C++ object that references the five underlying storages. There's also strictly less initialization than before.
On a 3060 Ti, -fmt 0 drops inference from 55 t/s to 45 t/s.
I did a profile run (on h100) in case it could give you some hints:
---------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
---------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::matmul 18.51% 1.568s 37.92% 3.214s 138.329us 841.665ms 9.61% 3.498s 150.557us 23231
cudaLaunchKernel 10.20% 864.248ms 10.20% 864.248ms 3.248us 0.000us 0.00% 0.000us 0.000us 266097
aten::transpose 8.60% 728.462ms 8.65% 732.913ms 12.682us 545.773ms 6.23% 765.174ms 13.240us 57791
aten::reshape 7.99% 677.284ms 8.76% 742.347ms 12.719us 669.239ms 7.64% 1.004s 17.201us 58365
aten::slice 7.23% 612.879ms 7.31% 619.307ms 12.852us 586.072ms 6.69% 820.184ms 17.020us 48189
aten::narrow 6.78% 574.774ms 13.83% 1.172s 25.221us 503.891ms 5.75% 1.296s 27.902us 46462
aten::expand 6.70% 567.671ms 6.73% 570.544ms 12.381us 558.553ms 6.37% 787.119ms 17.081us 46082
aten::bmm 4.42% 374.209ms 5.43% 460.237ms 19.976us 979.146ms 11.17% 979.146ms 42.498us 23040
aten::empty 3.44% 291.212ms 3.44% 291.212ms 3.558us 426.136ms 4.86% 426.136ms 5.206us 81851
aten::empty_like 3.20% 271.521ms 4.24% 359.016ms 15.288us 181.222ms 2.07% 304.306ms 12.959us 23483
aten::softmax 3.18% 269.519ms 4.79% 406.127ms 34.679us 196.191ms 2.24% 470.002ms 40.133us 11711
aten::view 3.10% 262.950ms 3.10% 262.950ms 1.084us 621.798ms 7.10% 621.798ms 2.564us 242495
aten::add 2.57% 217.911ms 3.48% 294.660ms 12.789us 115.613ms 1.32% 115.613ms 5.018us 23040
aten::copy_ 2.11% 178.403ms 3.78% 319.924ms 12.987us 403.234ms 4.60% 404.423ms 16.417us 24635
aten::silu 1.35% 114.457ms 1.79% 151.923ms 13.188us 50.507ms 0.58% 50.507ms 4.384us 11520
aten::div_ 1.24% 105.307ms 1.75% 148.533ms 12.543us 204.609ms 2.34% 212.245ms 17.923us 11842
aten::_softmax 1.18% 100.120ms 1.61% 136.598ms 11.664us 220.955ms 2.52% 220.955ms 18.867us 11711
aten::empty_strided 1.06% 90.108ms 1.06% 90.108ms 3.611us 130.414ms 1.49% 130.414ms 5.226us 24954
aten::mul_ 1.00% 84.874ms 1.43% 121.183ms 10.519us 49.912ms 0.57% 49.912ms 4.333us 11520
cudaMemsetAsync 0.78% 66.330ms 0.78% 66.330ms 2.797us 0.000us 0.00% 0.000us 0.000us 23711
aten::_reshape_alias 0.72% 60.677ms 0.72% 60.677ms 1.041us 332.195ms 3.79% 332.195ms 5.698us 58305
aten::_to_copy 0.41% 35.070ms 1.23% 104.265ms 68.147us 22.718ms 0.26% 51.989ms 33.980us 1530
cudaMemcpyAsync 0.40% 34.254ms 0.40% 34.254ms 89.203us 0.000us 0.00% 0.000us 0.000us 384
aten::multinomial 0.39% 33.354ms 1.03% 87.426ms 457.728us 21.652ms 0.25% 88.339ms 462.508us 191
aten::topk 0.33% 27.825ms 0.33% 27.825ms 145.681us 28.954ms 0.33% 28.954ms 151.592us 191
cudaStreamSynchronize 0.32% 26.784ms 0.32% 26.784ms 69.750us 0.000us 0.00% 0.000us 0.000us 384
aten::div 0.28% 23.347ms 0.28% 24.138ms 48.084us 6.949ms 0.08% 6.949ms 13.843us 502
aten::_unsafe_view 0.28% 23.337ms 0.28% 23.337ms 1.002us 44.064ms 0.50% 44.064ms 1.892us 23291
aten::select 0.22% 19.008ms 0.22% 19.049ms 12.402us 19.183ms 0.22% 26.470ms 17.233us 1536
aten::to 0.20% 16.539ms 1.43% 120.804ms 4.941us 122.761ms 1.40% 174.750ms 7.147us 24451
aten::item 0.17% 14.669ms 0.17% 14.704ms 10.981us 14.706ms 0.17% 20.947ms 15.644us 1339
aten::as_strided 0.16% 13.818ms 0.16% 13.818ms 0.089us 694.881ms 7.93% 694.881ms 4.490us 154745
aten::sum 0.15% 12.627ms 0.15% 12.637ms 33.081us 8.352ms 0.10% 12.438ms 32.560us 382
aten::fill_ 0.14% 12.233ms 0.19% 15.866ms 11.849us 19.605ms 0.22% 22.566ms 16.853us 1339
aten::index_select 0.12% 10.378ms 0.18% 15.182ms 79.073us 8.597ms 0.10% 16.075ms 83.724us 192
aten::index 0.12% 10.005ms 0.17% 14.645ms 38.338us 8.405ms 0.10% 16.607ms 43.474us 382
INVALID 0.09% 7.979ms 0.09% 7.979ms 18.998us 0.000us 0.00% 0.000us 0.000us 420
aten::embedding 0.08% 6.905ms 0.29% 24.847ms 129.411us 5.045ms 0.06% 25.625ms 133.464us 192
_scaled_dot_product_attention_math 0.08% 6.713ms 0.75% 63.756ms 1.063ms 1.408ms 0.02% 118.151ms 1.969ms 60
aten::cat 0.07% 5.932ms 0.17% 14.647ms 76.686us 5.125ms 0.06% 15.620ms 81.780us 191
aten::unsqueeze 0.06% 4.678ms 0.06% 4.693ms 12.253us 4.675ms 0.05% 6.535ms 17.063us 383
aten::zeros 0.05% 4.499ms 0.13% 11.062ms 57.615us 3.500ms 0.04% 11.869ms 61.818us 192
aten::min 0.05% 4.282ms 0.05% 4.294ms 22.482us 3.359ms 0.04% 5.181ms 27.126us 191
aten::linear 0.05% 4.220ms 0.27% 23.002ms 120.429us 2.404ms 0.03% 63.718ms 333.602us 191
aten::max 0.05% 4.183ms 0.05% 4.191ms 21.942us 3.280ms 0.04% 5.094ms 26.670us 191
aten::add_ 0.05% 3.963ms 0.13% 10.638ms 42.382us 17.812ms 0.20% 24.956ms 99.426us 251
aten::mm 0.04% 3.673ms 0.05% 4.507ms 23.597us 49.462ms 0.56% 49.462ms 258.963us 191
aten::ge 0.04% 2.983ms 0.12% 10.048ms 52.607us 3.028ms 0.03% 10.988ms 57.529us 191
aten::eq 0.03% 2.649ms 0.11% 9.034ms 46.808us 2.719ms 0.03% 9.985ms 51.736us 193
aten::lt 0.03% 2.635ms 0.11% 9.028ms 47.267us 2.695ms 0.03% 9.980ms 52.251us 191
aten::zero_ 0.03% 2.469ms 0.06% 5.306ms 27.635us 2.363ms 0.03% 6.176ms 32.167us 192
aten::t 0.03% 2.424ms 0.06% 4.680ms 24.503us 2.228ms 0.03% 5.119ms 26.801us 191
aten::argmax 0.03% 2.208ms 0.03% 2.208ms 11.560us 2.229ms 0.03% 3.083ms 16.141us 191
aten::clone 0.02% 2.012ms 0.05% 4.432ms 71.484us 390.000us 0.00% 2.543ms 41.016us 62
cudaFuncGetAttributes 0.02% 1.403ms 0.02% 1.403ms 3.340us 0.000us 0.00% 0.000us 0.000us 420
aten::scaled_dot_product_attention 0.01% 1.211ms 0.77% 64.967ms 1.083ms 224.000us 0.00% 118.375ms 1.973ms 60
aten::exponential_ 0.01% 1.202ms 0.01% 1.202ms 6.293us 2.132ms 0.02% 2.132ms 11.162us 191
aten::triu 0.01% 706.000us 0.01% 706.000us 706.000us 721.000us 0.01% 721.000us 721.000us 1
aten::bitwise_and 0.00% 396.000us 0.00% 396.000us 2.073us 1.315ms 0.02% 1.315ms 6.885us 191
cudaFuncSetAttribute 0.00% 382.000us 0.00% 382.000us 0.303us 0.000us 0.00% 0.000us 0.000us 1260
cudaOccupancyMaxActive... 0.00% 217.000us 0.00% 217.000us 1.340us 0.000us 0.00% 0.000us 0.000us 162
aten::resize_ 0.00% 94.000us 0.00% 94.000us 0.492us 1.000ms 0.01% 1.000ms 5.236us 191
aten::full 0.00% 41.000us 0.10% 8.803ms 8.803ms 31.000us 0.00% 8.812ms 8.812ms 1
aten::is_nonzero 0.00% 35.000us 0.00% 73.000us 36.500us 32.000us 0.00% 84.000us 42.000us 2
aten::_local_scalar_dense 0.00% 35.000us 0.00% 35.000us 0.026us 6.241ms 0.07% 6.241ms 4.661us 1339
aten::expand_as 0.00% 20.000us 0.00% 39.000us 39.000us 20.000us 0.00% 47.000us 47.000us 1
aten::detach_ 0.00% 19.000us 0.00% 25.000us 25.000us 18.000us 0.00% 31.000us 31.000us 1
cudaDeviceSynchronize 0.00% 12.000us 0.00% 12.000us 12.000us 0.000us 0.00% 0.000us 0.000us 1
detach_ 0.00% 6.000us 0.00% 6.000us 6.000us 13.000us 0.00% 13.000us 13.000us 1
aten::lift_fresh 0.00% 1.000us 0.00% 1.000us 0.005us 870.000us 0.01% 870.000us 4.531us 192
aten::resolve_conj 0.00% 0.000us 0.00% 0.000us 0.000us 3.000us 0.00% 3.000us 3.000us 1
aten::resolve_neg 0.00% 0.000us 0.00% 0.000us 0.000us 2.000us 0.00% 2.000us 2.000us 1
---------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
What CUDA and PyTorch version is this? Lots of the ops look very slow on a per-call basis.
2.0.1+cu118
There's something screwy going on if the Torch matmul is taking that much CPU time. It has to be a synchronization issue; otherwise I don't know what to make of that. Could you do the same profiling on commit 1c2513, just to compare?
----------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
----------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::matmul 19.15% 1.393s 39.91% 2.902s 140.269us 747.086ms 9.65% 3.133s 151.413us 20690
cudaLaunchKernel 10.55% 767.541ms 10.55% 767.541ms 3.384us 0.000us 0.00% 0.000us 0.000us 226812
aten::transpose 8.61% 625.975ms 8.75% 636.075ms 12.358us 562.149ms 7.26% 795.347ms 15.453us 51470
aten::reshape 8.57% 622.928ms 9.71% 705.957ms 13.581us 598.814ms 7.73% 905.141ms 17.413us 51981
aten::slice 7.41% 538.733ms 7.59% 552.014ms 12.862us 532.834ms 6.88% 749.523ms 17.464us 42918
aten::expand 6.94% 504.815ms 6.99% 508.267ms 12.384us 495.776ms 6.40% 699.206ms 17.036us 41042
aten::narrow 6.62% 481.086ms 13.94% 1.014s 24.493us 471.784ms 6.09% 1.199s 28.983us 41380
aten::bmm 4.94% 359.567ms 6.03% 438.265ms 21.358us 891.672ms 11.51% 891.672ms 43.454us 20520
aten::softmax 3.30% 240.013ms 5.07% 368.813ms 35.361us 174.198ms 2.25% 425.309ms 40.777us 10430
aten::empty_like 3.29% 238.921ms 4.46% 324.682ms 15.519us 157.629ms 2.04% 262.427ms 12.544us 20921
aten::empty 2.39% 173.972ms 2.39% 173.972ms 4.072us 249.085ms 3.22% 249.085ms 5.830us 42725
aten::view 2.36% 171.523ms 2.36% 171.523ms 1.108us 418.236ms 5.40% 418.236ms 2.702us 154772
aten::copy_ 2.30% 167.346ms 4.42% 321.699ms 14.658us 371.783ms 4.80% 373.009ms 16.996us 21947
aten::add 1.57% 113.929ms 2.08% 151.290ms 14.660us 161.769ms 2.09% 161.769ms 15.675us 10320
cudaMemsetAsync 1.51% 109.967ms 1.51% 109.967ms 3.505us 0.000us 0.00% 0.000us 0.000us 31370
aten::div_ 1.39% 101.319ms 1.92% 139.900ms 13.273us 187.846ms 2.43% 194.598ms 18.463us 10540
aten::_softmax 1.31% 94.964ms 1.77% 128.789ms 12.348us 203.503ms 2.63% 203.503ms 19.511us 10430
aten::empty_strided 1.21% 88.159ms 1.21% 88.159ms 3.967us 111.330ms 1.44% 111.330ms 5.009us 22224
aten::add_ 1.09% 79.414ms 1.62% 117.589ms 11.274us 58.249ms 0.75% 64.529ms 6.187us 10430
aten::_reshape_alias 0.76% 54.917ms 0.76% 54.917ms 1.058us 304.103ms 3.93% 304.103ms 5.857us 51921
aten::_to_copy 0.43% 30.979ms 1.28% 93.276ms 68.485us 20.049ms 0.26% 46.327ms 34.014us 1362
cudaMemcpyAsync 0.42% 30.318ms 0.42% 30.318ms 88.649us 0.000us 0.00% 0.000us 0.000us 342
aten::multinomial 0.39% 28.702ms 1.05% 76.423ms 449.547us 19.197ms 0.25% 77.245ms 454.382us 170
aten::as_strided 0.37% 26.904ms 0.37% 26.904ms 0.195us 664.572ms 8.58% 664.572ms 4.822us 137819
aten::topk 0.35% 25.434ms 0.35% 25.434ms 149.612us 26.468ms 0.34% 26.468ms 155.694us 170
cudaStreamSynchronize 0.34% 24.483ms 0.34% 24.483ms 71.588us 0.000us 0.00% 0.000us 0.000us 342
aten::_unsafe_view 0.30% 21.722ms 0.30% 21.722ms 1.047us 35.586ms 0.46% 35.586ms 1.715us 20750
aten::select 0.23% 16.929ms 0.23% 16.977ms 12.410us 17.076ms 0.22% 23.495ms 17.175us 1368
aten::to 0.20% 14.598ms 1.48% 107.874ms 4.957us 106.794ms 1.38% 153.121ms 7.036us 21763
aten::item 0.17% 12.633ms 0.17% 12.665ms 10.625us 12.763ms 0.16% 18.275ms 15.331us 1192
aten::sum 0.15% 10.813ms 0.15% 10.824ms 31.835us 7.492ms 0.10% 10.651ms 31.326us 340
aten::fill_ 0.13% 9.791ms 0.18% 13.156ms 11.037us 15.972ms 0.21% 18.595ms 15.600us 1192
aten::index 0.12% 8.864ms 0.18% 12.949ms 38.085us 7.430ms 0.10% 14.660ms 43.118us 340
aten::index_select 0.11% 8.220ms 0.17% 12.499ms 73.094us 6.477ms 0.08% 13.125ms 76.754us 171
:_scaled_dot_product_attention_math 0.09% 6.695ms 0.75% 54.488ms 908.133us 1.291ms 0.02% 105.382ms 1.756ms 60
aten::embedding 0.09% 6.298ms 0.29% 21.280ms 124.444us 4.487ms 0.06% 21.627ms 126.474us 171
aten::div 0.07% 5.442ms 0.09% 6.244ms 13.574us 5.518ms 0.07% 5.518ms 11.996us 460
aten::cat 0.07% 5.313ms 0.18% 12.997ms 76.453us 4.577ms 0.06% 13.845ms 81.441us 170
aten::unsqueeze 0.06% 4.150ms 0.06% 4.167ms 12.220us 4.190ms 0.05% 5.814ms 17.050us 341
aten::zeros 0.06% 4.120ms 0.14% 10.231ms 59.830us 3.103ms 0.04% 10.818ms 63.263us 171
aten::min 0.05% 3.789ms 0.05% 3.802ms 22.365us 3.001ms 0.04% 4.600ms 27.059us 170
aten::linear 0.05% 3.788ms 0.29% 20.880ms 122.824us 1.993ms 0.03% 56.254ms 330.906us 170
aten::max 0.05% 3.700ms 0.05% 3.709ms 21.818us 2.904ms 0.04% 4.499ms 26.465us 170
aten::mm 0.05% 3.501ms 0.06% 4.331ms 25.476us 44.283ms 0.57% 44.283ms 260.488us 170
INVALID 0.04% 3.189ms 0.04% 3.189ms 7.593us 0.000us 0.00% 0.000us 0.000us 420
aten::ge 0.04% 2.685ms 0.12% 8.948ms 52.635us 2.740ms 0.04% 9.792ms 57.600us 170
aten::eq 0.03% 2.385ms 0.11% 7.987ms 46.436us 2.442ms 0.03% 8.839ms 51.390us 172
aten::lt 0.03% 2.385ms 0.11% 8.014ms 47.141us 2.425ms 0.03% 8.845ms 52.029us 170
aten::zero_ 0.03% 2.202ms 0.07% 4.944ms 28.912us 2.079ms 0.03% 5.689ms 33.269us 171
aten::t 0.03% 2.162ms 0.06% 4.172ms 24.541us 1.760ms 0.02% 4.088ms 24.047us 170
aten::clone 0.03% 2.161ms 0.39% 28.155ms 454.113us 330.000us 0.00% 2.283ms 36.823us 62
aten::argmax 0.03% 1.956ms 0.03% 1.956ms 11.506us 1.971ms 0.03% 2.730ms 16.059us 170
cudaFuncGetAttributes 0.02% 1.574ms 0.02% 1.574ms 3.748us 0.000us 0.00% 0.000us 0.000us 420
aten::silu 0.02% 1.345ms 0.03% 1.830ms 30.500us 2.534ms 0.03% 2.534ms 42.233us 60
aten::scaled_dot_product_attention 0.02% 1.332ms 0.77% 55.820ms 930.333us 199.000us 0.00% 105.581ms 1.760ms 60
aten::exponential_ 0.02% 1.114ms 0.02% 1.114ms 6.553us 1.969ms 0.03% 1.969ms 11.582us 170
aten::mul_ 0.01% 1.082ms 0.16% 11.955ms 199.250us 3.189ms 0.04% 3.189ms 53.150us 60
aten::triu 0.01% 717.000us 0.01% 717.000us 717.000us 727.000us 0.01% 727.000us 727.000us 1
cudaFuncSetAttribute 0.01% 444.000us 0.01% 444.000us 0.352us 0.000us 0.00% 0.000us 0.000us 1260
aten::bitwise_and 0.00% 357.000us 0.00% 357.000us 2.100us 1.151ms 0.01% 1.151ms 6.771us 170
veBlocksPerMultiprocessorWithFla... 0.00% 198.000us 0.00% 198.000us 1.650us 0.000us 0.00% 0.000us 0.000us 120
aten::resize_ 0.00% 139.000us 0.00% 139.000us 0.818us 889.000us 0.01% 889.000us 5.229us 170
aten::full 0.00% 43.000us 0.09% 6.652ms 6.652ms 31.000us 0.00% 6.659ms 6.659ms 1
aten::is_nonzero 0.00% 39.000us 0.00% 80.000us 40.000us 37.000us 0.00% 94.000us 47.000us 2
aten::_local_scalar_dense 0.00% 32.000us 0.00% 32.000us 0.027us 5.512ms 0.07% 5.512ms 4.624us 1192
aten::detach_ 0.00% 30.000us 0.00% 36.000us 36.000us 20.000us 0.00% 42.000us 42.000us 1
aten::expand_as 0.00% 19.000us 0.00% 38.000us 38.000us 19.000us 0.00% 46.000us 46.000us 1
cudaDeviceSynchronize 0.00% 13.000us 0.00% 13.000us 13.000us 0.000us 0.00% 0.000us 0.000us 1
detach_ 0.00% 6.000us 0.00% 6.000us 6.000us 22.000us 0.00% 22.000us 22.000us 1
aten::lift_fresh 0.00% 1.000us 0.00% 1.000us 0.006us 763.000us 0.01% 763.000us 4.462us 171
aten::resolve_conj 0.00% 0.000us 0.00% 0.000us 0.000us 6.000us 0.00% 6.000us 6.000us 1
aten::resolve_neg 0.00% 0.000us 0.00% 0.000us 0.000us 6.000us 0.00% 6.000us 6.000us 1
----------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Not sure exactly what I'm doing, but here is my Xeon benchmark info with today's commits. Is there a better way to run cProfile?
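One stdlib way to narrow a cProfile run to the hottest calls, without going through torch.utils.bottleneck, is the pstats module. A sketch; `generate_tokens` is a stand-in workload, not the real generation loop:

```python
import cProfile
import io
import pstats

def generate_tokens(n):
    # Stand-in for the generation loop being profiled.
    total = 0
    for _ in range(n):
        total += sum(j * j for j in range(200))
    return total

profiler = cProfile.Profile()
profiler.enable()
generate_tokens(1000)
profiler.disable()

buf = io.StringIO()
stats = pstats.Stats(profiler, stream=buf)
stats.sort_stats("tottime").print_stats(15)  # top 15 by internal time
report = buf.getvalue()
print("function calls" in report)
```

This produces the same "list reduced to 15" style of report shown below, sorted by internal time.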
3090 Xeon v4 PCIE 3x16
Bandwidth 6.8441 GB/s
Bandwidth 6.8758 GB/s
Bandwidth 6.4444 GB/s
P40 Xeon v4 PCIE 3x16
Bandwidth 6.4782 GB/s
Bandwidth 6.4646 GB/s
Bandwidth 6.4195 GB/s
7b:
-- Tokenizer: /home/supermicro/ai/text-generation-webui-testing/models/llama-7b-4bit/tokenizer.model
-- Model config: /home/supermicro/ai/text-generation-webui-testing/models/llama-7b-4bit/config.json
-- Model: /home/supermicro/ai/text-generation-webui-testing/models/llama-7b-4bit/llama-7b-4bit.safetensors
-- Sequence length: 2048
-- Tuning:
-- --matmul_recons_thd: 8
-- --fused_mlp_thd: 2
-- --sdp_thd: 8
-- Options: ['perf']
** Time, Load model: 2.29 seconds
** Time, Load tokenizer: 0.01 seconds
-- Groupsize (inferred): None
-- Act-order (inferred): no
** VRAM, Model: [cuda:0] 3,520.85 MB
-- Warmup pass 1...
** Time, Warmup: 1.32 seconds
-- Warmup pass 2...
** Time, Warmup: 0.35 seconds
-- Warmup pass 3...
** Time, Warmup: 0.35 seconds
-- Inference, first pass.
** Time, Inference: 0.35 seconds
** Speed: 5469.47 tokens/second
-- Generating 128 tokens, 1920 token prompt...
** Speed: 44.28 tokens/second
-- Generating 128 tokens, 4 token prompt...
** Speed: 44.48 tokens/second
** VRAM, Inference: [cuda:0] 1,555.17 MB
** VRAM, Total: [cuda:0] 5,076.02 MB
13b
-- Tokenizer: /home/supermicro/ai/text-generation-webui-testing/models/llama-13b-4bit/tokenizer.model
-- Model config: /home/supermicro/ai/text-generation-webui-testing/models/llama-13b-4bit/config.json
-- Model: /home/supermicro/ai/text-generation-webui-testing/models/llama-13b-4bit/llama-13b-4bit.safetensors
-- Sequence length: 2048
-- Tuning:
-- --matmul_recons_thd: 8
-- --fused_mlp_thd: 2
-- --sdp_thd: 8
-- Options: ['perf']
** Time, Load model: 2.89 seconds
** Time, Load tokenizer: 0.01 seconds
-- Groupsize (inferred): None
-- Act-order (inferred): no
** VRAM, Model: [cuda:0] 6,589.50 MB
-- Warmup pass 1...
** Time, Warmup: 1.71 seconds
-- Warmup pass 2...
** Time, Warmup: 0.63 seconds
-- Warmup pass 3...
** Time, Warmup: 0.63 seconds
-- Inference, first pass.
** Time, Inference: 0.63 seconds
** Speed: 3033.24 tokens/second
-- Generating 128 tokens, 1920 token prompt...
** Speed: 37.00 tokens/second
-- Generating 128 tokens, 4 token prompt...
** Speed: 37.17 tokens/second
** VRAM, Inference: [cuda:0] 2,254.17 MB
** VRAM, Total: [cuda:0] 8,843.67 MB
30b
-- Tokenizer: /home/supermicro/ai/text-generation-webui-testing/models/llama-30b-4bit/tokenizer.model
-- Model config: /home/supermicro/ai/text-generation-webui-testing/models/llama-30b-4bit/config.json
-- Model: /home/supermicro/ai/text-generation-webui-testing/models/llama-30b-4bit/llama-30b-4bit.safetensors
-- Sequence length: 2048
-- Tuning:
-- --matmul_recons_thd: 8
-- --fused_mlp_thd: 2
-- --sdp_thd: 8
-- Options: ['perf']
** Time, Load model: 4.73 seconds
** Time, Load tokenizer: 0.01 seconds
-- Groupsize (inferred): None
-- Act-order (inferred): no
** VRAM, Model: [cuda:0] 16,234.42 MB
-- Warmup pass 1...
** Time, Warmup: 2.61 seconds
-- Warmup pass 2...
** Time, Warmup: 1.46 seconds
-- Warmup pass 3...
** Time, Warmup: 1.47 seconds
-- Inference, first pass.
** Time, Inference: 1.47 seconds
** Speed: 1309.90 tokens/second
-- Generating 128 tokens, 1920 token prompt...
** Speed: 22.07 tokens/second
-- Generating 128 tokens, 4 token prompt...
** Speed: 24.22 tokens/second
** VRAM, Inference: [cuda:0] 3,964.67 MB
** VRAM, Total: [cuda:0] 20,199.09 MB
CUDA_VISIBLE_DEVICES=0 python -m torch.utils.bottleneck test_inference.py
--------------------------------------------------------------------------------
Environment Summary
--------------------------------------------------------------------------------
PyTorch 2.0.1+cu118 DEBUG compiled w/ CUDA 11.8
Running with Python 3.10 and CUDA 11.8.89
`pip3 list` truncated output:
audiolm-pytorch==1.0.6
clip-anytorch==2.5.2
ema-pytorch==0.2.3
lion-pytorch==0.1.2
mypy-extensions==1.0.0
numpy==1.23.5
torch==2.0.1+cu118
torchaudio==2.0.2+cu118
torchcrepe==0.0.18
torchdiffeq==0.2.3
torchsde==0.2.5
torchvision==0.15.2+cu118
vector-quantize-pytorch==1.5.18
--------------------------------------------------------------------------------
cProfile output
--------------------------------------------------------------------------------
3375048 function calls (3279583 primitive calls) in 10.222 seconds
Ordered by: internal time
List reduced from 2529 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
7540 2.150 0.000 2.150 0.000 {method 'to' of 'torch._C._TensorBase' objects}
25824 1.551 0.000 1.551 0.000 {built-in method exllama_ext.q4_matmul}
12800 0.578 0.000 0.578 0.000 {built-in method torch.matmul}
6432 0.542 0.000 4.592 0.001 /home/supermicro/ai/text-generation-webui-testing/repositories/exllama/model.py:273(forward)
90672 0.527 0.000 0.527 0.000 {method 'view' of 'torch._C._TensorBase' objects}
6432 0.334 0.000 5.798 0.001 /home/supermicro/ai/text-generation-webui-testing/repositories/exllama/model.py:363(forward)
6400 0.291 0.000 0.291 0.000 {built-in method exllama_ext.q4_mlp}
26027 0.242 0.000 0.243 0.000 {built-in method torch.empty}
12865 0.218 0.000 0.218 0.000 {method 'copy_' of 'torch._C._TensorBase' objects}
6664 0.216 0.000 0.216 0.000 {built-in method exllama_ext.rms_norm}
12864 0.212 0.000 0.212 0.000 {built-in method exllama_ext.rope_}
25824 0.189 0.000 2.270 0.000 /home/supermicro/ai/text-generation-webui-testing/repositories/exllama/cuda_ext.py:76(ext_q4_matmul)
206 0.165 0.001 0.165 0.001 {method 'read' of '_io.BufferedReader' objects}
25728 0.164 0.000 0.164 0.000 {method 'narrow' of 'torch._C._TensorBase' objects}
739 0.161 0.000 0.181 0.000 {method 'get_tensor' of 'builtins.safe_open' objects}
--------------------------------------------------------------------------------
autograd profiler output (CPU mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
------------------------- ------------ ------------ ------------ ------------ ------------ ------------
aten::to 0.01% 3.000us 82.64% 39.988ms 39.988ms 1
aten::_to_copy 0.04% 19.000us 82.64% 39.985ms 39.985ms 1
aten::copy_ 0.07% 33.000us 78.58% 38.022ms 38.022ms 1
cudaMemcpyAsync 78.51% 37.989ms 78.51% 37.989ms 37.989ms 1
aten::to 0.00% 2.000us 13.92% 6.734ms 6.734ms 1
aten::_to_copy 0.01% 6.000us 13.91% 6.732ms 6.732ms 1
aten::copy_ 0.02% 11.000us 13.88% 6.717ms 6.717ms 1
cudaStreamSynchronize 13.83% 6.692ms 13.83% 6.692ms 6.692ms 1
aten::to 0.00% 2.000us 7.55% 3.655ms 3.655ms 1
aten::_to_copy 0.02% 10.000us 7.55% 3.653ms 3.653ms 1
aten::to 7.43% 3.597ms 7.43% 3.597ms 3.597ms 1
aten::_to_copy 0.02% 10.000us 7.43% 3.596ms 3.596ms 1
aten::to 0.00% 2.000us 7.43% 3.594ms 3.594ms 1
aten::_to_copy 0.02% 9.000us 7.42% 3.592ms 3.592ms 1
aten::to 0.00% 1.000us 7.42% 3.590ms 3.590ms 1
------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 48.386ms
--------------------------------------------------------------------------------
autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
Because the autograd profiler uses the CUDA event API,
the CUDA time column reports approximately max(cuda_time, cpu_time).
Please ignore this output if your code does not use CUDA.
------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::to 0.03% 13.000us 91.25% 38.167ms 38.167ms 9.000us 0.02% 38.210ms 38.210ms 1
aten::_to_copy 0.11% 46.000us 91.21% 38.154ms 38.154ms 19.000us 0.04% 38.201ms 38.201ms 1
aten::copy_ 0.07% 31.000us 91.07% 38.093ms 38.093ms 38.162ms 84.19% 38.162ms 38.162ms 1
cudaMemcpyAsync 90.99% 38.062ms 90.99% 38.062ms 38.062ms 0.000us 0.00% 0.000us 0.000us 1
aten::to 0.03% 11.000us 8.51% 3.558ms 3.558ms 8.000us 0.02% 3.587ms 3.587ms 1
aten::_to_copy 0.06% 27.000us 8.48% 3.547ms 3.547ms 16.000us 0.04% 3.579ms 3.579ms 1
aten::copy_ 0.03% 12.000us 8.40% 3.512ms 3.512ms 3.551ms 7.83% 3.551ms 3.551ms 1
cudaMemcpyAsync 8.37% 3.500ms 8.37% 3.500ms 3.500ms 0.000us 0.00% 0.000us 0.000us 1
aten::to 0.03% 11.000us 8.34% 3.490ms 3.490ms 8.000us 0.02% 3.531ms 3.531ms 1
aten::to 0.03% 11.000us 8.32% 3.481ms 3.481ms 8.000us 0.02% 3.516ms 3.516ms 1
aten::_to_copy 0.07% 28.000us 8.32% 3.479ms 3.479ms 16.000us 0.04% 3.523ms 3.523ms 1
aten::_to_copy 0.06% 27.000us 8.30% 3.470ms 3.470ms 15.000us 0.03% 3.508ms 3.508ms 1
aten::to 0.03% 11.000us 8.29% 3.466ms 3.466ms 7.000us 0.02% 3.508ms 3.508ms 1
aten::_to_copy 0.06% 27.000us 8.26% 3.455ms 3.455ms 15.000us 0.03% 3.501ms 3.501ms 1
aten::copy_ 0.03% 12.000us 8.23% 3.442ms 3.442ms 3.493ms 7.71% 3.493ms 3.493ms 1
------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 41.829ms
Self CUDA time total: 45.327ms
@turboderp I did some extra profiling using the nvidia-smi dmon --gpm-metrics flag, and you can clearly see the difference in utilisation between the latest commit (401fa8) and commit 1c2513 (which has higher performance on the H100 in my case).
I ran nvidia-smi dmon -s p --gpm-metrics=2,3,4,5,7,13, which gives the following readings:
pwr - power consumption
smutil - SM Activity
smocc - SM Occupancy
intutil - Integer Activity
mmaact - Tensor Activity
hmmat - HMMA Tensor Activity
fp16 - FP16 Activity
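To compare runs like these numerically rather than by eye, the dmon samples can be averaged per column. A sketch; `summarize_dmon` is a hypothetical helper that skips the header and unit lines:

```python
def summarize_dmon(lines):
    # Keep only numeric sample rows (skip the header and unit lines),
    # then average each column across the samples.
    rows = [
        [int(tok) for tok in line.split()]
        for line in lines
        if line.strip() and line.split()[0].isdigit()
    ]
    return [sum(col) / len(col) for col in zip(*rows)]

sample = """\
pwr smutil smocc intutil mmaact hmmat fp16
W GPM:% GPM:% GPM:% GPM:% GPM:% GPM:%
290 85 36 53 1 1 5
288 80 33 41 10 10 4"""
print(summarize_dmon(sample.splitlines()))  # [289.0, 82.5, 34.5, 47.0, 5.5, 5.5, 4.5]
```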
run on commit 401fa8 ("slower")
pwr smutil smocc intutil mmaact hmmat fp16
W GPM:% GPM:% GPM:% GPM:% GPM:% GPM:%
91 0 0 0 0 0 0
195 0 0 0 0 0 0
253 42 17 17 9 9 1
288 69 30 43 1 1 4
290 85 36 53 1 1 5
289 85 37 53 1 1 5
291 85 37 53 1 1 5
276 85 37 53 1 1 5
287 85 37 53 1 1 5
290 80 33 41 10 10 4
291 85 36 53 1 1 5
291 85 36 53 1 1 5
292 85 37 53 1 1 5
292 85 37 53 1 1 5
288 80 33 41 10 10 4
289 85 36 53 1 1 5
290 85 36 53 1 1 5
291 85 36 53 1 1 5
290 85 37 53 1 1 5
291 85 37 53 1 1 5
281 85 37 53 1 1 5
304 82 35 51 1 1 5
290 84 34 43 10 10 4
291 85 36 53 1 1 5
291 85 36 53 1 1 5
293 85 37 53 1 1 5
293 85 37 53 1 1 5
285 80 33 41 10 10 4
290 80 33 41 10 10 4
291 85 36 53 1 1 5
291 85 36 53 1 1 5
292 85 37 53 1 1 5
293 85 37 53 1 1 5
240 85 37 53 1 1 5
94 72 31 45 1 1 4
93 0 0 0 0 0 0
run on commit 1c2513 ("faster")
pwr smutil smocc intutil mmaact hmmat fp16
W GPM:% GPM:% GPM:% GPM:% GPM:% GPM:%
87 0 0 0 0 0 0
87 0 0 0 0 0 0
230 54 22 19 10 10 3
250 58 22 30 1 1 5
295 78 30 40 2 2 7
295 78 30 40 2 2 7
298 78 30 40 2 2 7
282 78 30 40 2 2 7
306 74 28 38 2 2 7
298 81 32 34 10 10 6
298 78 29 40 2 2 7
301 78 30 40 2 2 7
310 78 30 40 2 2 7
296 75 30 34 7 7 6
301 80 31 38 5 5 7
301 78 30 40 2 2 7
303 78 30 40 2 2 7
304 78 30 40 2 2 7
296 77 31 31 10 10 5
300 77 29 40 2 2 7
303 78 30 40 2 2 7
300 78 30 40 2 2 7
297 78 30 40 2 2 7
299 76 30 31 10 10 5
298 78 29 40 2 2 7
299 78 30 40 2 2 7
300 78 30 40 2 2 7
131 78 30 40 2 2 7
92 16 6 8 0 0 1
92 0 0 0 0 0 0
The "faster" commit has higher FP16 and tensor activity, as well as lower integer activity and SM utilization/occupancy. Does this give you any insight into the issue?
There is something fishy going on for sure. Higher SM utilization is usually a good thing, so the slower commit is apparently doing extra work for some reason...? Higher GPU power consumption too. I'm very baffled.
But I'm going to be adding CUDA graphs soon, and that may completely change how the H100 reacts to the model. So it's probably best to wait and see what happens with that first.
@dvoidus : Well, I've put graphs on hold for now, because it turns out there's too much overhead per graph launch for it to be beneficial until I compile basically the whole decoder block into a single graph, and that's a huge rewrite. However, tracing showed that there was actually a lot more overhead from PyTorch than I was expecting, so I've been addressing that instead.
I'd be curious how the latest commit performs on the H100.
I tried, but I still see the same speeds (27 t/s in the worst case). The only reproducible difference is that context inference speed increased from 3500 t/s to 4500 t/s. For reference, the "fastest" commit (bec6c9) gives 37 t/s, and 3300 t/s for context.
BTW, on the 4090 it works without degradation and is actually faster (as everyone else reported): 40 t/s worst case and 3000 t/s for context.
I did a test on the latest commit (77545c) and on bec6c9 on the H100 with a 30B model, and I can see consistent performance degradation.
Thoughts?