dvoidus opened this issue 1 year ago
1c2513 is fine and keeps generation at 34 t/s; on f97561 I already see a drop to 30 t/s.
I did some more really heavy tuning for the 4090 and 3090, so it's not too surprising if it's less ideal for the H100. I'm in the process of adding tuning parameters so I can try it on a range of cloud GPUs and get a better sense of what's what.
In the meantime you could try the latest commit (small bugfix) and run with -fmt 0 to disable the fused MLP.
There are also some tunable parameters in exllama/cuda_func/q4_matmul.cu that aren't exposed to the CLI yet. On lines 136 and 137 you can try different block sizes for the matmul. They don't have to be multiples of 128, but it generally works best when they're multiples of the model's groupsize. I'm still figuring out how this relates to performance, the L2 cache, and CUDA spawning thousands of threads all over the place.
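To illustrate the "multiples of the groupsize" rule of thumb, here is a small sketch; `candidate_block_sizes` is a hypothetical helper, not part of exllama, and the extra "divides the weight dimension evenly" constraint is an assumption for the example:

```python
# Enumerate matmul block sizes that are multiples of the quantization
# groupsize and also divide the weight dimension evenly (hypothetical
# helper, not part of exllama).
def candidate_block_sizes(groupsize, dim, max_size=512):
    return [b for b in range(groupsize, max_size + 1, groupsize) if dim % b == 0]

# e.g. groupsize 128 with a 4096-wide weight matrix:
print(candidate_block_sizes(128, 4096))  # [128, 256, 512]
```

Note that 384 is a multiple of 128 but is filtered out because it does not divide 4096 evenly.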
config.matmul_recons_thd = 8
config.fused_mlp_thd = 0
config.sdp_thd = 8
Still runs at 25 t/s on the latest commit.
Experimenting with block_size_z doesn't really make any difference (I tried increasing it all the way up to 512).
I also checked on my 4090; I'm getting a stable 38 t/s on the latest commit.
By the way, isn't this a typo in model.py?
# Tuning
self.matmul_recons_thd = 8
self.fused_mlp_thd = 2
self.stp_thd = 8
It should be self.sdp_thd.
I am getting +10 tokens/sec for 7B and 13B models on a 4090 and an A6000 (Ampere), and about the same speed as before for 33B/65B.
Wow, nice work @turboderp! Seeing a +60% perf improvement on a 3060 Ti (55 t/s sustained). I will test a few more GPUs later today.
Typo is fixed. Thanks. But attention probably isn't the issue anyway. I guess I'll have to add a profiling mode to time the CUDA kernel launches, since the performance profiles are so different across architectures. I really have no idea right now why it's slower on Hopper than on Ada, or why the recent version is slower than the older one.
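The profiling mode mentioned here could start as little more than a synchronized timer around each extension call. A minimal sketch, assuming a `synchronize` callable is injected (on a real GPU this would be something like `torch.cuda.synchronize`):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(name, synchronize, log):
    synchronize()                        # drain any pending kernel launches first
    start = time.perf_counter()
    yield
    synchronize()                        # wait for the timed kernels to finish
    log.append((name, time.perf_counter() - start))

# Usage with a no-op synchronize and a stand-in workload:
log = []
with timed("q4_matmul", synchronize=lambda: None, log=log):
    _ = sum(i * i for i in range(100_000))
name, seconds = log[0]
print(name, seconds >= 0.0)
```

Without the synchronize calls the timer would only measure launch overhead, since CUDA kernels run asynchronously with respect to the host.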
I didn't actually change much in the CUDA code; I just moved more stuff from Python to C++, and mostly trivial stuff too. E.g. instead of passing five separate PyTorch tensors to every C++ function, it now passes a pointer to a C++ object that references the five underlying storages. There's also strictly less initialization than before.
On a 3060 Ti, -fmt 0 drops inference from 55 t/s to 45 t/s.
I did a profile run (on h100) in case it could give you some hints:
---------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
---------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::matmul 18.51% 1.568s 37.92% 3.214s 138.329us 841.665ms 9.61% 3.498s 150.557us 23231
cudaLaunchKernel 10.20% 864.248ms 10.20% 864.248ms 3.248us 0.000us 0.00% 0.000us 0.000us 266097
aten::transpose 8.60% 728.462ms 8.65% 732.913ms 12.682us 545.773ms 6.23% 765.174ms 13.240us 57791
aten::reshape 7.99% 677.284ms 8.76% 742.347ms 12.719us 669.239ms 7.64% 1.004s 17.201us 58365
aten::slice 7.23% 612.879ms 7.31% 619.307ms 12.852us 586.072ms 6.69% 820.184ms 17.020us 48189
aten::narrow 6.78% 574.774ms 13.83% 1.172s 25.221us 503.891ms 5.75% 1.296s 27.902us 46462
aten::expand 6.70% 567.671ms 6.73% 570.544ms 12.381us 558.553ms 6.37% 787.119ms 17.081us 46082
aten::bmm 4.42% 374.209ms 5.43% 460.237ms 19.976us 979.146ms 11.17% 979.146ms 42.498us 23040
aten::empty 3.44% 291.212ms 3.44% 291.212ms 3.558us 426.136ms 4.86% 426.136ms 5.206us 81851
aten::empty_like 3.20% 271.521ms 4.24% 359.016ms 15.288us 181.222ms 2.07% 304.306ms 12.959us 23483
aten::softmax 3.18% 269.519ms 4.79% 406.127ms 34.679us 196.191ms 2.24% 470.002ms 40.133us 11711
aten::view 3.10% 262.950ms 3.10% 262.950ms 1.084us 621.798ms 7.10% 621.798ms 2.564us 242495
aten::add 2.57% 217.911ms 3.48% 294.660ms 12.789us 115.613ms 1.32% 115.613ms 5.018us 23040
aten::copy_ 2.11% 178.403ms 3.78% 319.924ms 12.987us 403.234ms 4.60% 404.423ms 16.417us 24635
aten::silu 1.35% 114.457ms 1.79% 151.923ms 13.188us 50.507ms 0.58% 50.507ms 4.384us 11520
aten::div_ 1.24% 105.307ms 1.75% 148.533ms 12.543us 204.609ms 2.34% 212.245ms 17.923us 11842
aten::_softmax 1.18% 100.120ms 1.61% 136.598ms 11.664us 220.955ms 2.52% 220.955ms 18.867us 11711
aten::empty_strided 1.06% 90.108ms 1.06% 90.108ms 3.611us 130.414ms 1.49% 130.414ms 5.226us 24954
aten::mul_ 1.00% 84.874ms 1.43% 121.183ms 10.519us 49.912ms 0.57% 49.912ms 4.333us 11520
cudaMemsetAsync 0.78% 66.330ms 0.78% 66.330ms 2.797us 0.000us 0.00% 0.000us 0.000us 23711
aten::_reshape_alias 0.72% 60.677ms 0.72% 60.677ms 1.041us 332.195ms 3.79% 332.195ms 5.698us 58305
aten::_to_copy 0.41% 35.070ms 1.23% 104.265ms 68.147us 22.718ms 0.26% 51.989ms 33.980us 1530
cudaMemcpyAsync 0.40% 34.254ms 0.40% 34.254ms 89.203us 0.000us 0.00% 0.000us 0.000us 384
aten::multinomial 0.39% 33.354ms 1.03% 87.426ms 457.728us 21.652ms 0.25% 88.339ms 462.508us 191
aten::topk 0.33% 27.825ms 0.33% 27.825ms 145.681us 28.954ms 0.33% 28.954ms 151.592us 191
cudaStreamSynchronize 0.32% 26.784ms 0.32% 26.784ms 69.750us 0.000us 0.00% 0.000us 0.000us 384
aten::div 0.28% 23.347ms 0.28% 24.138ms 48.084us 6.949ms 0.08% 6.949ms 13.843us 502
aten::_unsafe_view 0.28% 23.337ms 0.28% 23.337ms 1.002us 44.064ms 0.50% 44.064ms 1.892us 23291
aten::select 0.22% 19.008ms 0.22% 19.049ms 12.402us 19.183ms 0.22% 26.470ms 17.233us 1536
aten::to 0.20% 16.539ms 1.43% 120.804ms 4.941us 122.761ms 1.40% 174.750ms 7.147us 24451
aten::item 0.17% 14.669ms 0.17% 14.704ms 10.981us 14.706ms 0.17% 20.947ms 15.644us 1339
aten::as_strided 0.16% 13.818ms 0.16% 13.818ms 0.089us 694.881ms 7.93% 694.881ms 4.490us 154745
aten::sum 0.15% 12.627ms 0.15% 12.637ms 33.081us 8.352ms 0.10% 12.438ms 32.560us 382
aten::fill_ 0.14% 12.233ms 0.19% 15.866ms 11.849us 19.605ms 0.22% 22.566ms 16.853us 1339
aten::index_select 0.12% 10.378ms 0.18% 15.182ms 79.073us 8.597ms 0.10% 16.075ms 83.724us 192
aten::index 0.12% 10.005ms 0.17% 14.645ms 38.338us 8.405ms 0.10% 16.607ms 43.474us 382
INVALID 0.09% 7.979ms 0.09% 7.979ms 18.998us 0.000us 0.00% 0.000us 0.000us 420
aten::embedding 0.08% 6.905ms 0.29% 24.847ms 129.411us 5.045ms 0.06% 25.625ms 133.464us 192
_scaled_dot_product_attention_math 0.08% 6.713ms 0.75% 63.756ms 1.063ms 1.408ms 0.02% 118.151ms 1.969ms 60
aten::cat 0.07% 5.932ms 0.17% 14.647ms 76.686us 5.125ms 0.06% 15.620ms 81.780us 191
aten::unsqueeze 0.06% 4.678ms 0.06% 4.693ms 12.253us 4.675ms 0.05% 6.535ms 17.063us 383
aten::zeros 0.05% 4.499ms 0.13% 11.062ms 57.615us 3.500ms 0.04% 11.869ms 61.818us 192
aten::min 0.05% 4.282ms 0.05% 4.294ms 22.482us 3.359ms 0.04% 5.181ms 27.126us 191
aten::linear 0.05% 4.220ms 0.27% 23.002ms 120.429us 2.404ms 0.03% 63.718ms 333.602us 191
aten::max 0.05% 4.183ms 0.05% 4.191ms 21.942us 3.280ms 0.04% 5.094ms 26.670us 191
aten::add_ 0.05% 3.963ms 0.13% 10.638ms 42.382us 17.812ms 0.20% 24.956ms 99.426us 251
aten::mm 0.04% 3.673ms 0.05% 4.507ms 23.597us 49.462ms 0.56% 49.462ms 258.963us 191
aten::ge 0.04% 2.983ms 0.12% 10.048ms 52.607us 3.028ms 0.03% 10.988ms 57.529us 191
aten::eq 0.03% 2.649ms 0.11% 9.034ms 46.808us 2.719ms 0.03% 9.985ms 51.736us 193
aten::lt 0.03% 2.635ms 0.11% 9.028ms 47.267us 2.695ms 0.03% 9.980ms 52.251us 191
aten::zero_ 0.03% 2.469ms 0.06% 5.306ms 27.635us 2.363ms 0.03% 6.176ms 32.167us 192
aten::t 0.03% 2.424ms 0.06% 4.680ms 24.503us 2.228ms 0.03% 5.119ms 26.801us 191
aten::argmax 0.03% 2.208ms 0.03% 2.208ms 11.560us 2.229ms 0.03% 3.083ms 16.141us 191
aten::clone 0.02% 2.012ms 0.05% 4.432ms 71.484us 390.000us 0.00% 2.543ms 41.016us 62
cudaFuncGetAttributes 0.02% 1.403ms 0.02% 1.403ms 3.340us 0.000us 0.00% 0.000us 0.000us 420
aten::scaled_dot_product_attention 0.01% 1.211ms 0.77% 64.967ms 1.083ms 224.000us 0.00% 118.375ms 1.973ms 60
aten::exponential_ 0.01% 1.202ms 0.01% 1.202ms 6.293us 2.132ms 0.02% 2.132ms 11.162us 191
aten::triu 0.01% 706.000us 0.01% 706.000us 706.000us 721.000us 0.01% 721.000us 721.000us 1
aten::bitwise_and 0.00% 396.000us 0.00% 396.000us 2.073us 1.315ms 0.02% 1.315ms 6.885us 191
cudaFuncSetAttribute 0.00% 382.000us 0.00% 382.000us 0.303us 0.000us 0.00% 0.000us 0.000us 1260
cudaOccupancyMaxActive... 0.00% 217.000us 0.00% 217.000us 1.340us 0.000us 0.00% 0.000us 0.000us 162
aten::resize_ 0.00% 94.000us 0.00% 94.000us 0.492us 1.000ms 0.01% 1.000ms 5.236us 191
aten::full 0.00% 41.000us 0.10% 8.803ms 8.803ms 31.000us 0.00% 8.812ms 8.812ms 1
aten::is_nonzero 0.00% 35.000us 0.00% 73.000us 36.500us 32.000us 0.00% 84.000us 42.000us 2
aten::_local_scalar_dense 0.00% 35.000us 0.00% 35.000us 0.026us 6.241ms 0.07% 6.241ms 4.661us 1339
aten::expand_as 0.00% 20.000us 0.00% 39.000us 39.000us 20.000us 0.00% 47.000us 47.000us 1
aten::detach_ 0.00% 19.000us 0.00% 25.000us 25.000us 18.000us 0.00% 31.000us 31.000us 1
cudaDeviceSynchronize 0.00% 12.000us 0.00% 12.000us 12.000us 0.000us 0.00% 0.000us 0.000us 1
detach_ 0.00% 6.000us 0.00% 6.000us 6.000us 13.000us 0.00% 13.000us 13.000us 1
aten::lift_fresh 0.00% 1.000us 0.00% 1.000us 0.005us 870.000us 0.01% 870.000us 4.531us 192
aten::resolve_conj 0.00% 0.000us 0.00% 0.000us 0.000us 3.000us 0.00% 3.000us 3.000us 1
aten::resolve_neg 0.00% 0.000us 0.00% 0.000us 0.000us 2.000us 0.00% 2.000us 2.000us 1
---------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
What CUDA and PyTorch version is this? Lots of the ops look very slow on a per-call basis.
2.0.1+cu118
There's something screwy going on if the Torch matmul is taking that much CPU time. It has to be a synchronization issue; otherwise I don't know what to make of that. Could you do the same profiling on commit 1c2513, just to compare?
----------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
----------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::matmul 19.15% 1.393s 39.91% 2.902s 140.269us 747.086ms 9.65% 3.133s 151.413us 20690
cudaLaunchKernel 10.55% 767.541ms 10.55% 767.541ms 3.384us 0.000us 0.00% 0.000us 0.000us 226812
aten::transpose 8.61% 625.975ms 8.75% 636.075ms 12.358us 562.149ms 7.26% 795.347ms 15.453us 51470
aten::reshape 8.57% 622.928ms 9.71% 705.957ms 13.581us 598.814ms 7.73% 905.141ms 17.413us 51981
aten::slice 7.41% 538.733ms 7.59% 552.014ms 12.862us 532.834ms 6.88% 749.523ms 17.464us 42918
aten::expand 6.94% 504.815ms 6.99% 508.267ms 12.384us 495.776ms 6.40% 699.206ms 17.036us 41042
aten::narrow 6.62% 481.086ms 13.94% 1.014s 24.493us 471.784ms 6.09% 1.199s 28.983us 41380
aten::bmm 4.94% 359.567ms 6.03% 438.265ms 21.358us 891.672ms 11.51% 891.672ms 43.454us 20520
aten::softmax 3.30% 240.013ms 5.07% 368.813ms 35.361us 174.198ms 2.25% 425.309ms 40.777us 10430
aten::empty_like 3.29% 238.921ms 4.46% 324.682ms 15.519us 157.629ms 2.04% 262.427ms 12.544us 20921
aten::empty 2.39% 173.972ms 2.39% 173.972ms 4.072us 249.085ms 3.22% 249.085ms 5.830us 42725
aten::view 2.36% 171.523ms 2.36% 171.523ms 1.108us 418.236ms 5.40% 418.236ms 2.702us 154772
aten::copy_ 2.30% 167.346ms 4.42% 321.699ms 14.658us 371.783ms 4.80% 373.009ms 16.996us 21947
aten::add 1.57% 113.929ms 2.08% 151.290ms 14.660us 161.769ms 2.09% 161.769ms 15.675us 10320
cudaMemsetAsync 1.51% 109.967ms 1.51% 109.967ms 3.505us 0.000us 0.00% 0.000us 0.000us 31370
aten::div_ 1.39% 101.319ms 1.92% 139.900ms 13.273us 187.846ms 2.43% 194.598ms 18.463us 10540
aten::_softmax 1.31% 94.964ms 1.77% 128.789ms 12.348us 203.503ms 2.63% 203.503ms 19.511us 10430
aten::empty_strided 1.21% 88.159ms 1.21% 88.159ms 3.967us 111.330ms 1.44% 111.330ms 5.009us 22224
aten::add_ 1.09% 79.414ms 1.62% 117.589ms 11.274us 58.249ms 0.75% 64.529ms 6.187us 10430
aten::_reshape_alias 0.76% 54.917ms 0.76% 54.917ms 1.058us 304.103ms 3.93% 304.103ms 5.857us 51921
aten::_to_copy 0.43% 30.979ms 1.28% 93.276ms 68.485us 20.049ms 0.26% 46.327ms 34.014us 1362
cudaMemcpyAsync 0.42% 30.318ms 0.42% 30.318ms 88.649us 0.000us 0.00% 0.000us 0.000us 342
aten::multinomial 0.39% 28.702ms 1.05% 76.423ms 449.547us 19.197ms 0.25% 77.245ms 454.382us 170
aten::as_strided 0.37% 26.904ms 0.37% 26.904ms 0.195us 664.572ms 8.58% 664.572ms 4.822us 137819
aten::topk 0.35% 25.434ms 0.35% 25.434ms 149.612us 26.468ms 0.34% 26.468ms 155.694us 170
cudaStreamSynchronize 0.34% 24.483ms 0.34% 24.483ms 71.588us 0.000us 0.00% 0.000us 0.000us 342
aten::_unsafe_view 0.30% 21.722ms 0.30% 21.722ms 1.047us 35.586ms 0.46% 35.586ms 1.715us 20750
aten::select 0.23% 16.929ms 0.23% 16.977ms 12.410us 17.076ms 0.22% 23.495ms 17.175us 1368
aten::to 0.20% 14.598ms 1.48% 107.874ms 4.957us 106.794ms 1.38% 153.121ms 7.036us 21763
aten::item 0.17% 12.633ms 0.17% 12.665ms 10.625us 12.763ms 0.16% 18.275ms 15.331us 1192
aten::sum 0.15% 10.813ms 0.15% 10.824ms 31.835us 7.492ms 0.10% 10.651ms 31.326us 340
aten::fill_ 0.13% 9.791ms 0.18% 13.156ms 11.037us 15.972ms 0.21% 18.595ms 15.600us 1192
aten::index 0.12% 8.864ms 0.18% 12.949ms 38.085us 7.430ms 0.10% 14.660ms 43.118us 340
aten::index_select 0.11% 8.220ms 0.17% 12.499ms 73.094us 6.477ms 0.08% 13.125ms 76.754us 171
:_scaled_dot_product_attention_math 0.09% 6.695ms 0.75% 54.488ms 908.133us 1.291ms 0.02% 105.382ms 1.756ms 60
aten::embedding 0.09% 6.298ms 0.29% 21.280ms 124.444us 4.487ms 0.06% 21.627ms 126.474us 171
aten::div 0.07% 5.442ms 0.09% 6.244ms 13.574us 5.518ms 0.07% 5.518ms 11.996us 460
aten::cat 0.07% 5.313ms 0.18% 12.997ms 76.453us 4.577ms 0.06% 13.845ms 81.441us 170
aten::unsqueeze 0.06% 4.150ms 0.06% 4.167ms 12.220us 4.190ms 0.05% 5.814ms 17.050us 341
aten::zeros 0.06% 4.120ms 0.14% 10.231ms 59.830us 3.103ms 0.04% 10.818ms 63.263us 171
aten::min 0.05% 3.789ms 0.05% 3.802ms 22.365us 3.001ms 0.04% 4.600ms 27.059us 170
aten::linear 0.05% 3.788ms 0.29% 20.880ms 122.824us 1.993ms 0.03% 56.254ms 330.906us 170
aten::max 0.05% 3.700ms 0.05% 3.709ms 21.818us 2.904ms 0.04% 4.499ms 26.465us 170
aten::mm 0.05% 3.501ms 0.06% 4.331ms 25.476us 44.283ms 0.57% 44.283ms 260.488us 170
INVALID 0.04% 3.189ms 0.04% 3.189ms 7.593us 0.000us 0.00% 0.000us 0.000us 420
aten::ge 0.04% 2.685ms 0.12% 8.948ms 52.635us 2.740ms 0.04% 9.792ms 57.600us 170
aten::eq 0.03% 2.385ms 0.11% 7.987ms 46.436us 2.442ms 0.03% 8.839ms 51.390us 172
aten::lt 0.03% 2.385ms 0.11% 8.014ms 47.141us 2.425ms 0.03% 8.845ms 52.029us 170
aten::zero_ 0.03% 2.202ms 0.07% 4.944ms 28.912us 2.079ms 0.03% 5.689ms 33.269us 171
aten::t 0.03% 2.162ms 0.06% 4.172ms 24.541us 1.760ms 0.02% 4.088ms 24.047us 170
aten::clone 0.03% 2.161ms 0.39% 28.155ms 454.113us 330.000us 0.00% 2.283ms 36.823us 62
aten::argmax 0.03% 1.956ms 0.03% 1.956ms 11.506us 1.971ms 0.03% 2.730ms 16.059us 170
cudaFuncGetAttributes 0.02% 1.574ms 0.02% 1.574ms 3.748us 0.000us 0.00% 0.000us 0.000us 420
aten::silu 0.02% 1.345ms 0.03% 1.830ms 30.500us 2.534ms 0.03% 2.534ms 42.233us 60
aten::scaled_dot_product_attention 0.02% 1.332ms 0.77% 55.820ms 930.333us 199.000us 0.00% 105.581ms 1.760ms 60
aten::exponential_ 0.02% 1.114ms 0.02% 1.114ms 6.553us 1.969ms 0.03% 1.969ms 11.582us 170
aten::mul_ 0.01% 1.082ms 0.16% 11.955ms 199.250us 3.189ms 0.04% 3.189ms 53.150us 60
aten::triu 0.01% 717.000us 0.01% 717.000us 717.000us 727.000us 0.01% 727.000us 727.000us 1
cudaFuncSetAttribute 0.01% 444.000us 0.01% 444.000us 0.352us 0.000us 0.00% 0.000us 0.000us 1260
aten::bitwise_and 0.00% 357.000us 0.00% 357.000us 2.100us 1.151ms 0.01% 1.151ms 6.771us 170
veBlocksPerMultiprocessorWithFla... 0.00% 198.000us 0.00% 198.000us 1.650us 0.000us 0.00% 0.000us 0.000us 120
aten::resize_ 0.00% 139.000us 0.00% 139.000us 0.818us 889.000us 0.01% 889.000us 5.229us 170
aten::full 0.00% 43.000us 0.09% 6.652ms 6.652ms 31.000us 0.00% 6.659ms 6.659ms 1
aten::is_nonzero 0.00% 39.000us 0.00% 80.000us 40.000us 37.000us 0.00% 94.000us 47.000us 2
aten::_local_scalar_dense 0.00% 32.000us 0.00% 32.000us 0.027us 5.512ms 0.07% 5.512ms 4.624us 1192
aten::detach_ 0.00% 30.000us 0.00% 36.000us 36.000us 20.000us 0.00% 42.000us 42.000us 1
aten::expand_as 0.00% 19.000us 0.00% 38.000us 38.000us 19.000us 0.00% 46.000us 46.000us 1
cudaDeviceSynchronize 0.00% 13.000us 0.00% 13.000us 13.000us 0.000us 0.00% 0.000us 0.000us 1
detach_ 0.00% 6.000us 0.00% 6.000us 6.000us 22.000us 0.00% 22.000us 22.000us 1
aten::lift_fresh 0.00% 1.000us 0.00% 1.000us 0.006us 763.000us 0.01% 763.000us 4.462us 171
aten::resolve_conj 0.00% 0.000us 0.00% 0.000us 0.000us 6.000us 0.00% 6.000us 6.000us 1
aten::resolve_neg 0.00% 0.000us 0.00% 0.000us 0.000us 6.000us 0.00% 6.000us 6.000us 1
----------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Not sure exactly what I'm doing, but here is my Xeon benchmark info with today's commits. Is there a better way to run cProfile?
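One stdlib way to narrow a cProfile run to the hottest calls, without going through torch.utils.bottleneck, is the pstats module. A sketch; `generate_tokens` is a stand-in workload, not the real generation loop:

```python
import cProfile
import io
import pstats

def generate_tokens(n):
    # Stand-in for the generation loop being profiled.
    total = 0
    for _ in range(n):
        total += sum(j * j for j in range(200))
    return total

profiler = cProfile.Profile()
profiler.enable()
generate_tokens(1000)
profiler.disable()

buf = io.StringIO()
stats = pstats.Stats(profiler, stream=buf)
stats.sort_stats("tottime").print_stats(15)  # top 15 by internal time
report = buf.getvalue()
print("function calls" in report)
```

This produces the same "list reduced to 15" style of report shown below, sorted by internal time.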
3090 Xeon v4 PCIE 3x16
Bandwidth 6.8441 GB/s
Bandwidth 6.8758 GB/s
Bandwidth 6.4444 GB/s
P40 Xeon v4 PCIE 3x16
Bandwidth 6.4782 GB/s
Bandwidth 6.4646 GB/s
Bandwidth 6.4195 GB/s
7b:
-- Tokenizer: /home/supermicro/ai/text-generation-webui-testing/models/llama-7b-4bit/tokenizer.model
-- Model config: /home/supermicro/ai/text-generation-webui-testing/models/llama-7b-4bit/config.json
-- Model: /home/supermicro/ai/text-generation-webui-testing/models/llama-7b-4bit/llama-7b-4bit.safetensors
-- Sequence length: 2048
-- Tuning:
-- --matmul_recons_thd: 8
-- --fused_mlp_thd: 2
-- --sdp_thd: 8
-- Options: ['perf']
** Time, Load model: 2.29 seconds
** Time, Load tokenizer: 0.01 seconds
-- Groupsize (inferred): None
-- Act-order (inferred): no
** VRAM, Model: [cuda:0] 3,520.85 MB
-- Warmup pass 1...
** Time, Warmup: 1.32 seconds
-- Warmup pass 2...
** Time, Warmup: 0.35 seconds
-- Warmup pass 3...
** Time, Warmup: 0.35 seconds
-- Inference, first pass.
** Time, Inference: 0.35 seconds
** Speed: 5469.47 tokens/second
-- Generating 128 tokens, 1920 token prompt...
** Speed: 44.28 tokens/second
-- Generating 128 tokens, 4 token prompt...
** Speed: 44.48 tokens/second
** VRAM, Inference: [cuda:0] 1,555.17 MB
** VRAM, Total: [cuda:0] 5,076.02 MB
13b
-- Tokenizer: /home/supermicro/ai/text-generation-webui-testing/models/llama-13b-4bit/tokenizer.model
-- Model config: /home/supermicro/ai/text-generation-webui-testing/models/llama-13b-4bit/config.json
-- Model: /home/supermicro/ai/text-generation-webui-testing/models/llama-13b-4bit/llama-13b-4bit.safetensors
-- Sequence length: 2048
-- Tuning:
-- --matmul_recons_thd: 8
-- --fused_mlp_thd: 2
-- --sdp_thd: 8
-- Options: ['perf']
** Time, Load model: 2.89 seconds
** Time, Load tokenizer: 0.01 seconds
-- Groupsize (inferred): None
-- Act-order (inferred): no
** VRAM, Model: [cuda:0] 6,589.50 MB
-- Warmup pass 1...
** Time, Warmup: 1.71 seconds
-- Warmup pass 2...
** Time, Warmup: 0.63 seconds
-- Warmup pass 3...
** Time, Warmup: 0.63 seconds
-- Inference, first pass.
** Time, Inference: 0.63 seconds
** Speed: 3033.24 tokens/second
-- Generating 128 tokens, 1920 token prompt...
** Speed: 37.00 tokens/second
-- Generating 128 tokens, 4 token prompt...
** Speed: 37.17 tokens/second
** VRAM, Inference: [cuda:0] 2,254.17 MB
** VRAM, Total: [cuda:0] 8,843.67 MB
30b
-- Tokenizer: /home/supermicro/ai/text-generation-webui-testing/models/llama-30b-4bit/tokenizer.model
-- Model config: /home/supermicro/ai/text-generation-webui-testing/models/llama-30b-4bit/config.json
-- Model: /home/supermicro/ai/text-generation-webui-testing/models/llama-30b-4bit/llama-30b-4bit.safetensors
-- Sequence length: 2048
-- Tuning:
-- --matmul_recons_thd: 8
-- --fused_mlp_thd: 2
-- --sdp_thd: 8
-- Options: ['perf']
** Time, Load model: 4.73 seconds
** Time, Load tokenizer: 0.01 seconds
-- Groupsize (inferred): None
-- Act-order (inferred): no
** VRAM, Model: [cuda:0] 16,234.42 MB
-- Warmup pass 1...
** Time, Warmup: 2.61 seconds
-- Warmup pass 2...
** Time, Warmup: 1.46 seconds
-- Warmup pass 3...
** Time, Warmup: 1.47 seconds
-- Inference, first pass.
** Time, Inference: 1.47 seconds
** Speed: 1309.90 tokens/second
-- Generating 128 tokens, 1920 token prompt...
** Speed: 22.07 tokens/second
-- Generating 128 tokens, 4 token prompt...
** Speed: 24.22 tokens/second
** VRAM, Inference: [cuda:0] 3,964.67 MB
** VRAM, Total: [cuda:0] 20,199.09 MB
CUDA_VISIBLE_DEVICES=0 python -m torch.utils.bottleneck test_inference.py
--------------------------------------------------------------------------------
Environment Summary
--------------------------------------------------------------------------------
PyTorch 2.0.1+cu118 DEBUG compiled w/ CUDA 11.8
Running with Python 3.10 and CUDA 11.8.89
`pip3 list` truncated output:
audiolm-pytorch==1.0.6
clip-anytorch==2.5.2
ema-pytorch==0.2.3
lion-pytorch==0.1.2
mypy-extensions==1.0.0
numpy==1.23.5
torch==2.0.1+cu118
torchaudio==2.0.2+cu118
torchcrepe==0.0.18
torchdiffeq==0.2.3
torchsde==0.2.5
torchvision==0.15.2+cu118
vector-quantize-pytorch==1.5.18
--------------------------------------------------------------------------------
cProfile output
--------------------------------------------------------------------------------
3375048 function calls (3279583 primitive calls) in 10.222 seconds
Ordered by: internal time
List reduced from 2529 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
7540 2.150 0.000 2.150 0.000 {method 'to' of 'torch._C._TensorBase' objects}
25824 1.551 0.000 1.551 0.000 {built-in method exllama_ext.q4_matmul}
12800 0.578 0.000 0.578 0.000 {built-in method torch.matmul}
6432 0.542 0.000 4.592 0.001 /home/supermicro/ai/text-generation-webui-testing/repositories/exllama/model.py:273(forward)
90672 0.527 0.000 0.527 0.000 {method 'view' of 'torch._C._TensorBase' objects}
6432 0.334 0.000 5.798 0.001 /home/supermicro/ai/text-generation-webui-testing/repositories/exllama/model.py:363(forward)
6400 0.291 0.000 0.291 0.000 {built-in method exllama_ext.q4_mlp}
26027 0.242 0.000 0.243 0.000 {built-in method torch.empty}
12865 0.218 0.000 0.218 0.000 {method 'copy_' of 'torch._C._TensorBase' objects}
6664 0.216 0.000 0.216 0.000 {built-in method exllama_ext.rms_norm}
12864 0.212 0.000 0.212 0.000 {built-in method exllama_ext.rope_}
25824 0.189 0.000 2.270 0.000 /home/supermicro/ai/text-generation-webui-testing/repositories/exllama/cuda_ext.py:76(ext_q4_matmul)
206 0.165 0.001 0.165 0.001 {method 'read' of '_io.BufferedReader' objects}
25728 0.164 0.000 0.164 0.000 {method 'narrow' of 'torch._C._TensorBase' objects}
739 0.161 0.000 0.181 0.000 {method 'get_tensor' of 'builtins.safe_open' objects}
--------------------------------------------------------------------------------
autograd profiler output (CPU mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
------------------------- ------------ ------------ ------------ ------------ ------------ ------------
aten::to 0.01% 3.000us 82.64% 39.988ms 39.988ms 1
aten::_to_copy 0.04% 19.000us 82.64% 39.985ms 39.985ms 1
aten::copy_ 0.07% 33.000us 78.58% 38.022ms 38.022ms 1
cudaMemcpyAsync 78.51% 37.989ms 78.51% 37.989ms 37.989ms 1
aten::to 0.00% 2.000us 13.92% 6.734ms 6.734ms 1
aten::_to_copy 0.01% 6.000us 13.91% 6.732ms 6.732ms 1
aten::copy_ 0.02% 11.000us 13.88% 6.717ms 6.717ms 1
cudaStreamSynchronize 13.83% 6.692ms 13.83% 6.692ms 6.692ms 1
aten::to 0.00% 2.000us 7.55% 3.655ms 3.655ms 1
aten::_to_copy 0.02% 10.000us 7.55% 3.653ms 3.653ms 1
aten::to 7.43% 3.597ms 7.43% 3.597ms 3.597ms 1
aten::_to_copy 0.02% 10.000us 7.43% 3.596ms 3.596ms 1
aten::to 0.00% 2.000us 7.43% 3.594ms 3.594ms 1
aten::_to_copy 0.02% 9.000us 7.42% 3.592ms 3.592ms 1
aten::to 0.00% 1.000us 7.42% 3.590ms 3.590ms 1
------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 48.386ms
--------------------------------------------------------------------------------
autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
Because the autograd profiler uses the CUDA event API,
the CUDA time column reports approximately max(cuda_time, cpu_time).
Please ignore this output if your code does not use CUDA.
------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::to 0.03% 13.000us 91.25% 38.167ms 38.167ms 9.000us 0.02% 38.210ms 38.210ms 1
aten::_to_copy 0.11% 46.000us 91.21% 38.154ms 38.154ms 19.000us 0.04% 38.201ms 38.201ms 1
aten::copy_ 0.07% 31.000us 91.07% 38.093ms 38.093ms 38.162ms 84.19% 38.162ms 38.162ms 1
cudaMemcpyAsync 90.99% 38.062ms 90.99% 38.062ms 38.062ms 0.000us 0.00% 0.000us 0.000us 1
aten::to 0.03% 11.000us 8.51% 3.558ms 3.558ms 8.000us 0.02% 3.587ms 3.587ms 1
aten::_to_copy 0.06% 27.000us 8.48% 3.547ms 3.547ms 16.000us 0.04% 3.579ms 3.579ms 1
aten::copy_ 0.03% 12.000us 8.40% 3.512ms 3.512ms 3.551ms 7.83% 3.551ms 3.551ms 1
cudaMemcpyAsync 8.37% 3.500ms 8.37% 3.500ms 3.500ms 0.000us 0.00% 0.000us 0.000us 1
aten::to 0.03% 11.000us 8.34% 3.490ms 3.490ms 8.000us 0.02% 3.531ms 3.531ms 1
aten::to 0.03% 11.000us 8.32% 3.481ms 3.481ms 8.000us 0.02% 3.516ms 3.516ms 1
aten::_to_copy 0.07% 28.000us 8.32% 3.479ms 3.479ms 16.000us 0.04% 3.523ms 3.523ms 1
aten::_to_copy 0.06% 27.000us 8.30% 3.470ms 3.470ms 15.000us 0.03% 3.508ms 3.508ms 1
aten::to 0.03% 11.000us 8.29% 3.466ms 3.466ms 7.000us 0.02% 3.508ms 3.508ms 1
aten::_to_copy 0.06% 27.000us 8.26% 3.455ms 3.455ms 15.000us 0.03% 3.501ms 3.501ms 1
aten::copy_ 0.03% 12.000us 8.23% 3.442ms 3.442ms 3.493ms 7.71% 3.493ms 3.493ms 1
------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 41.829ms
Self CUDA time total: 45.327ms
@turboderp I did some extra profiling using the nvidia-smi dmon --gpm-metrics flag, and you can clearly see the difference in utilisation between the latest commit (401fa8) and commit 1c2513 (which has higher performance on the H100 in my case).
I ran nvidia-smi dmon -s p --gpm-metrics=2,3,4,5,7,13, which gives the following readings:
pwr - power consumption
smutil - SM Activity
smocc - SM Occupancy
intutil - Integer Activity
mmaact - Tensor Activity
hmmat - HMMA Tensor Activity
fp16 - FP16 Activity
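To compare runs like these numerically rather than by eye, the dmon samples can be averaged per column. A sketch; `summarize_dmon` is a hypothetical helper that skips the header and unit lines:

```python
def summarize_dmon(lines):
    # Keep only numeric sample rows (skip the header and unit lines),
    # then average each column across the samples.
    rows = [
        [int(tok) for tok in line.split()]
        for line in lines
        if line.strip() and line.split()[0].isdigit()
    ]
    return [sum(col) / len(col) for col in zip(*rows)]

sample = """\
pwr smutil smocc intutil mmaact hmmat fp16
W GPM:% GPM:% GPM:% GPM:% GPM:% GPM:%
290 85 36 53 1 1 5
288 80 33 41 10 10 4"""
print(summarize_dmon(sample.splitlines()))  # [289.0, 82.5, 34.5, 47.0, 5.5, 5.5, 4.5]
```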
run on commit 401fa8 ("slower")
pwr smutil smocc intutil mmaact hmmat fp16
W GPM:% GPM:% GPM:% GPM:% GPM:% GPM:%
91 0 0 0 0 0 0
195 0 0 0 0 0 0
253 42 17 17 9 9 1
288 69 30 43 1 1 4
290 85 36 53 1 1 5
289 85 37 53 1 1 5
291 85 37 53 1 1 5
276 85 37 53 1 1 5
287 85 37 53 1 1 5
290 80 33 41 10 10 4
291 85 36 53 1 1 5
291 85 36 53 1 1 5
292 85 37 53 1 1 5
292 85 37 53 1 1 5
288 80 33 41 10 10 4
289 85 36 53 1 1 5
290 85 36 53 1 1 5
291 85 36 53 1 1 5
290 85 37 53 1 1 5
291 85 37 53 1 1 5
281 85 37 53 1 1 5
304 82 35 51 1 1 5
290 84 34 43 10 10 4
291 85 36 53 1 1 5
291 85 36 53 1 1 5
293 85 37 53 1 1 5
293 85 37 53 1 1 5
285 80 33 41 10 10 4
290 80 33 41 10 10 4
291 85 36 53 1 1 5
291 85 36 53 1 1 5
292 85 37 53 1 1 5
293 85 37 53 1 1 5
240 85 37 53 1 1 5
94 72 31 45 1 1 4
93 0 0 0 0 0 0
run on commit 1c2513 ("faster")
pwr smutil smocc intutil mmaact hmmat fp16
W GPM:% GPM:% GPM:% GPM:% GPM:% GPM:%
87 0 0 0 0 0 0
87 0 0 0 0 0 0
230 54 22 19 10 10 3
250 58 22 30 1 1 5
295 78 30 40 2 2 7
295 78 30 40 2 2 7
298 78 30 40 2 2 7
282 78 30 40 2 2 7
306 74 28 38 2 2 7
298 81 32 34 10 10 6
298 78 29 40 2 2 7
301 78 30 40 2 2 7
310 78 30 40 2 2 7
296 75 30 34 7 7 6
301 80 31 38 5 5 7
301 78 30 40 2 2 7
303 78 30 40 2 2 7
304 78 30 40 2 2 7
296 77 31 31 10 10 5
300 77 29 40 2 2 7
303 78 30 40 2 2 7
300 78 30 40 2 2 7
297 78 30 40 2 2 7
299 76 30 31 10 10 5
298 78 29 40 2 2 7
299 78 30 40 2 2 7
300 78 30 40 2 2 7
131 78 30 40 2 2 7
92 16 6 8 0 0 1
92 0 0 0 0 0 0
The "faster" commit has higher FP16 and tensor activity, as well as lower integer activity and SM utilization/occupancy. Does this give you any insight into the issue?
There is something fishy going on for sure. Higher SM utilization is usually a good thing, so the slower commit is apparently doing extra work for some reason...? Higher GPU power consumption too. I'm very baffled.
But I'm going to be adding CUDA graphs soon, and that may completely change how the H100 reacts to the model. So it's probably best to wait and see what happens with that first.
@dvoidus : Well, I've put graphs on hold for now, because it turns out there's too much overhead per graph launch for it to be beneficial until I compile basically the whole decoder block into a single graph, and that's a huge rewrite. However, tracing showed that there was actually a lot more overhead from PyTorch than I was expecting, so I've been addressing that instead.
I'd be curious how the latest commit performs on the H100.
I tried, but I still see the same speeds (27 t/s in the worst case). The only reproducible difference is that context inference speed increased from 3500 t/s to 4500 t/s. For reference, the "fastest" commit (bec6c9) gives 37 t/s, and 3300 t/s for context.
BTW, on the 4090 it works without degradation and is actually faster (as everyone else reported): 40 t/s worst case and 3000 t/s for context.
I did a test on the latest commit (77545c) and on bec6c9 on the H100 with a 30B model, and I can see consistent performance degradation.
Thoughts?