starpu-runtime / starpu

This is a mirror of https://gitlab.inria.fr/starpu/starpu where our development happens, but contributions are welcome here too!
https://starpu.gitlabpages.inria.fr/
GNU Lesser General Public License v2.1

How to deal with timing spikes of codelets? #50

Closed Muxas closed 3 days ago

Muxas commented 2 months ago

Hi! I have noticed that the performance of a cuBLAS matrix multiplication of a specific size (4096x2560 @ 2560x2560 -> 4096x2560, TF32 type) on a server with 8x H100 GPUs is measured incorrectly by StarPU-1.3.11 in certain circumstances. Judging by a perfmodel file from the .starpu directory, one or two GPUs perform at 280 Tflop/s, while the others perform at 0.5-5 Tflop/s. I have experienced something similar before, but not at such a scale. Due to this behavior, all matrix multiplications are scheduled on the 1-2 GPUs with correct timings and the others simply sit idle. What can influence the measurements? I tried STARPU_WORKERS_NOBIND=1 but it did not help.

Normal execution time: around 190 usec
Abnormal execution time: around 70-100 msec
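As a quick sanity check (not StarPU code), the quoted 280 Tflop/s vs. 0.5-5 Tflop/s figures follow directly from the GEMM flop count for this matrix size and the two timings above:

```python
# Convert the quoted timings into Tflop/s for the
# 4096x2560 @ 2560x2560 TF32 GEMM (2*m*n*k flops).
m, n, k = 4096, 2560, 2560
flops = 2 * m * n * k  # 5.3687e10, matches the "flops" column for hash 8a80bc0d below

def tflops(time_s):
    # Achieved throughput in Tflop/s for one GEMM taking time_s seconds
    return flops / time_s / 1e12

normal = tflops(190e-6)    # ~283 Tflop/s at the normal ~190 us timing
abnormal = tflops(100e-3)  # ~0.54 Tflop/s at the abnormal ~100 ms timing
print(f"normal: {normal:.0f} Tflop/s, abnormal: {abnormal:.2f} Tflop/s")
```

So the perfmodel really is recording the same kernel as roughly 500x slower on the affected GPUs.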

It seems to me that there is a race to acquire some mutex.

Here is an example codelet sampling file:

##################
# Performance Model Version
45

####################
# COMBs
# number of combinations
8
####################
# COMB_1
# number of types devices
1
####################
# DEV_0
# device type (CPU - 0, CUDA - 1, OPENCL - 2, MIC - 3, MPI_MS - 5)
1
####################
# DEV_0
# device id 
0
####################
# DEV_0
# number of cores 
1
##########
# number of implementations
1
#####
# Model for cuda0_impl0 (Comb1)
# number of entries
3
# sumlnx    sumlnx2     sumlny      sumlnxlny   alpha       beta        n   minx        maxx
0.000000e+00    0.000000e+00    0.000000e+00    0.000000e+00    nan             nan             0   0               0              
# a     b       c
nan             nan             nan            
# not multiple-regression-base
0
# hash      size        flops       mean (us)   dev (us)    sum     sum2        n
1cb683be    1405091840      1.073742e+12    3.451158e+03    3.505705e+02    2.101410e+07    7.327131e+10    6089
aa5ef08e    150994944       8.589935e+10    2.299876e+02    4.303181e+01    1.277696e+07    3.041417e+09    55555
8a80bc0d    110100480       5.368709e+10    1.863488e+02    2.769234e+01    2.407540e+08    4.585497e+10    1291954

####################
# COMB_6
# number of types devices
1
####################
# DEV_0
# device type (CPU - 0, CUDA - 1, OPENCL - 2, MIC - 3, MPI_MS - 5)
1
####################
# DEV_0
# device id 
7
####################
# DEV_0
# number of cores 
1
##########
# number of implementations
1
#####
# Model for cuda7_impl0 (Comb6)
# number of entries
3
# sumlnx    sumlnx2     sumlny      sumlnxlny   alpha       beta        n   minx        maxx
0.000000e+00    0.000000e+00    0.000000e+00    0.000000e+00    nan             nan             0   0               0              
# a     b       c
nan             nan             nan            
# not multiple-regression-base
0
# hash      size        flops       mean (us)   dev (us)    sum     sum2        n
1cb683be    1405091840      1.073742e+12    3.435652e+03    3.816336e+02    4.401070e+06    1.530711e+10    1281
aa5ef08e    150994944       8.589935e+10    2.699349e+02    4.810193e+01    4.130004e+05    1.150233e+08    1530
8a80bc0d    110100480       5.368709e+10    1.116287e+05    1.795608e+04    1.339545e+06    1.534007e+11    12

####################
# COMB_2
# number of types devices
1
####################
# DEV_0
# device type (CPU - 0, CUDA - 1, OPENCL - 2, MIC - 3, MPI_MS - 5)
1
####################
# DEV_0
# device id 
2
####################
# DEV_0
# number of cores 
1
##########
# number of implementations
1
#####
# Model for cuda2_impl0 (Comb2)
# number of entries
3
# sumlnx    sumlnx2     sumlny      sumlnxlny   alpha       beta        n   minx        maxx
0.000000e+00    0.000000e+00    0.000000e+00    0.000000e+00    nan             nan             0   0               0              
# a     b       c
nan             nan             nan            
# not multiple-regression-base
0
# hash      size        flops       mean (us)   dev (us)    sum     sum2        n
1cb683be    1405091840      1.073742e+12    3.501562e+03    3.195432e+02    7.265741e+06    2.565331e+10    2075
aa5ef08e    150994944       8.589935e+10    2.608466e+02    4.561439e+01    3.373529e+06    9.068829e+08    12933
8a80bc0d    110100480       5.368709e+10    7.372109e+04    7.005875e+03    8.109320e+05    6.032270e+10    11

####################
# COMB_7
# number of types devices
1
####################
# DEV_0
# device type (CPU - 0, CUDA - 1, OPENCL - 2, MIC - 3, MPI_MS - 5)
1
####################
# DEV_0
# device id 
1
####################
# DEV_0
# number of cores 
1
##########
# number of implementations
1
#####
# Model for cuda1_impl0 (Comb7)
# number of entries
3
# sumlnx    sumlnx2     sumlny      sumlnxlny   alpha       beta        n   minx        maxx
0.000000e+00    0.000000e+00    0.000000e+00    0.000000e+00    nan             nan             0   0               0              
# a     b       c
nan             nan             nan            
# not multiple-regression-base
0
# hash      size        flops       mean (us)   dev (us)    sum     sum2        n
1cb683be    1405091840      1.073742e+12    3.442832e+03    3.767360e+02    1.979973e+07    6.898339e+10    5751
aa5ef08e    150994944       8.589935e+10    2.317106e+02    4.405985e+01    9.333302e+05    2.240819e+08    4028
8a80bc0d    110100480       5.368709e+10    1.001149e+05    1.981605e+04    1.401609e+06    1.458194e+11    14

####################
# COMB_3
# number of types devices
1
####################
# DEV_0
# device type (CPU - 0, CUDA - 1, OPENCL - 2, MIC - 3, MPI_MS - 5)
1
####################
# DEV_0
# device id 
5
####################
# DEV_0
# number of cores 
1
##########
# number of implementations
1
#####
# Model for cuda5_impl0 (Comb3)
# number of entries
3
# sumlnx    sumlnx2     sumlny      sumlnxlny   alpha       beta        n   minx        maxx
0.000000e+00    0.000000e+00    0.000000e+00    0.000000e+00    nan             nan             0   0               0              
# a     b       c
nan             nan             nan            
# not multiple-regression-base
0
# hash      size        flops       mean (us)   dev (us)    sum     sum2        n
1cb683be    1405091840      1.073742e+12    3.484940e+03    3.377815e+02    5.757122e+06    2.025171e+10    1652
aa5ef08e    150994944       8.589935e+10    2.394166e+02    4.286079e+01    9.004460e+06    2.224909e+09    37610
8a80bc0d    110100480       5.368709e+10    1.895066e+02    2.734743e+01    2.083674e+08    4.030932e+10    1099526

####################
# COMB_5
# number of types devices
1
####################
# DEV_0
# device type (CPU - 0, CUDA - 1, OPENCL - 2, MIC - 3, MPI_MS - 5)
1
####################
# DEV_0
# device id 
6
####################
# DEV_0
# number of cores 
1
##########
# number of implementations
1
#####
# Model for cuda6_impl0 (Comb5)
# number of entries
3
# sumlnx    sumlnx2     sumlny      sumlnxlny   alpha       beta        n   minx        maxx
0.000000e+00    0.000000e+00    0.000000e+00    0.000000e+00    nan             nan             0   0               0              
# a     b       c
nan             nan             nan            
# not multiple-regression-base
0
# hash      size        flops       mean (us)   dev (us)    sum     sum2        n
1cb683be    1405091840      1.073742e+12    3.432383e+03    3.814117e+02    2.244779e+06    7.800081e+09    654
aa5ef08e    150994944       8.589935e+10    2.619133e+02    5.099834e+01    2.037686e+05    5.539314e+07    778
8a80bc0d    110100480       5.368709e+10    6.981000e+04    3.536405e+03    8.377200e+05    5.863131e+10    12

####################
# COMB_0
# number of types devices
1
####################
# DEV_0
# device type (CPU - 0, CUDA - 1, OPENCL - 2, MIC - 3, MPI_MS - 5)
1
####################
# DEV_0
# device id 
4
####################
# DEV_0
# number of cores 
1
##########
# number of implementations
1
#####
# Model for cuda4_impl0 (Comb0)
# number of entries
3
# sumlnx    sumlnx2     sumlny      sumlnxlny   alpha       beta        n   minx        maxx
0.000000e+00    0.000000e+00    0.000000e+00    0.000000e+00    nan             nan             0   0               0              
# a     b       c
nan             nan             nan            
# not multiple-regression-base
0
# hash      size        flops       mean (us)   dev (us)    sum     sum2        n
1cb683be    1405091840      1.073742e+12    3.504279e+03    3.266499e+02    2.064721e+07    7.298229e+10    5892
aa5ef08e    150994944       8.589935e+10    2.714276e+02    4.835834e+01    1.412781e+06    3.956396e+08    5205
8a80bc0d    110100480       5.368709e+10    7.524339e+04    9.860543e+03    9.029207e+05    6.910558e+10    12

####################
# COMB_4
# number of types devices
1
####################
# DEV_0
# device type (CPU - 0, CUDA - 1, OPENCL - 2, MIC - 3, MPI_MS - 5)
1
####################
# DEV_0
# device id 
3
####################
# DEV_0
# number of cores 
1
##########
# number of implementations
1
#####
# Model for cuda3_impl0 (Comb4)
# number of entries
3
# sumlnx    sumlnx2     sumlny      sumlnxlny   alpha       beta        n   minx        maxx
0.000000e+00    0.000000e+00    0.000000e+00    0.000000e+00    nan             nan             0   0               0              
# a     b       c
nan             nan             nan            
# not multiple-regression-base
0
# hash      size        flops       mean (us)   dev (us)    sum     sum2        n
1cb683be    1405091840      1.073742e+12    3.448032e+03    3.656675e+02    7.482230e+06    2.608913e+10    2170
aa5ef08e    150994944       8.589935e+10    2.692083e+02    4.260067e+01    7.593020e+06    2.095291e+09    28205
8a80bc0d    110100480       5.368709e+10    8.147988e+04    1.603222e+04    8.962787e+05    7.585604e+10    11
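Each history entry in the file above stores raw accumulators (sum, sum2, n) alongside the derived mean and dev, so the reported numbers can be cross-checked independently. A minimal sketch, assuming the column layout shown in the file (this is not StarPU's own parser):

```python
import math

# One entry copied from COMB_1 above; columns are:
# hash, size, flops, mean (us), dev (us), sum, sum2, n
entry = ("8a80bc0d", 110100480, 5.368709e10,
         1.863488e+02, 2.769234e+01, 2.407540e+08, 4.585497e+10, 1291954)

_, _, _, mean_file, dev_file, s, s2, cnt = entry
mean = s / cnt                         # arithmetic mean of the recorded timings
dev = math.sqrt(s2 / cnt - mean ** 2)  # standard deviation from the raw sums

print(f"mean {mean:.1f} us (file: {mean_file}), dev {dev:.1f} us (file: {dev_file})")
```

Running the same check on the cuda7 entry for that hash (mean 1.116287e+05 us) confirms the ~100 ms spikes really are recorded in the model rather than being a parsing artifact.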

sthibaul commented 3 days ago

We have never seen such a thing. Are you sure it's not the kernel itself that takes that much time?

It could be useful to inspect with gdb or similar what state the threads are usually in on the bogus GPUs, to check what is hanging.

It seems to me that there is a race to acquire some mutex.

That seems very unlikely to me; we have not seen such a thing. Perhaps try the lws scheduler, which really minimizes interaction between workers.

Are you using scheduling contexts?

Muxas commented 3 days ago

We moved from StarPU version 1.3.11 to version 1.4.7 and the issue has gone away. The issue is not related to the CUDA kernel of the codelet, as it simply calls cuBLAS. I believe the time tracking of codelets was substantially reworked in version 1.4 and the issue is no more.

Even aside from this example, I have seen trouble with certain kernels as soon as memory consumption was slightly larger than a GPU could handle (as set by the environment variable). In that case the time grew by 100x, as if the GPU was actually using CPU memory through a unified memory mechanism.

sthibaul commented 3 days ago

I believe the time tracking of codelets was substantially reworked in version 1.4 and the issue is no more

Well, no, the time tracking has not changed at all...