starpu-runtime / starpu

This is a mirror of https://gitlab.inria.fr/starpu/starpu where our development happens, but contributions are welcome here too!
https://starpu.gitlabpages.inria.fr/
GNU Lesser General Public License v2.1

Drastic performance degradation when switching from StarPU-1.3.11 to StarPU-1.4.4 on a GPU node #33

Open Muxas opened 7 months ago

Muxas commented 7 months ago

The Issue

On a GPU node, switching from StarPU version 1.3.11 to the 1.4 series causes a strange performance drop. For our new software NNTile it results in a 10x slowdown: performance goes from 100% down to only 10%.

An attempt to switch to the master branch (commit 50cf74508 in the Inria GitLab repository) leads to different errors, related to data transfers between CPU and GPU. We tried some other commits from the master branch and realized that they only work on CPU; something strange happens in the memory manager as soon as GPUs are involved. The DARTS scheduler always fails, while the DM and DMDA schedulers fail for some commits (e.g., 50cf74508) and work correctly for others (e.g., 2b8a91fe). I cannot present the output of the master-branch experiments right now, as this issue is about the performance degradation of the 1.4 series of StarPU releases.

Although the 10x performance drop happens in our new software, I prepared a simple example that compares versions 1.2.10, 1.3.11 and 1.4.4. For this simple example, most of the performance drop happens when switching from 1.2.10 to 1.3.11.

Steps to reproduce

I have implemented a simple test, https://github.com/Muxas/starpu_gemm_redux, to reproduce the issue. The repo simply implements several chains of matrix multiplications:

C[i] = A[i][0]*B[i][0] + A[i][1]*B[i][1] + ... +A[i][NB-1]*B[i][NB-1]

for i = 0, 1, ..., D-1,

which can be described by the following C code (the first task submission order):

for(int r = 0; r < R; ++r) // Number of repeats
{
    for(int i = 0; i < NB; ++i) // Number of A and B matrices in each chain of matrix multiplications
    {
        for(int j = 0; j < D; ++j) // Number of output C matrices
        {
            starpu_task_insert(&gemm_cl, STARPU_R, A[i*D+j], STARPU_R, B[i*D+j],
                    C_mode, C[j], 0);
        }
    }
}

or by the following C code (the second task submission order):

for(int r = 0; r < R; ++r) // Number of repeats
{
    for(int j = 0; j < D; ++j) // Number of output C matrices
    {
        for(int i = 0; i < NB; ++i) // Number of A and B matrices in each chain of matrix multiplications
        {
            starpu_task_insert(&gemm_cl, STARPU_R, A[i*D+j], STARPU_R, B[i*D+j],
                    C_mode, C[j], 0);
        }
    }
}

Matrices A are of size M-by-K, matrices B are of size K-by-N and matrices C are of size M-by-N. No transpositions in matrix multiplications.
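
For reference, a minimal sketch of what the gemm codelet might look like; the actual codelet lives in the linked repository (and the thread below mentions cublasGemmEx), so the names and calls here are illustrative only:

#include <starpu.h>
#include <starpu_cublas_v2.h>
#include <cublas_v2.h>

/* Hypothetical CUDA implementation: C += A*B using the worker's cuBLAS handle,
   which StarPU has already bound to the worker's CUDA stream. */
static void gemm_cuda(void *buffers[], void *cl_arg)
{
    (void)cl_arg;
    const float *A = (const float *)STARPU_MATRIX_GET_PTR(buffers[0]);
    const float *B = (const float *)STARPU_MATRIX_GET_PTR(buffers[1]);
    float *C = (float *)STARPU_MATRIX_GET_PTR(buffers[2]);
    int M = STARPU_MATRIX_GET_NX(buffers[0]); /* A is M-by-K */
    int K = STARPU_MATRIX_GET_NY(buffers[0]);
    int N = STARPU_MATRIX_GET_NY(buffers[1]); /* B is K-by-N */
    int ldA = STARPU_MATRIX_GET_LD(buffers[0]);
    int ldB = STARPU_MATRIX_GET_LD(buffers[1]);
    int ldC = STARPU_MATRIX_GET_LD(buffers[2]);
    const float alpha = 1.0f, beta = 1.0f; /* accumulate into C */
    cublasSgemm(starpu_cublas_get_local_handle(),
                CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                &alpha, A, ldA, B, ldB, &beta, C, ldC);
}

static struct starpu_codelet gemm_cl =
{
    .cuda_funcs = { gemm_cuda },
    .cuda_flags = { STARPU_CUDA_ASYNC }, /* kernels are submitted asynchronously */
    .nbuffers = 3,                       /* A, B, C; access modes are passed to starpu_task_insert() */
};

With STARPU_REDUX for the matrices C, the application additionally has to register initialization and reduction codelets for the C handles via starpu_data_set_reduction_methods().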

Our results are produced on an HGX node with 8 (eight) Nvidia A100 80GB SXM GPUs. We compiled the code and ran two experimental setups:

  1. M=N=K=1024, D=32, NB=100, R=50, with and without the STARPU_REDUX access mode for the matrices C.
  2. M=256, N=K=1532, D=32, NB=100, R=50, with and without the STARPU_REDUX access mode for the matrices C.

StarPU-1.4.4 behavior

This section presents plots for StarPU-1.4.4. The first plot shows the warmup time (done with the first task submission order), the time for the first task submission order, and the time for the second order, all with the STARPU_RW|STARPU_COMMUTE access mode for the matrices C and M=N=K=1024:

[plot: 1024_1024_1024_mode0]

The second plot shows the same timings but for the STARPU_REDUX access mode for the matrices C:

[plot: 1024_1024_1024_mode1]

The third plot shows timings for M=256 and N=K=1532 with STARPU_RW|STARPU_COMMUTE access mode:

[plot: 256_1536_1536_mode0]

And the last plot in this section (for the STARPU_REDUX access mode):

[plot: 256_1536_1536_mode1]

We see that the dumbest scheduling algorithm, eager, outperforms the smarter ones.

StarPU-1.3.11 behavior

This section presents plots for StarPU 1.3.11, in the same order as above.

[plot: 1.3.11-1024_1024_1024_mode0]

[plot: 1.3.11-1024_1024_1024_mode1]

[plot: 1.3.11-256_1536_1536_mode0]

[plot: 1.3.11-256_1536_1536_mode1]

We see that the dumbest scheduling algorithm, eager, outperforms the smarter ones.

StarPU-1.2.10 behavior

This section presents plots for StarPU 1.2.10, in the same order as above.

[plot: 1.2.10-1024_1024_1024_mode0]

[plot: 1.2.10-1024_1024_1024_mode1]

[plot: 1.2.10-256_1536_1536_mode0]

[plot: 1.2.10-256_1536_1536_mode1]

Here we see that, with the STARPU_RW|STARPU_COMMUTE access mode, the smart schedulers DMDA and DMDAR perform nearly perfectly, just like EAGER. The problem with DMDA and DMDAR appears when switching to StarPU 1.3.11 or 1.4.4.

Configuration

The configure line we used can be found in the config.log files in the section below.

Configuration result

This is a config file for StarPU-1.2.10: config-1.2.10.log

This is a config file for StarPU-1.3.11: config-1.3.11.log

This is a config file for StarPU-1.4.4: config-1.4.4.log

Distribution

Inria Gitlab repository

Version of StarPU

We used the starpu-1.3.11 and starpu-1.4.4 tags of the Inria GitLab repository.

Version of GPU drivers

We use CUDA 12.3 and hwloc 2.9.3.

sthibaul commented 6 months ago

configure just finds hwloc-calc from $PATH

Muxas commented 6 months ago

Recompiling everything from source (except cuBLAS) on a compute node leads to very strange performance of cublasGemmEx on a server with A100 GPUs. StarPU-1.4 compiled on the compute node:

# hash      size        flops       mean (us or J)  dev (us or J)   sum     sum2        n
0a5b7fae    1971322880      2.147484e+12    2.185368e+04    4.096205e+03    2.141661e+06    4.844750e+10    98
146137f2    234881024       1.717987e+11    1.011980e+04    2.119861e+03    1.042339e+07    1.101112e+11    1030
22528761    272629760       2.147484e+11    9.720931e+03    2.044775e+03    6.126130e+07    6.218662e+11    6302

StarPU-1.3 compiled on the host (login) node:

# hash      size        flops       mean (us)   dev (us)    sum     sum2        n
0a5b7fae    1971322880      2.147484e+12    1.667207e+04    8.012294e+02    4.396424e+07    7.346677e+11    2637
146137f2    234881024       1.717987e+11    1.350489e+03    8.248709e+01    2.934882e+07    3.978311e+10    21732
22528761    272629760       2.147484e+11    1.703173e+03    8.857789e+01    2.417620e+08    4.128764e+11    141948

Same device, but the performance of the 4096x5120 by 5120x5120 matrix multiplications (hash 22528761) differs by roughly a factor of 6 (mean 9.72e+03 µs vs 1.70e+03 µs).

sthibaul commented 6 months ago

It's hard to comment on this without seeing what is happening around, such as with a paje trace. You can also try to cudaStreamSynchronize() in your kernel and drop the STARPU_CUDA_ASYNC flag, to check whether the kernel itself behaves differently. Possibly you end up using cublasv1 vs cublasv2 or such difference in implementation.
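
A minimal sketch of this check, reusing the hypothetical gemm_cuda from the sketch earlier in this issue:

#include <starpu.h>
#include <starpu_cuda.h>
#include <cuda_runtime.h>

/* Synchronous variant of the gemm entry point: launch as before, then wait
   on the worker's CUDA stream inside the kernel function. */
static void gemm_cuda_sync(void *buffers[], void *cl_arg)
{
    gemm_cuda(buffers, cl_arg);                             /* launch as usual */
    cudaStreamSynchronize(starpu_cuda_get_local_stream());  /* wait here */
}

/* ...and in the codelet definition, drop the asynchronous flag:
       .cuda_funcs = { gemm_cuda_sync },
       .cuda_flags = { 0 },
*/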

Muxas commented 6 months ago

It's hard to comment on this without seeing what is happening around, such as with a paje trace.

Here it is paje.trace.tar.gz

Possibly you end up using cublasv1 vs cublasv2 or such difference in implementation.

I explicitly include <cublas_v2.h> and <starpu_cublas_v2.h>. And I am using CUDA 12, so using cublas_v1 and cublas_v2 simultaneously is impossible.

Muxas commented 6 months ago

And, for reference, here is a paje.trace for the access-node-compiled StarPU-1.4 (libnvidia-ml disabled): host.paje.trace.tar.gz. Computation performance is around 3 times better.

sthibaul commented 6 months ago

One thing I notice in the compute-node-built case is that there are a lot of 4µs "overhead" states here and there in the trace, on the lower part of the T3 bars (below the CUDA bars), which represent the state of the thread driving the GPU. These don't show up in the access-node-compiled case. I guess that could be some CUDA operation, perhaps triggered by the presence of nvidia-ml, which for some reason takes a lot of time. Could you post the config.log obtained on compute-node-built and access-node-built so we make sure to know what compilation difference there is?

Also, I notice that you have different-but-quite-close data sizes, and a lot of allocating/freeing states. You probably want to round allocations up to e.g. 10% of your data size, so that StarPU can reuse data allocations rather than freeing/allocating all the time; that will avoid a lot of synchronizations. I have just added a FAQ about it at https://gitlab.inria.fr/starpu/starpu/-/blob/master/doc/doxygen/chapters/starpu_faq/check_list_performance.doxy#L62

Muxas commented 6 months ago

Also, I notice that you have different-but-quite-close data sizes

I have a pipeline of computations. The pipeline operates on tiles of shapes 1, 4096, 5120, 4096x5120, 5120x5120, and 4096x51200. The sizes never change. The shapes 4096x5120 and 5120x5120 are indeed close, but does data allocation reuse require ALL tiles to be of the same shape? That would be strange.

Muxas commented 6 months ago

Could you post the config.log obtained on compute-node-built and access-node-built so we make sure to know what compilation difference there is?

Host (login node): config_host.log

Compute node: config_compute.log

Muxas commented 6 months ago

You can also try to cudaStreamSynchronize() in your kernel and drop the STARPU_CUDA_ASYNC flag, to check whether the kernel itself behaves differently.

Adding such a sync only to a single gemm kernel did not change the picture much:

# hash      size        flops       mean (us or J)  dev (us or J)   sum     sum2        n
0a5b7fae    1971322880      2.147484e+12    1.969021e+04    1.245154e+03    1.988711e+06    3.931472e+10    101
146137f2    234881024       1.717987e+11    5.597694e+03    6.308121e+02    4.466960e+06    2.532222e+10    798
22528761    272629760       2.147484e+11    5.917192e+03    6.071566e+02    2.884039e+07    1.724509e+11    4874

Without the sync:

# hash      size        flops       mean (us or J)  dev (us or J)   sum     sum2        n
0a5b7fae    1971322880      2.147484e+12    2.185368e+04    4.096205e+03    2.141661e+06    4.844750e+10    98
146137f2    234881024       1.717987e+11    1.011980e+04    2.119861e+03    1.042339e+07    1.101112e+11    1030
22528761    272629760       2.147484e+11    9.720931e+03    2.044775e+03    6.126130e+07    6.218662e+11    6302

Yes, performance roughly doubled, but it is still far from the performance of StarPU-1.3 compiled on the access (login) node:

# hash      size        flops       mean (us)   dev (us)    sum     sum2        n
0a5b7fae    1971322880      2.147484e+12    1.667207e+04    8.012294e+02    4.396424e+07    7.346677e+11    2637
146137f2    234881024       1.717987e+11    1.350489e+03    8.248709e+01    2.934882e+07    3.978311e+10    21732
22528761    272629760       2.147484e+11    1.703173e+03    8.857789e+01    2.417620e+08    4.128764e+11    141948

Muxas commented 6 months ago

Actually, as you can see, there are only 3 different hashes of the gemm kernel. My tiles are really mostly 4096x5120 and 5120x5120.

sthibaul commented 6 months ago

Could you post the config.log obtained on compute-node-built and access-node-built so we make sure to know what compilation difference there is?

Host (login node): config_host.log

Compute node: config_compute.log

Thanks! Do you have STARPU_PROFILING enabled? I realize that nvmlDeviceGetTotalEnergyConsumption is actually terribly expensive, and is a difference between the two configs. In master, 1.3 and 1.4 branches I have now disabled its use by default.

sthibaul commented 6 months ago

You can also try to cudaStreamSynchronize() in your kernel and drop the STARPU_CUDA_ASYNC flag, to check whether the kernel itself behaves differently.

Adding such a sync only to a single gemm kernel did not change the picture much:

# hash        size        flops       mean (us or J)  dev (us or J)   sum     sum2        n
0a5b7fae  1971322880      2.147484e+12    1.969021e+04    1.245154e+03    1.988711e+06    3.931472e+10    101
146137f2  234881024       1.717987e+11    5.597694e+03    6.308121e+02    4.466960e+06    2.532222e+10    798
22528761  272629760       2.147484e+11    5.917192e+03    6.071566e+02    2.884039e+07    1.724509e+11    4874

Without the sync:

# hash        size        flops       mean (us or J)  dev (us or J)   sum     sum2        n
0a5b7fae  1971322880      2.147484e+12    2.185368e+04    4.096205e+03    2.141661e+06    4.844750e+10    98
146137f2  234881024       1.717987e+11    1.011980e+04    2.119861e+03    1.042339e+07    1.101112e+11    1030
22528761  272629760       2.147484e+11    9.720931e+03    2.044775e+03    6.126130e+07    6.218662e+11    6302

Yes, performance roughly doubled, but it is still far from the performance of StarPU-1.3 compiled on the access (login) node:

# hash        size        flops       mean (us)   dev (us)    sum     sum2        n
0a5b7fae  1971322880      2.147484e+12    1.667207e+04    8.012294e+02    4.396424e+07    7.346677e+11    2637
146137f2  234881024       1.717987e+11    1.350489e+03    8.248709e+01    2.934882e+07    3.978311e+10    21732
22528761  272629760       2.147484e+11    1.703173e+03    8.857789e+01    2.417620e+08    4.128764e+11    141948

I wasn't really expecting a performance increase, but rather more stable measurements. The deviation is really large; the nvmlDeviceGetTotalEnergyConsumption calls could explain that.

sthibaul commented 6 months ago

Also, I notice that you have different-but-quite-close data sizes

I have a pipeline of computations. The pipeline operates on tiles of shapes 1, 4096, 5120, 4096x5120, 5120x5120, and 4096x51200. The sizes never change. The shapes 4096x5120 and 5120x5120 are indeed close, but does data allocation reuse require ALL tiles to be of the same shape? That would be strange.

No, but one cannot directly reuse an allocation for a different tile size, so if the global ratios of the different data shapes vary over the course of the workload, StarPU has to free/allocate to cope with the new ratios. That can explain the amount of reallocation. You may want to try to set STARPU_MINIMUM_AVAILABLE_MEM=5 and STARPU_TARGET_AVAILABLE_MEM=10, so that memory is immediately available for different shapes, rather than evicting data at the last minute. You might end up with prefetching fighting with pre-eviction when there are a lot of ready tasks, but I'm interested in seeing the eventual trace.
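
For reference, a minimal sketch of setting these two knobs from the application itself (an equivalent shell export before launching works just as well); the values are the ones suggested above:

#include <stdlib.h>
#include <starpu.h>

int main(void)
{
    /* Keep at least 5% of each memory node available and, when reclaiming,
       evict until 10% is available, instead of evicting at the last minute. */
    setenv("STARPU_MINIMUM_AVAILABLE_MEM", "5", 1);
    setenv("STARPU_TARGET_AVAILABLE_MEM", "10", 1);

    if (starpu_init(NULL) != 0)
        return 1;
    /* ... register data and submit tasks ... */
    starpu_shutdown();
    return 0;
}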

Muxas commented 6 months ago

Do you have STARPU_PROFILING enabled?

Yes, it is nearly always on. STARPU_BUS_STATS=1 is ignored without overall profiling.
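
For completeness, a minimal sketch of how profiling and bus statistics can be enabled together (exporting STARPU_PROFILING=1 and STARPU_BUS_STATS=1 in the shell is equivalent); the helper name is hypothetical:

#include <stdlib.h>
#include <starpu.h>

/* Enable overall profiling plus bus transfer statistics before init; the bus
   stats shown later in this thread are printed at starpu_shutdown() time. */
static int init_with_profiling(void)
{
    setenv("STARPU_PROFILING", "1", 1); /* overall profiling, read at init */
    setenv("STARPU_BUS_STATS", "1", 1); /* needs profiling to be enabled */
    return starpu_init(NULL);
}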

In master, 1.3 and 1.4 branches I have now disabled its use by default.

I tried the latest starpu-1.4 commit and confirm that the performance model is now in good shape:

# hash      size        flops       mean (us or J)  dev (us or J)   sum     sum2        n
22528761    272629760       2.147484e+11    1.638281e+03    9.637613e+01    3.550647e+07    5.837090e+10    21673
146137f2    234881024       1.717987e+11    1.282526e+03    1.013623e+02    3.500014e+06    4.516898e+09    2729
0a5b7fae    1971322880      2.147484e+12    1.564672e+04    1.066607e+03    1.179762e+07    1.854519e+11    754

Before, it was:

# hash      size        flops       mean (us or J)  dev (us or J)   sum     sum2        n
0a5b7fae    1971322880      2.147484e+12    1.969021e+04    1.245154e+03    1.988711e+06    3.931472e+10    101
146137f2    234881024       1.717987e+11    5.597694e+03    6.308121e+02    4.466960e+06    2.532222e+10    798
22528761    272629760       2.147484e+11    5.917192e+03    6.071566e+02    2.884039e+07    1.724509e+11    4874

Now we are back to fighting the scheduler, which transfers more data than the StarPU-1.3 version does.

Muxas commented 6 months ago

You may want to try to set STARPU_MINIMUM_AVAILABLE_MEM=5 and STARPU_TARGET_AVAILABLE_MEM=10, so that memory is immediately available for different shapes, rather than evicting data at the last minute.

Preliminary tests last week with these environment variables did not bring us any performance improvement. I will give it another try.

sthibaul commented 6 months ago

You may want to try to set STARPU_MINIMUM_AVAILABLE_MEM=5 and STARPU_TARGET_AVAILABLE_MEM=10, so that memory is immediately available for different shapes, rather than evicting data at the last minute.

Preliminary tests last week with these environment variables did not bring us any performance improvement. I will give it another try.

At the beginning of the execution the prefetch probably fights with eviction, which loses time, but we'd want to fix that at some point. I'm interested in what happens later in the execution, when there are far fewer ready tasks and thus much less prefetching and no fight; there we could hope for much less last-minute write-back.

Muxas commented 6 months ago

thus much less prefetching and no fight; there we could hope for much less last-minute write-back

That is why I wonder in #35 whether there is a way to tell StarPU that a given handle can be assumed "dirty" from now on, without releasing and later reallocating its buffers the way starpu_data_invalidate_submit does. I, the programmer, guarantee that the next use of the handle will be in STARPU_W mode in the next iteration. I believe this hint would help StarPU reduce the amount of transferred data by a lot for my application of training neural networks.
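
For concreteness, a minimal sketch of the existing hint being discussed (end_of_iteration and activations are hypothetical names):

#include <starpu.h>

/* Submit invalidations for temporaries whose contents are no longer needed;
   their next use will be in STARPU_W mode. starpu_data_invalidate_submit()
   also drops the per-node buffers, which is the part the requested hint
   would avoid. */
static void end_of_iteration(starpu_data_handle_t activations[], int n)
{
    for (int i = 0; i < n; ++i)
        starpu_data_invalidate_submit(activations[i]);
}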

Muxas commented 6 months ago

You may want to try to set STARPU_MINIMUM_AVAILABLE_MEM=5 and STARPU_TARGET_AVAILABLE_MEM=10, so that memory is immediately available for different shapes, rather than evicting data at the last minute.

Setting these parameters triggered a data race. The trace is attached: datarace.paje.trace.tar.gz

Muxas commented 6 months ago

You may want to try to set STARPU_MINIMUM_AVAILABLE_MEM=5 and STARPU_TARGET_AVAILABLE_MEM=10, so that memory is immediately available for different shapes, rather than evicting data at the last minute.

Besides sometimes triggering the watchdog, this change did not help increase performance with StarPU-1.4. As of now, the performance of my app with a single data-parallel track on a single GPU reaches 100 Tflops/s. When I switch to 4 independent data-parallel tracks on 4 GPUs, performance goes up to 360 Tflops/s with StarPU-1.3 but remains at 100 Tflops/s with StarPU-1.4. For some strange reason StarPU-1.4 transfers much more data through the slow CPU-GPU PCI-Express bus instead of the fast SXM4 (NVLink) links. I cannot believe this is only due to the scheduling technique. Maybe there is double prefetching of the same buffer? Since the issue with NUMA indexing is solved and the performance of StarPU-1.4 is still much lower than that of StarPU-1.3, I would like to continue the search. I could send traces, but they weigh around 1.8 GB each.

StarPU-1.4 (commit 159175aee64b7fa89f70b2ad6045d657fff1dc1a of gitlab):

Training performance: 106.04774427457744 Tflops/s
#---------------------
Data transfer stats:
    NUMA 0 -> CUDA 0    3512.1792 GB    5940.0140 MB/s  (transfers : 58013 - avg 61.9942 MB)
    CUDA 0 -> NUMA 0    378.2384 GB 639.7000 MB/s   (transfers : 10657 - avg 36.3438 MB)
    NUMA 0 -> CUDA 1    4125.6567 GB    6977.5645 MB/s  (transfers : 52147 - avg 81.0147 MB)
    CUDA 1 -> NUMA 0    641.2321 GB 1084.4912 MB/s  (transfers : 9130 - avg 71.9191 MB)
    CUDA 0 -> CUDA 1    2449.2275 GB    4142.2842 MB/s  (transfers : 38293 - avg 65.4952 MB)
    CUDA 1 -> CUDA 0    2818.3569 GB    4766.5784 MB/s  (transfers : 40335 - avg 71.5507 MB)
    NUMA 0 -> CUDA 2    3782.2424 GB    6396.7606 MB/s  (transfers : 51875 - avg 74.6606 MB)
    CUDA 2 -> NUMA 0    278.6027 GB 471.1899 MB/s   (transfers : 8306 - avg 34.3474 MB)
    CUDA 0 -> CUDA 2    2652.0833 GB    4485.3659 MB/s  (transfers : 40200 - avg 67.5556 MB)
    CUDA 2 -> CUDA 0    3210.4692 GB    5429.7424 MB/s  (transfers : 44222 - avg 74.3413 MB)
    CUDA 1 -> CUDA 2    2169.9846 GB    3670.0110 MB/s  (transfers : 32642 - avg 68.0738 MB)
    CUDA 2 -> CUDA 1    2613.1143 GB    4419.4589 MB/s  (transfers : 37885 - avg 70.6303 MB)
    NUMA 0 -> CUDA 3    4245.4556 GB    7180.1744 MB/s  (transfers : 52158 - avg 83.3496 MB)
    CUDA 3 -> NUMA 0    574.7733 GB 972.0917 MB/s   (transfers : 8547 - avg 68.8625 MB)
    CUDA 0 -> CUDA 3    2453.2351 GB    4149.0613 MB/s  (transfers : 37210 - avg 67.5118 MB)
    CUDA 3 -> CUDA 0    2310.4956 GB    3907.6514 MB/s  (transfers : 34640 - avg 68.3010 MB)
    CUDA 1 -> CUDA 3    2358.8792 GB    3989.4806 MB/s  (transfers : 35530 - avg 67.9846 MB)
    CUDA 3 -> CUDA 1    1971.4827 GB    3334.2919 MB/s  (transfers : 31633 - avg 63.8194 MB)
    CUDA 2 -> CUDA 3    2139.7852 GB    3618.9353 MB/s  (transfers : 31820 - avg 68.8605 MB)
    CUDA 3 -> CUDA 2    2551.2749 GB    4314.8718 MB/s  (transfers : 38917 - avg 67.1302 MB)
Total transfers: 47236.7695 GB

StarPU-1.3 (commit 11699e22f3125723fb475e33797a6dcdaaecb7d7 of gitlab):

Training performance: 326.13827992644457 Tflops/s
#---------------------
Data transfer stats:
    NUMA 0 -> CUDA 0    10.4181 GB  41.3259 MB/s    (transfers : 980 - avg 10.8859 MB)
    CUDA 0 -> NUMA 0    3.0865 GB   12.2433 MB/s    (transfers : 59 - avg 53.5694 MB)
    NUMA 0 -> CUDA 1    11.4906 GB  45.5799 MB/s    (transfers : 434 - avg 27.1114 MB)
    CUDA 1 -> NUMA 0    5.2519 GB   20.8328 MB/s    (transfers : 145 - avg 37.0892 MB)
    CUDA 0 -> CUDA 1    1055.8624 GB    4188.3069 MB/s  (transfers : 19097 - avg 56.6164 MB)
    CUDA 1 -> CUDA 0    1514.0365 GB    6005.7532 MB/s  (transfers : 25104 - avg 61.7580 MB)
    NUMA 0 -> CUDA 2    14.4211 GB  57.2045 MB/s    (transfers : 341 - avg 43.3057 MB)
    CUDA 2 -> NUMA 0    6.2903 GB   24.9518 MB/s    (transfers : 99 - avg 65.0631 MB)
    CUDA 0 -> CUDA 2    1609.3960 GB    6384.0162 MB/s  (transfers : 27502 - avg 59.9237 MB)
    CUDA 2 -> CUDA 0    1697.3870 GB    6733.0511 MB/s  (transfers : 27228 - avg 63.8359 MB)
    CUDA 1 -> CUDA 2    1199.5175 GB    4758.1442 MB/s  (transfers : 21281 - avg 57.7184 MB)
    CUDA 2 -> CUDA 1    1876.4987 GB    7443.5356 MB/s  (transfers : 29418 - avg 65.3183 MB)
    NUMA 0 -> CUDA 3    12.0974 GB  47.9870 MB/s    (transfers : 358 - avg 34.6026 MB)
    CUDA 3 -> NUMA 0    6.4070 GB   25.4149 MB/s    (transfers : 76 - avg 86.3264 MB)
    CUDA 0 -> CUDA 3    1819.2949 GB    7216.6235 MB/s  (transfers : 28441 - avg 65.5025 MB)
    CUDA 3 -> CUDA 0    1214.7228 GB    4818.4582 MB/s  (transfers : 20123 - avg 61.8137 MB)
    CUDA 1 -> CUDA 3    1747.9844 GB    6933.7543 MB/s  (transfers : 28541 - avg 62.7146 MB)
    CUDA 3 -> CUDA 1    1395.9900 GB    5537.4929 MB/s  (transfers : 24528 - avg 58.2801 MB)
    CUDA 2 -> CUDA 3    1206.3433 GB    4785.2183 MB/s  (transfers : 20884 - avg 59.1503 MB)
    CUDA 3 -> CUDA 2    2197.1829 GB    8715.5950 MB/s  (transfers : 33908 - avg 66.3535 MB)
Total transfers: 18603.6797 GB