
Drastic performance degradation when switching from StarPU-1.3.11 to StarPU-1.4.4 on a GPU node #33

Open Muxas opened 6 months ago

Muxas commented 6 months ago

The Issue

When switching from StarPU 1.3.11 to the 1.4 series on a GPU node, we experience a strange performance drop. For our new software NNTile it results in a 10x slowdown: yes, it goes from 100% down to only 10%.

Attempting to switch to the master branch (commit 50cf74508 in the Inria GitLab repository) leads to different errors, related to data transfers between the CPU and the GPUs. We tried some other commits from the master branch and realized that they only work on CPUs, while something strange happens in the memory manager on GPU nodes. The DARTS scheduler always fails, while the DM and DMDA schedulers fail for some commits (e.g., 50cf74508) and work correctly for others (e.g., 2b8a91fe). I cannot present the output of the master branch experiments right now, as this issue is about the performance degradation of the 1.4 series of StarPU releases.

Although the 10x performance drop happens in our new software, I prepared a simple example that shows the performance of versions 1.2.10, 1.3.11 and 1.4.4. For the simple example, most of the performance drop happened when switching from version 1.2.10 to 1.3.11.

Steps to reproduce

I have implemented a simple test https://github.com/Muxas/starpu_gemm_redux to reproduce the issue. The repo simply implements several chains of matrix multiplications:

C[i] = A[i][0]*B[i][0] + A[i][1]*B[i][1] + ... + A[i][NB-1]*B[i][NB-1]

for i ranging from 0 to D-1, which can be described by the following C code (the first task submission order):

for(int r = 0; r < R; ++r) // Number of repeats
{
    for(int i = 0; i < NB; ++i) // Number of A and B matrices in each chain of matrix multiplications
    {
        for(int j = 0; j < D; ++j) // Number of output C matrices
        {
            starpu_task_insert(&gemm_cl, STARPU_R, A[i*D+j], STARPU_R, B[i*D+j],
                    C_mode, C[j], 0);
        }
    }
}

or by the following C code (the other task submission order):

for(int r = 0; r < R; ++r) // Number of repeats
{
    for(int j = 0; j < D; ++j) // Number of output C matrices
    {
        for(int i = 0; i < NB; ++i) // Number of A and B matrices in each chain of matrix multiplications
        {
            starpu_task_insert(&gemm_cl, STARPU_R, A[i*D+j], STARPU_R, B[i*D+j],
                    C_mode, C[j], 0);
        }
    }
}

The A matrices are of size M-by-K, the B matrices are K-by-N and the C matrices are M-by-N. No transpositions are used in the matrix multiplications.

Our results were produced on an HGX node with 8 (eight) Nvidia A100 80GB SXM GPUs. We compiled the code and ran two experimental setups:

  1. M=N=K=1024, D=32, NB=100, R=50, with and without the STARPU_REDUX access mode for the C matrices.
  2. M=256, N=K=1532, D=32, NB=100, R=50, with and without the STARPU_REDUX access mode for the C matrices (a sketch of how the two C access modes can be set up follows this list).
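
For reference, here is a minimal sketch of how the two access modes for the C handles can be set up. This is not code from the benchmark repository: the C_ptr pointers and the init_cl/redux_cl reduction codelets (zero-initialization and element-wise accumulation of a tile) are assumptions, assumed to be defined elsewhere.

#include <starpu.h>
#include <stdint.h>

/* Assumed reduction codelets, defined elsewhere */
extern struct starpu_codelet redux_cl, init_cl;

/* Register the C handles and pick their access mode (illustrative only) */
static enum starpu_data_access_mode setup_C_handles(starpu_data_handle_t *C,
        float **C_ptr, int D, int M, int N, int use_redux)
{
    for (int j = 0; j < D; ++j)
    {
        /* M-by-N matrix with leading dimension M */
        starpu_matrix_data_register(&C[j], STARPU_MAIN_RAM,
                (uintptr_t)C_ptr[j], M, M, N, sizeof(float));
        if (use_redux)
            /* STARPU_REDUX requires reduction methods on each handle */
            starpu_data_set_reduction_methods(C[j], &redux_cl, &init_cl);
    }
    return use_redux ? STARPU_REDUX : (STARPU_RW | STARPU_COMMUTE);
}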

StarPU-1.4.4 behavior

This section presents plots for StarPU-1.4.4. The first plot shows the warmup time (performed with the first task submission order), the time for the first submission order, and the time for the other submission order, using the STARPU_RW|STARPU_COMMUTE access mode for the C matrices and M=N=K=1024:

1024_1024_1024_mode0

The second plot shows the same timings but for the STARPU_REDUX access mode for the matrices C:

1024_1024_1024_mode1

The third plot shows timings for M=256 and N=K=1532 with STARPU_RW|STARPU_COMMUTE access mode:

256_1536_1536_mode0

And the last plot in this section (for the STARPU_REDUX access mode):

256_1536_1536_mode1

We see that the most naive scheduling algorithm, eager, outperforms the smarter ones.

StarPU-1.3.11 behavior

This section presents plots for StarPU 1.3.11 in the same order as above.

1 3 11-1024_1024_1024_mode0

1 3 11-1024_1024_1024_mode1

1 3 11-256_1536_1536_mode0

1 3 11-256_1536_1536_mode1

We see that the most naive scheduling algorithm, eager, outperforms the smarter ones.

StarPU-1.2.10 behavior

This section presents plots for StarPU 1.2.10 in the same order as above.

1 2 10-1024_1024_1024_mode0

1 2 10-1024_1024_1024_mode1

1 2 10-256_1536_1536_mode0

1 2 10-256_1536_1536_mode1

Here we see that, with the STARPU_RW|STARPU_COMMUTE access mode, the smart schedulers DMDA and DMDAR perform nearly perfectly, just like EAGER. The problem with DMDA and DMDAR appears when switching to StarPU 1.3.11 or 1.4.4.

Configuration

The configure line we used can be found in the config.log files in the section below.

Configuration result

This is a config file for StarPU-1.2.10: config-1.2.10.log

This is a config file for StarPU-1.3.11: config-1.3.11.log

This is a config file for StarPU-1.4.4: config-1.4.4.log

Distribution

Inria Gitlab repository

Version of StarPU

We used the starpu-1.3.11 and starpu-1.4.4 tags of the Inria GitLab repository.

Version of GPU drivers

We use CUDA 12.3 and hwloc 2.9.3.

Muxas commented 6 months ago

Dear StarPU team,

I think I figured out the reason for the 10x performance drop of my application. I disabled the kernels and printed the bus stats for the 1.3.11 and 1.4.4 versions of StarPU.
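
For reference, the numbers below were printed with STARPU_BUS_STATS=1. A minimal sketch of obtaining a similar per-link summary programmatically (assuming profiling is enabled before any task is submitted; the actual task submission is elided) would be:

#include <starpu.h>

int main(void)
{
    if (starpu_init(NULL) != 0)
        return 1;
    /* Enable profiling so that per-link transfer counters are recorded */
    starpu_profiling_status_set(STARPU_PROFILING_ENABLE);

    /* ... register data and submit the application tasks here ... */

    starpu_task_wait_for_all();
    /* Print the per-link transfer summary, similar to STARPU_BUS_STATS=1 */
    starpu_profiling_bus_helper_display_summary();
    starpu_shutdown();
    return 0;
}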

StarPU-1.3.11:

Total transfers: 29525.0176 GB
Real time including initialization: 4:58.20
Training performance: 1039.465649305438 Tflops/s

StarPU-1.4.4:

Total transfers: 44654.4844 GB
Real time: 17:05.05
Training performance: 58.44558141509535 Tflops/s

For some reason DMDAR and the other DM* schedulers in StarPU-1.4.4 transfer nearly twice as much data. And if I specifically look at the slowest part, namely the PCI-Express connection between the CPU and the GPUs, the new 1.4.4 version sends 65 times more data than the old 1.3.11 version.

Could you please advise whether there is a way in StarPU-1.4.4 to bring this data transfer overhead back down to where it was with version 1.3.11?

I believe there is something wrong with the Memory Manager.

P.S. Enabling CUDA memory map leads to the following error: ../../src/datawizard/copy_driver.c:312: _starpu_driver_copy_data_1_to_1: Assertion `0 && "src_replicate->memory_node == dst_replicate->mapped"' failed.

P.P.S. Using STARPU_REDUX leads to another but similar error. It seems the memory manager is buggy in StarPU 1.4.4.

sthibaul commented 6 months ago

This is unexpected of course :)

Particularly since the 1.3 series introduces heuristics which are precisely meant to improve the overall flow of data.

AIUI, the involved matrices can completely fit in even just one GPU?

Could you also post results with starpu 1.3.0? To make sure whether it's the 1.2->1.3 development that introduced the first regression, or possibly some backports from the 1.4.x series to the 1.3.x series.

Could you also post the output of starpu_machine_display obtained with the different versions, to make sure dmda gets the correct understanding of available PCI bandwidths, gpu placement etc.

Ideally, if you could provide your testcase with an LGPL-2.1+ licence, we could integrate it in our testsuite, and with simulation support we could add non-regression check-up.

Using STARPU_REDUX leads to another but similar error

That's not precise enough for us to be able to act :)

Muxas commented 6 months ago

Output for StarPU-1.3.0.

M=N=K=1024, mode=STARPU_RW|STARPU_COMMUTE 1 3 0-1024_1024_1024_mode0

M=N=K=1024, mode=STARPU_REDUX 1 3 0-1024_1024_1024_mode1

M=256, N=K=1532, mode=STARPU_RW|STARPU_COMMUTE 1 3 0-256_1536_1536_mode0

M=256, N=K=1532, mode=STARPU_REDUX 1 3 0-256_1536_1536_mode1

It seems that 1.3.0 performs similarly to 1.2.10.

Muxas commented 6 months ago

Using STARPU_REDUX leads to another but similar error

That's not precise enough for us to be able to act :)

As soon as we find out the source of the increased timing with the 1.3.11 version, I will prepare a minimal example where CUDA map and STARPU_REDUX lead to internal StarPU assertion failures.

Muxas commented 6 months ago

Comparison of 1.2.10 vs 1.3.0

I believe I misjudged the plots of StarPU-1.3.0. Take a look at the rightmost graphs.

StarPU-1.2.10: 1 2 10-256_1536_1536_mode0 StarPU-1.3.0: 1 3 0-256_1536_1536_mode0

There is a gap between Eager and the other, smarter schedulers, and for StarPU-1.3.5 (below) the gap becomes larger.

Another set of plots, this time for StarPU-1.3.5.

M=N=K=1024, mode=STARPU_RW|STARPU_COMMUTE 1 3 5-1024_1024_1024_mode0

M=N=K=1024, mode=STARPU_REDUX 1 3 5-1024_1024_1024_mode1

M=256, N=K=1532, mode=STARPU_RW|STARPU_COMMUTE 1 3 5-256_1536_1536_mode0

M=256, N=K=1532, mode=STARPU_REDUX 1 3 5-256_1536_1536_mode1

Muxas commented 6 months ago

And another update

This time I tried the other application, NNTile.

I took a look at the data transfers and the total execution time (reported by the /usr/bin/time utility) for different versions. The total amount of transferred data, sorted in descending order:

  1. StarPU-1.2.10: transmitted 64987 GB, execution time 55:32.68 minutes
  2. StarPU-1.4.4: transmitted 56783 GB, execution time 27:25.60 minutes
  3. StarPU-1.3.0: transmitted 29539 GB, execution time 5:36.09 minutes
  4. StarPU-1.3.11: transmitted 29313 GB, execution time 5:54.13 minutes

As one can see, 1.3.x indeed saves a lot of data transfers. However, the 1.4.4 version brings all those transfers back; it seems like some old behavior was reintroduced in the 1.4.x release series. The main problem comes from the CPU<->GPU transfers, as the 1.4.4 version transfers around 65 times more data through the slow PCI-e bus than the 1.3.11 version.

Files

Here are the more detailed transfer reports produced with the STARPU_BUS_STATS=1 environment variable.

transfers_starpu_1.2.10_dmdar.txt

transfers_starpu_1.3.0_dmdar.txt

transfers_starpu_1.3.11_dmdar.txt

transfers_starpu_1.4.4_dmdar.txt

sthibaul commented 6 months ago

Using STARPU_REDUX leads to another but similar error

That's not precise enough for us to be able to act :) As soon as we find out the source of increased timing with 1.3.11 version, I will prepare some minimal example where CUDA map and STARPU_REDUX lead to internal StarPU assertion failures.

I mean: please provide the error message. "similar error" doesn't allow us to have any idea what this is about.

Also, again: Could you also post the output of starpu_machine_display obtained with the different versions, to make sure dmda gets the correct understanding of available PCI bandwidths, gpu placement etc.

Otherwise it's really not surprising that dm* etc. get everything wrong.

I don't have easy access to an 8-gpu machine, so I tried with simulation, and got results where 1.4.4 actually performs better than 1.3.11 and 1.2.10... So I really need details on how things are going on the machine where you can reproduce the issue.

Also, providing us with the .starpu/sampling/bus/ and codelet/45/ files corresponding to the machine would allow me to simulate the exact same architecture, rather than simulating some 8-gpu machine I happened to have access to at some point.

Muxas commented 6 months ago

Also, again: Could you also post the output of starpu_machine_display obtained with the different versions, to make sure dmda gets the correct understanding of available PCI bandwidths, gpu placement etc.

Here are the files: starpu_machine_display_1.2.10.txt starpu_machine_display_1.3.0.txt starpu_machine_display_1.3.11.txt starpu_machine_display_1.4.4.txt

P.S. How can I help you simulate my runs? I compiled StarPU without SimGrid support. Traces produced by FxT weigh more than 1 GB. I do not know whether giving you the contents of codelets/45/ or codelets/44 will help. However, here are the contents of the bus samplings.

bus_stats.tar.gz

sthibaul commented 6 months ago

Here are the files:

Ok, so you have an NVSwitch, which wasn't the case for the machine I was simulating; that can explain why I wasn't seeing the problem.

How can I help you simulate my runs?

By providing the information I'm asking :)

I compiled StarPU without SimGrid support.

Simgrid is only needed for the replay part, not for the calibration part.

Traces by FXT weight more than 1 GB

We don't need traces :)

Do not know if giving you contents of codelets/45/ or codelets/44 will help

Yes, please, to be sure to have the same timings as on your machine.

sthibaul commented 6 months ago

starpu_machine_display_1.4.4.txt

there's one odd thing here compared to the others: CUDA 0 has very low bandwidth, whatever the peer. Is this reproducible when you force bus re-calibration with STARPU_BUS_CALIBRATE=1 ?

Muxas commented 6 months ago

starpu_machine_display_1.4.4.txt

I double checked. It remains the same. CUDA 0 has an 11 GB/s connection to the CPU; the others have 13-15 GB/s. With StarPU-1.3.11 the speeds are around 25 GB/s.

StarPU-1.4.4 bandwidth

bandwidth (MB/s) and latency (us)...

from/to NUMA 0  CUDA 0  CUDA 1  CUDA 2  CUDA 3  CUDA 4  CUDA 5  CUDA 6  CUDA 7  
NUMA 0  0   11625   14683   14671   14659   14661   14598   14587   14588   
CUDA 0  11744   0   14721   14612   14629   14620   14727   14711   14707   
CUDA 1  14661   13621   0   236193  241212  241259  241637  241287  241874  
CUDA 2  14661   13722   243595  0   241024  243733  244475  244209  243717  
CUDA 3  14684   13867   244122  243544  0   241130  243455  244064  244585  
CUDA 4  14607   13908   240379  241467  246133  0   243570  243885  243641  
CUDA 5  13484   15234   241671  241864  243550  247702  0   244909  244375  
CUDA 6  13229   15071   241887  242582  244249  245052  247115  0   244958  
CUDA 7  13528   15133   241470  241637  244368  244738  244376  247771  0

StarPU-1.3.11 bandwidth

bandwidth (MB/s) and latency (us)...

from/to NUMA 0  CUDA 0  CUDA 1  CUDA 2  CUDA 3  CUDA 4  CUDA 5  CUDA 6  CUDA 7  
NUMA 0  0   25081   25169   25150   25160   25094   25097   25086   25091   
CUDA 0  23837   0   237628  245022  244849  243492  244425  244068  244064  
CUDA 1  23837   244489  0   244506  244464  244747  244650  244639  244372  
CUDA 2  23837   242112  248106  0   244695  243829  244250  245212  244295  
CUDA 3  23836   241816  243238  247892  0   244343  244691  244443  244240  
CUDA 4  23829   241359  243141  243036  247535  0   244676  244164  244173  
CUDA 5  23908   241918  241900  243932  244365  247550  0   243878  244382  
CUDA 6  23829   241531  241140  244337  244161  244080  246877  0   244022  
CUDA 7  23830   242094  241616  244295  244201  243430  243710  244042  0

Muxas commented 6 months ago

starpu_machine_display_1.4.4.txt

Looking at latencies of StarPU-1.4.4:

NUMA 0  0   0   10  9   9   10  9   9   9   
CUDA 0  0   0   10  9   9   10  9   9   9   
CUDA 1  12  12  0   14  14  14  14  14  14  
CUDA 2  12  12  14  0   14  13  13  13  13  
CUDA 3  11  12  14  13  0   13  13  13  13  
CUDA 4  12  12  14  14  13  0   14  13  13  
CUDA 5  12  12  13  13  12  12  0   12  12  
CUDA 6  12  11  13  13  13  13  12  0   12  
CUDA 7  12  11  13  13  13  13  12  12  0

StarPU thinks that CUDA 0 uses the same memory as NUMA 0... Surprise!

sthibaul commented 6 months ago

CUDA 0 has 11 GB/s connection to CPU, other have 13-15 GB/s

Not only that, but the GPU-GPU connections are also not getting the NVSwitch speed; that's really odd.

sthibaul commented 6 months ago

StarPU thinks that CUDA 0 uses the same memory, as NUMA 0... Surprise!

The duplicates in the rows and columns and the 0 values in numa0/cuda0 are suspicious indeed.

sthibaul commented 6 months ago

It might be useful to see the config.log output in the 1.4.4 case.

sthibaul commented 6 months ago

StarPU thinks that CUDA 0 uses the same memory, as NUMA 0... Surprise!

The duplicates in the rows and columns and the 0 values in numa0/cuda0 are suspicious indeed.

and I can easily reproduce that here, good

sthibaul commented 6 months ago

(will work on it later next week, though, but at least we have a clear culprit here)

Muxas commented 6 months ago

It might be useful to see the config.log output in the 1.4.4 case.

config-StarPU-1.4.4.log

(will work on it later next week, though, but at least we have a clear culprit here)

Thank you! I will be on vacation next week, but after that I will prepare backtraces of the initially described failed assertions for StarPU-1.4.4:

P.S. Enabling CUDA memory map leads to the following error: ../../src/datawizard/copy_driver.c:312: _starpu_driver_copy_data_1_to_1: Assertion `0 && "src_replicate->memory_node == dst_replicate->mapped"' failed.

P.P.S Using STARPU_REDUX leads to another but similar error. Seems like memory manager is bugged in 1.4.4 StarPU.

By the way, StarPU-1.3.11 gave me a CUDA out-of-memory error with the STARPU_REDUX access mode. Setting STARPU_LIMIT_CUDA_MEM=60000 solved the issue. I will also try to find the situation in which it happens and create a backtrace.

Muxas commented 6 months ago

Compiling StarPU-1.4.4 with the flag --enable-maxnumanodes=1 solves the issue with the latencies and bandwidths, bringing the output of starpu_machine_display in line with version 1.3.11. However, the performance of the actual computations is the same as without the flag. The amount of data transferred is still large, as reported in one of the messages above.

sthibaul commented 5 months ago

Ok, I have pushed a fix for the bandwidth/latency management to gitlab, will appear on github by tomorrow. Now that the scheduler will have the proper values, I'll investigate the large amounts.

Muxas commented 5 months ago

Ok, I have pushed a fix for the bandwidth/latency management to gitlab, will appear on github by tomorrow.

Thank you! I tried the new commit. It fixes the output of starpu_machine_display, but only partially. The throughput between the CPU and the GPUs remains low: around 14 GB/s, as it was with StarPU-1.4.4, whereas StarPU-1.3.11 reaches 25 GB/s. The outputs of starpu_machine_display are attached for the starpu-1.4 and starpu-1.3.11 tags.

starpu_machine_display_1.4.txt

starpu_machine_display_1.3.11.txt

sthibaul commented 5 months ago

It looks like there is an interference between the numa memory pinning and the nvidia memory pinning. I indeed see a small difference on my testbox, that might be emphasized on your box.

Muxas commented 5 months ago

Another update. This time the hardware server is different (4x Nvidia V100 SXM2). For some strange reason the CUDA workers require around 500 microseconds for any (even empty) task. Setting the environment variable STARPU_WORKERS_NOBIND=1 brings this time down to around 5 microseconds for an empty task (which is still large in my opinion), and it improves the overall performance of my application by 2x. Attached is the starpu_machine_display output for this new server (StarPU-1.3.11). Since servers with PCI-Express CUDA GPUs do not suffer from this problem, I believe the problem is in hwloc-2.9.3 around the Nvidia SXM bus.

starpu_machine_display_v100_starpu_1.3.11.txt
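
For context, the per-task overhead above was estimated with empty tasks. A minimal sketch of such a measurement (the empty codelet and the task count are illustrative, not taken from my application) could look like:

#include <starpu.h>
#include <stdio.h>

/* Empty CUDA task function: does nothing, so the measured time is
 * essentially the per-task runtime/driver overhead. */
static void empty_cuda_func(void *buffers[], void *cl_arg)
{
    (void)buffers;
    (void)cl_arg;
}

static struct starpu_codelet empty_cl =
{
    .cuda_funcs = { empty_cuda_func },
    .nbuffers = 0,
};

int main(void)
{
    const int ntasks = 10000;
    if (starpu_init(NULL) != 0)
        return 1;

    double start = starpu_timing_now();   /* in microseconds */
    for (int i = 0; i < ntasks; i++)
        starpu_task_insert(&empty_cl, 0);
    starpu_task_wait_for_all();
    double end = starpu_timing_now();

    /* Average turnaround per empty task across all CUDA workers */
    printf("average time per empty task: %.2f us\n", (end - start) / ntasks);
    starpu_shutdown();
    return 0;
}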

sthibaul commented 5 months ago

For some strange reason CUDA workers require around 500 microseconds for any (even empty) task. Setting environment variable STARPU_WORKERS_NOBIND=1 brings this time down to around 5 microseconds for an empty task

It seems that thread binding got broken in the 1.3 series indeed. I backported some fixes from 1.4, which should fix it (by looking at the pci bus numbers in your v100 case the gpus should be driven from numa0, not 1)

5 microseconds for an empty task (which is still large in my opinion)

The CUDA cost itself is already that order of magnitude, unfortunately.

Since servers with PCI-express CUDA GPUs do not suffer such a problem

They probably have the same binding issue, just with much lower overhead.

sthibaul commented 5 months ago

It looks like there is an interference between the numa memory pinning and the nvidia memory pinning. I indeed see a small difference on my testbox, that might be emphasized on your box.

Ok, there was a typo in starpu-1.3 which didn't pose a problem there, but ended up posing a problem for 1.4, which is why it went unnoticed. This should now be fixed by the "Fix missing pinning memory when benchmarking bus with numa" commit in 1.4 (1.3 was not broken), so the bandwidth numbers should now be fine; could you please check?

Then I'll check the scheduling part

Muxas commented 5 months ago

I tried the new commit in the starpu-1.3 branch and it got even worse, just like in the starpu-1.4.4 case. Take a look at starpu_1.3.11_v100_machine_display.txt

sthibaul commented 5 months ago

Ok, with the update it gained the need for the same fix as in 1.4 ("Fix bus performance models selection"); it should now be fixed.

Muxas commented 5 months ago

New output of starpu_machine_display:

StarPU has found :
32 STARPU_CPU_WORKER workers:
    CPU 0
    CPU 1
    CPU 2
    CPU 3
    CPU 4
    CPU 5
    CPU 6
    CPU 7
    CPU 8
    CPU 9
    CPU 10
    CPU 11
    CPU 12
    CPU 13
    CPU 14
    CPU 15
    CPU 16
    CPU 17
    CPU 18
    CPU 19
    CPU 20
    CPU 21
    CPU 22
    CPU 23
    CPU 24
    CPU 25
    CPU 26
    CPU 27
    CPU 28
    CPU 29
    CPU 30
    CPU 31
4 STARPU_CUDA_WORKER workers:
    CUDA 0.0 (Tesla V100-SXM2-16GB 14.2 GiB 1a:00.0)
    CUDA 1.0 (Tesla V100-SXM2-16GB 14.2 GiB 1c:00.0)
    CUDA 2.0 (Tesla V100-SXM2-16GB 14.2 GiB 1d:00.0)
    CUDA 3.0 (Tesla V100-SXM2-16GB 14.2 GiB 1e:00.0)
No STARPU_OPENCL_WORKER worker

topology ... (hwloc logical indexes)
numa  0 pack  0 core 0  PU 0    CUDA 1.0 (Tesla V100-SXM2-16GB 14.2 GiB 1c:00.0)    
        core 1  PU 1    CUDA 2.0 (Tesla V100-SXM2-16GB 14.2 GiB 1d:00.0)    
        core 2  PU 2    CUDA 3.0 (Tesla V100-SXM2-16GB 14.2 GiB 1e:00.0)    
        core 3  PU 3    CPU 0   
        core 4  PU 4    CPU 1   
        core 5  PU 5    CPU 2   
        core 6  PU 6    CPU 3   
        core 7  PU 7    CPU 4   
        core 8  PU 8    CPU 5   
        core 9  PU 9    CPU 6   
        core 10 PU 10   CPU 7   
        core 11 PU 11   CPU 8   
        core 12 PU 12   CPU 9   
        core 13 PU 13   CPU 10  
        core 14 PU 14   CPU 11  
        core 15 PU 15   CPU 12  
        core 16 PU 16   CPU 13  
        core 17 PU 17   CPU 14  
numa  1 pack  1 core 18 PU 18   CUDA 0.0 (Tesla V100-SXM2-16GB 14.2 GiB 1a:00.0)    
        core 19 PU 19   CPU 15  
        core 20 PU 20   CPU 16  
        core 21 PU 21   CPU 17  
        core 22 PU 22   CPU 18  
        core 23 PU 23   CPU 19  
        core 24 PU 24   CPU 20  
        core 25 PU 25   CPU 21  
        core 26 PU 26   CPU 22  
        core 27 PU 27   CPU 23  
        core 28 PU 28   CPU 24  
        core 29 PU 29   CPU 25  
        core 30 PU 30   CPU 26  
        core 31 PU 31   CPU 27  
        core 32 PU 32   CPU 28  
        core 33 PU 33   CPU 29  
        core 34 PU 34   CPU 30  
        core 35 PU 35   CPU 31  

bandwidth (MB/s) and latency (us)...
from/to NUMA 0  CUDA 0  CUDA 1  CUDA 2  CUDA 3  
NUMA 0  0   12309   12329   12334   12331   
CUDA 0  13092   0   47517   47695   47688   
CUDA 1  13101   47693   0   47699   47704   
CUDA 2  13102   47691   47689   0   47689   
CUDA 3  13101   47694   47698   47694   0   

NUMA 0  0   9   9   9   9   
CUDA 0  9   0   11  11  11  
CUDA 1  9   11  0   11  11  
CUDA 2  9   11  11  0   11  
CUDA 3  9   11  11  11  0   

GPU NUMA in preference order (logical index), host-to-device, device-to-host
CUDA_0   1   0  
CUDA_1   0   1  
CUDA_2   0   1  
CUDA_3   0   1

Muxas commented 5 months ago

It seems like CUDA 0.0 being attached to NUMA 1 is kind of strange.

Muxas commented 5 months ago

I tried recompiling the starpu-1.3 and starpu-1.4 versions and found that the performance of my app on 8x Nvidia A100 with starpu-1.3 has increased, but the performance with starpu-1.4 is still very low (a 10x difference). So there is still some trouble with the starpu-1.4 tag.

The output of starpu_machine_display seems to be correct for the starpu-1.4 tag after all your fixes: starpu_machine_display_1.4.txt

However, the dmdar scheduler transfers different amounts of data (STARPU_BUS_STATS=1 output):

  starpu-1.3: Total transfers: 30162.5039 GB
  starpu-1.4: Total transfers: 56256.7500 GB

These additional transfers in starpu-1.4 overwhelm all the computations. How can I help you dig into this issue?

Muxas commented 5 months ago

StarPU-1.3 bus stats:

#---------------------
Data transfer stats:
    NUMA 0 -> CUDA 0    24.4528 GB  84.9925 MB/s    (transfers : 552 - avg 45.3616 MB)
    CUDA 0 -> NUMA 0    138.6356 GB 481.8670 MB/s   (transfers : 3052 - avg 46.5147 MB)
    NUMA 0 -> CUDA 1    39.5344 GB  137.4128 MB/s   (transfers : 970 - avg 41.7353 MB)
    CUDA 1 -> NUMA 0    113.3956 GB 394.1384 MB/s   (transfers : 2347 - avg 49.4747 MB)
    CUDA 0 -> CUDA 1    324.7757 GB 1128.8490 MB/s  (transfers : 6451 - avg 51.5533 MB)
    CUDA 1 -> CUDA 0    555.4949 GB 1930.7786 MB/s  (transfers : 9414 - avg 60.4235 MB)
    NUMA 0 -> CUDA 2    40.8971 GB  142.1495 MB/s   (transfers : 738 - avg 56.7462 MB)
    CUDA 2 -> NUMA 0    135.2596 GB 470.1326 MB/s   (transfers : 2260 - avg 61.2858 MB)
    CUDA 0 -> CUDA 2    450.5016 GB 1565.8445 MB/s  (transfers : 7163 - avg 64.4023 MB)
    CUDA 2 -> CUDA 0    467.1385 GB 1623.6707 MB/s  (transfers : 7189 - avg 66.5391 MB)
    CUDA 1 -> CUDA 2    471.9199 GB 1640.2899 MB/s  (transfers : 9051 - avg 53.3914 MB)
    CUDA 2 -> CUDA 1    700.8458 GB 2435.9860 MB/s  (transfers : 11589 - avg 61.9265 MB)
    NUMA 0 -> CUDA 3    48.2185 GB  167.5969 MB/s   (transfers : 726 - avg 68.0107 MB)
    CUDA 3 -> NUMA 0    202.8596 GB 705.0954 MB/s   (transfers : 3059 - avg 67.9072 MB)
    CUDA 0 -> CUDA 3    395.4815 GB 1374.6067 MB/s  (transfers : 6455 - avg 62.7379 MB)
    CUDA 3 -> CUDA 0    486.9690 GB 1692.5970 MB/s  (transfers : 7590 - avg 65.6991 MB)
    CUDA 1 -> CUDA 3    524.3406 GB 1822.4922 MB/s  (transfers : 9740 - avg 55.1257 MB)
    CUDA 3 -> CUDA 1    353.8773 GB 1229.9993 MB/s  (transfers : 6892 - avg 52.5784 MB)
    CUDA 2 -> CUDA 3    337.0722 GB 1171.5887 MB/s  (transfers : 4886 - avg 70.6431 MB)
    CUDA 3 -> CUDA 2    724.7443 GB 2519.0510 MB/s  (transfers : 10971 - avg 67.6454 MB)
    NUMA 0 -> CUDA 4    39.5301 GB  137.3977 MB/s   (transfers : 704 - avg 57.4983 MB)
    CUDA 4 -> NUMA 0    51.4072 GB  178.6799 MB/s   (transfers : 1050 - avg 50.1342 MB)
    CUDA 0 -> CUDA 4    378.5577 GB 1315.7827 MB/s  (transfers : 6424 - avg 60.3429 MB)
    CUDA 4 -> CUDA 0    313.5901 GB 1089.9698 MB/s  (transfers : 5567 - avg 57.6821 MB)
    CUDA 1 -> CUDA 4    484.5900 GB 1684.3274 MB/s  (transfers : 9445 - avg 52.5379 MB)
    CUDA 4 -> CUDA 1    296.4626 GB 1030.4381 MB/s  (transfers : 6054 - avg 50.1450 MB)
    CUDA 2 -> CUDA 4    525.8330 GB 1827.6790 MB/s  (transfers : 7021 - avg 76.6918 MB)
    CUDA 4 -> CUDA 2    307.7303 GB 1069.6022 MB/s  (transfers : 5506 - avg 57.2313 MB)
    CUDA 3 -> CUDA 4    522.4628 GB 1815.9648 MB/s  (transfers : 7565 - avg 70.7207 MB)
    CUDA 4 -> CUDA 3    709.5829 GB 2466.3525 MB/s  (transfers : 10193 - avg 71.2855 MB)
    NUMA 0 -> CUDA 5    36.7464 GB  127.7223 MB/s   (transfers : 623 - avg 60.3986 MB)
    CUDA 5 -> NUMA 0    133.4323 GB 463.7810 MB/s   (transfers : 2392 - avg 57.1215 MB)
    CUDA 0 -> CUDA 5    492.1053 GB 1710.4486 MB/s  (transfers : 7620 - avg 66.1307 MB)
    CUDA 5 -> CUDA 0    447.4872 GB 1555.3660 MB/s  (transfers : 6856 - avg 66.8359 MB)
    CUDA 1 -> CUDA 5    716.0729 GB 2488.9098 MB/s  (transfers : 10350 - avg 70.8462 MB)
    CUDA 5 -> CUDA 1    402.6089 GB 1399.3788 MB/s  (transfers : 7345 - avg 56.1295 MB)
    CUDA 2 -> CUDA 5    459.1772 GB 1595.9974 MB/s  (transfers : 7275 - avg 64.6319 MB)
    CUDA 5 -> CUDA 2    512.6324 GB 1781.7958 MB/s  (transfers : 7877 - avg 66.6416 MB)
    CUDA 3 -> CUDA 5    559.6148 GB 1945.0961 MB/s  (transfers : 7573 - avg 75.6696 MB)
    CUDA 5 -> CUDA 3    551.3154 GB 1916.2491 MB/s  (transfers : 8004 - avg 70.5331 MB)
    CUDA 4 -> CUDA 5    403.6525 GB 1403.0059 MB/s  (transfers : 7141 - avg 57.8827 MB)
    CUDA 5 -> CUDA 4    667.2792 GB 2319.3132 MB/s  (transfers : 10388 - avg 65.7772 MB)
    NUMA 0 -> CUDA 6    36.5536 GB  127.0522 MB/s   (transfers : 550 - avg 68.0562 MB)
    CUDA 6 -> NUMA 0    126.1244 GB 438.3802 MB/s   (transfers : 2126 - avg 60.7485 MB)
    CUDA 0 -> CUDA 6    499.6056 GB 1736.5172 MB/s  (transfers : 7533 - avg 67.9140 MB)
    CUDA 6 -> CUDA 0    503.7355 GB 1750.8716 MB/s  (transfers : 7365 - avg 70.0374 MB)
    CUDA 1 -> CUDA 6    536.4423 GB 1864.5532 MB/s  (transfers : 8588 - avg 63.9633 MB)
    CUDA 6 -> CUDA 1    429.3033 GB 1492.1621 MB/s  (transfers : 6934 - avg 63.3987 MB)
    CUDA 2 -> CUDA 6    534.6896 GB 1858.4609 MB/s  (transfers : 8164 - avg 67.0654 MB)
    CUDA 6 -> CUDA 2    343.1891 GB 1192.8483 MB/s  (transfers : 5441 - avg 64.5884 MB)
    CUDA 3 -> CUDA 6    431.0230 GB 1498.1391 MB/s  (transfers : 5536 - avg 79.7268 MB)
    CUDA 6 -> CUDA 3    535.2645 GB 1860.4591 MB/s  (transfers : 6578 - avg 83.3249 MB)
    CUDA 4 -> CUDA 6    397.4898 GB 1381.5853 MB/s  (transfers : 5752 - avg 70.7631 MB)
    CUDA 6 -> CUDA 4    347.6010 GB 1208.1827 MB/s  (transfers : 5720 - avg 62.2279 MB)
    CUDA 5 -> CUDA 6    571.9941 GB 1988.1228 MB/s  (transfers : 7757 - avg 75.5088 MB)
    CUDA 6 -> CUDA 5    773.4525 GB 2688.3467 MB/s  (transfers : 10802 - avg 73.3212 MB)
    NUMA 0 -> CUDA 7    32.4117 GB  112.6559 MB/s   (transfers : 716 - avg 46.3542 MB)
    CUDA 7 -> NUMA 0    101.7807 GB 353.7668 MB/s   (transfers : 2028 - avg 51.3922 MB)
    CUDA 0 -> CUDA 7    565.1070 GB 1964.1844 MB/s  (transfers : 9144 - avg 63.2841 MB)
    CUDA 7 -> CUDA 0    539.1749 GB 1874.0503 MB/s  (transfers : 8681 - avg 63.6004 MB)
    CUDA 1 -> CUDA 7    665.1335 GB 2311.8539 MB/s  (transfers : 12870 - avg 52.9213 MB)
    CUDA 7 -> CUDA 1    541.7370 GB 1882.9554 MB/s  (transfers : 10455 - avg 53.0597 MB)
    CUDA 2 -> CUDA 7    705.2172 GB 2451.1754 MB/s  (transfers : 11433 - avg 63.1630 MB)
    CUDA 7 -> CUDA 2    652.1891 GB 2266.8617 MB/s  (transfers : 10446 - avg 63.9328 MB)
    CUDA 3 -> CUDA 7    704.0863 GB 2447.2446 MB/s  (transfers : 11128 - avg 64.7901 MB)
    CUDA 7 -> CUDA 3    573.6588 GB 1993.9081 MB/s  (transfers : 8631 - avg 68.0601 MB)
    CUDA 4 -> CUDA 7    600.3521 GB 2086.6877 MB/s  (transfers : 10220 - avg 60.1527 MB)
    CUDA 7 -> CUDA 4    494.1706 GB 1717.6250 MB/s  (transfers : 8540 - avg 59.2542 MB)
    CUDA 5 -> CUDA 7    354.7237 GB 1232.9392 MB/s  (transfers : 6475 - avg 56.0984 MB)
    CUDA 7 -> CUDA 5    547.3857 GB 1902.5886 MB/s  (transfers : 8801 - avg 63.6886 MB)
    CUDA 6 -> CUDA 7    639.4545 GB 2222.5987 MB/s  (transfers : 9652 - avg 67.8410 MB)
    CUDA 7 -> CUDA 6    831.1680 GB 2888.9510 MB/s  (transfers : 12008 - avg 70.8791 MB)
Total transfers: 30162.5039 GB

StarPU-1.4 bus stats:

#---------------------
Data transfer stats:
    NUMA 0 -> CUDA 0    1338.7977 GB    823.5640 MB/s   (transfers : 20459 - avg 67.0086 MB)
    CUDA 0 -> NUMA 0    953.6840 GB 586.6606 MB/s   (transfers : 16224 - avg 60.1931 MB)
    NUMA 0 -> CUDA 1    2220.5159 GB    1365.9547 MB/s  (transfers : 27024 - avg 84.1403 MB)
    CUDA 1 -> NUMA 0    680.4276 GB 418.5663 MB/s   (transfers : 10676 - avg 65.2639 MB)
    CUDA 0 -> CUDA 1    384.6295 GB 236.6056 MB/s   (transfers : 5589 - avg 70.4707 MB)
    CUDA 1 -> CUDA 0    632.2552 GB 388.9330 MB/s   (transfers : 9657 - avg 67.0425 MB)
    NUMA 0 -> CUDA 2    1974.4674 GB    1214.5974 MB/s  (transfers : 24295 - avg 83.2210 MB)
    CUDA 2 -> NUMA 0    736.1757 GB 452.8599 MB/s   (transfers : 11703 - avg 64.4146 MB)
    CUDA 0 -> CUDA 2    464.1732 GB 285.5370 MB/s   (transfers : 7218 - avg 65.8511 MB)
    CUDA 2 -> CUDA 0    460.1991 GB 283.0924 MB/s   (transfers : 6748 - avg 69.8346 MB)
    CUDA 1 -> CUDA 2    378.5329 GB 232.8552 MB/s   (transfers : 5487 - avg 70.6429 MB)
    CUDA 2 -> CUDA 1    847.5342 GB 521.3623 MB/s   (transfers : 10883 - avg 79.7459 MB)
    NUMA 0 -> CUDA 3    2348.7388 GB    1444.8311 MB/s  (transfers : 29448 - avg 81.6731 MB)
    CUDA 3 -> NUMA 0    713.3685 GB 438.8300 MB/s   (transfers : 10214 - avg 71.5184 MB)
    CUDA 0 -> CUDA 3    558.1124 GB 343.3239 MB/s   (transfers : 8283 - avg 68.9976 MB)
    CUDA 3 -> CUDA 0    480.9552 GB 295.8605 MB/s   (transfers : 6561 - avg 75.0645 MB)
    CUDA 1 -> CUDA 3    386.2551 GB 237.6056 MB/s   (transfers : 5257 - avg 75.2378 MB)
    CUDA 3 -> CUDA 1    508.9437 GB 313.0777 MB/s   (transfers : 6956 - avg 74.9221 MB)
    CUDA 2 -> CUDA 3    400.0684 GB 246.1028 MB/s   (transfers : 7030 - avg 58.2745 MB)
    CUDA 3 -> CUDA 2    789.4544 GB 485.6344 MB/s   (transfers : 10949 - avg 73.8333 MB)
    NUMA 0 -> CUDA 4    2626.2900 GB    1615.5672 MB/s  (transfers : 31490 - avg 85.4024 MB)
    CUDA 4 -> NUMA 0    750.1584 GB 461.4614 MB/s   (transfers : 8682 - avg 88.4776 MB)
    CUDA 0 -> CUDA 4    658.1709 GB 404.8751 MB/s   (transfers : 9663 - avg 69.7472 MB)
    CUDA 4 -> CUDA 0    618.8331 GB 380.6763 MB/s   (transfers : 8840 - avg 71.6838 MB)
    CUDA 1 -> CUDA 4    418.9629 GB 257.7258 MB/s   (transfers : 6602 - avg 64.9830 MB)
    CUDA 4 -> CUDA 1    566.4614 GB 348.4598 MB/s   (transfers : 8225 - avg 70.5236 MB)
    CUDA 2 -> CUDA 4    494.4859 GB 304.1839 MB/s   (transfers : 8537 - avg 59.3128 MB)
    CUDA 4 -> CUDA 2    445.8077 GB 274.2394 MB/s   (transfers : 6456 - avg 70.7105 MB)
    CUDA 3 -> CUDA 4    460.8046 GB 283.4648 MB/s   (transfers : 7233 - avg 65.2376 MB)
    CUDA 4 -> CUDA 3    943.5418 GB 580.4215 MB/s   (transfers : 13386 - avg 72.1789 MB)
    NUMA 0 -> CUDA 5    122.0536 GB 75.0815 MB/s    (transfers : 10087 - avg 12.3905 MB)
    CUDA 5 -> NUMA 0    1185.4202 GB    729.2134 MB/s   (transfers : 19643 - avg 61.7966 MB)
    CUDA 0 -> CUDA 5    968.1405 GB 595.5534 MB/s   (transfers : 16301 - avg 60.8169 MB)
    CUDA 5 -> CUDA 0    523.2678 GB 321.8892 MB/s   (transfers : 9131 - avg 58.6821 MB)
    CUDA 1 -> CUDA 5    842.0186 GB 517.9693 MB/s   (transfers : 13309 - avg 64.7853 MB)
    CUDA 5 -> CUDA 1    472.1972 GB 290.4729 MB/s   (transfers : 7634 - avg 63.3390 MB)
    CUDA 2 -> CUDA 5    698.3336 GB 429.5812 MB/s   (transfers : 13098 - avg 54.5956 MB)
    CUDA 5 -> CUDA 2    496.5089 GB 305.4283 MB/s   (transfers : 8367 - avg 60.7655 MB)
    CUDA 3 -> CUDA 5    680.8994 GB 418.8565 MB/s   (transfers : 10590 - avg 65.8396 MB)
    CUDA 5 -> CUDA 3    735.2136 GB 452.2679 MB/s   (transfers : 11432 - avg 65.8554 MB)
    CUDA 4 -> CUDA 5    881.4655 GB 542.2351 MB/s   (transfers : 13900 - avg 64.9367 MB)
    CUDA 5 -> CUDA 4    1018.6028 GB    626.5954 MB/s   (transfers : 15072 - avg 69.2044 MB)
    NUMA 0 -> CUDA 6    1623.2544 GB    998.5478 MB/s   (transfers : 23629 - avg 70.3463 MB)
    CUDA 6 -> NUMA 0    935.3520 GB 575.3834 MB/s   (transfers : 13628 - avg 70.2818 MB)
    CUDA 0 -> CUDA 6    585.4160 GB 360.1197 MB/s   (transfers : 9800 - avg 61.1700 MB)
    CUDA 6 -> CUDA 0    512.5333 GB 315.2857 MB/s   (transfers : 7867 - avg 66.7134 MB)
    CUDA 1 -> CUDA 6    705.4360 GB 433.9502 MB/s   (transfers : 11261 - avg 64.1476 MB)
    CUDA 6 -> CUDA 1    596.2983 GB 366.8140 MB/s   (transfers : 8717 - avg 70.0481 MB)
    CUDA 2 -> CUDA 6    533.6901 GB 328.3004 MB/s   (transfers : 7509 - avg 72.7791 MB)
    CUDA 6 -> CUDA 2    576.7213 GB 354.7711 MB/s   (transfers : 8479 - avg 69.6500 MB)
    CUDA 3 -> CUDA 6    604.8471 GB 372.0727 MB/s   (transfers : 8328 - avg 74.3712 MB)
    CUDA 6 -> CUDA 3    517.1010 GB 318.0955 MB/s   (transfers : 7923 - avg 66.8322 MB)
    CUDA 4 -> CUDA 6    757.3788 GB 465.9029 MB/s   (transfers : 11494 - avg 67.4749 MB)
    CUDA 6 -> CUDA 4    561.9124 GB 345.6614 MB/s   (transfers : 8608 - avg 66.8446 MB)
    CUDA 5 -> CUDA 6    499.9295 GB 307.5325 MB/s   (transfers : 9079 - avg 56.3859 MB)
    CUDA 6 -> CUDA 5    1239.6105 GB    762.5485 MB/s   (transfers : 19311 - avg 65.7325 MB)
    NUMA 0 -> CUDA 7    1688.1376 GB    1038.4607 MB/s  (transfers : 24379 - avg 70.9075 MB)
    CUDA 7 -> NUMA 0    1012.5004 GB    622.8414 MB/s   (transfers : 17842 - avg 58.1101 MB)
    CUDA 0 -> CUDA 7    642.5500 GB 395.2658 MB/s   (transfers : 11577 - avg 56.8343 MB)
    CUDA 7 -> CUDA 0    473.7164 GB 291.4075 MB/s   (transfers : 8139 - avg 59.6002 MB)
    CUDA 1 -> CUDA 7    616.4796 GB 379.2284 MB/s   (transfers : 10562 - avg 59.7685 MB)
    CUDA 7 -> CUDA 1    514.6750 GB 316.6032 MB/s   (transfers : 7356 - avg 71.6459 MB)
    CUDA 2 -> CUDA 7    636.5130 GB 391.5520 MB/s   (transfers : 9612 - avg 67.8100 MB)
    CUDA 7 -> CUDA 2    732.0572 GB 450.3262 MB/s   (transfers : 10375 - avg 72.2532 MB)
    CUDA 3 -> CUDA 7    615.6310 GB 378.7064 MB/s   (transfers : 9228 - avg 68.3145 MB)
    CUDA 7 -> CUDA 3    608.7626 GB 374.4814 MB/s   (transfers : 9402 - avg 66.3022 MB)
    CUDA 4 -> CUDA 7    948.1803 GB 583.2747 MB/s   (transfers : 14035 - avg 69.1797 MB)
    CUDA 7 -> CUDA 4    744.2115 GB 457.8030 MB/s   (transfers : 11214 - avg 67.9573 MB)
    CUDA 5 -> CUDA 7    657.0583 GB 404.1905 MB/s   (transfers : 11256 - avg 59.7750 MB)
    CUDA 7 -> CUDA 5    880.2009 GB 541.4570 MB/s   (transfers : 14233 - avg 63.3265 MB)
    CUDA 6 -> CUDA 7    674.8632 GB 415.1432 MB/s   (transfers : 10820 - avg 63.8687 MB)
    CUDA 7 -> CUDA 6    696.8048 GB 428.6406 MB/s   (transfers : 10635 - avg 67.0924 MB)
Total transfers: 56256.7500 GB

One can easily see that starpu-1.4 sends much more data between the CPU and the GPUs through the PCI-Express bus. It seems these PCI-e transfers limit performance.

sthibaul commented 5 months ago

How can I help you dig into this issue?

I still need to take the time to reproduce the case in simulation. Perhaps you can post the updated sampling/bus/ directory with all the now-proper bandwidths?

Muxas commented 5 months ago

Perhaps you can post the updated sampling/bus/ directory with all the now-proper bandwidths?

Sure, here it is: bus-starpu-1.4.tar.gz

Muxas commented 5 months ago

Another update (sorry for the spam): setting STARPU_NCUDA=4 for starpu-1.4 solves the problem, and I then get half of the starpu-1.3 performance on the 8x Nvidia A100 node. I believe the reason for the bad performance is the NVSwitch. The bandwidth between two GPUs, measured when no other GPU is transferring data, is much higher than in a situation where all 8 GPUs transfer data to each other. For example, the GPU-GPU bandwidth measured by starpu_machine_display shows around 250GB/s, while the Nvidia A100 data sheet states 600GB/s; I believe 600GB/s is the maximal throughput from a single GPU to all the other GPUs. In the case of 8x Nvidia A100 all communicating at once, we get less than 100GB/s instead of the reported 250GB/s.
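
Below is a minimal sketch of how the device-to-device bandwidth of a single GPU pair can be measured with the CUDA runtime (buffer size, device indices and repeat count are illustrative; error checking is omitted). Running one such copy per GPU pair concurrently approximates the all-to-all load described above:

#include <cuda_runtime.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    const size_t size = 256UL << 20;   /* 256 MB per copy */
    const int src = 0, dst = 1, repeats = 20;
    void *src_buf, *dst_buf;

    /* Allocate a buffer on each device and enable peer access both ways */
    cudaSetDevice(src);
    cudaDeviceEnablePeerAccess(dst, 0);
    cudaMalloc(&src_buf, size);
    cudaSetDevice(dst);
    cudaDeviceEnablePeerAccess(src, 0);
    cudaMalloc(&dst_buf, size);

    cudaSetDevice(src);
    cudaMemcpyPeer(dst_buf, dst, src_buf, src, size);   /* warm-up */
    cudaDeviceSynchronize();

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < repeats; i++)
        cudaMemcpyPeer(dst_buf, dst, src_buf, src, size);
    cudaDeviceSynchronize();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("GPU%d -> GPU%d: %.1f GB/s\n", src, dst,
           (double)size * repeats / secs / 1e9);
    return 0;
}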

Muxas commented 5 months ago

Hi again! I just tested the updated starpu-1.3 branch (commit ed1956801806d6bf51bbf859f0908872b902ec04 of the GitHub repo) on a server with 4x Nvidia V100. Here is the starpu_machine_display output:

StarPU has found :
4 STARPU_CPU_WORKER workers:
    CPU 0
    CPU 1
    CPU 2
    CPU 3
4 STARPU_CUDA_WORKER workers:
    CUDA 0.0 (Tesla V100-SXM2-16GB 14.2 GiB 1a:00.0)
    CUDA 1.0 (Tesla V100-SXM2-16GB 14.2 GiB 1c:00.0)
    CUDA 2.0 (Tesla V100-SXM2-16GB 14.2 GiB 1d:00.0)
    CUDA 3.0 (Tesla V100-SXM2-16GB 14.2 GiB 1e:00.0)
No STARPU_OPENCL_WORKER worker

topology ... (hwloc logical indexes)
numa  0 pack  0 core 0  PU 0    CUDA 1.0 (Tesla V100-SXM2-16GB 14.2 GiB 1c:00.0)    
        core 1  PU 1    CUDA 2.0 (Tesla V100-SXM2-16GB 14.2 GiB 1d:00.0)    
        core 2  PU 2    CUDA 3.0 (Tesla V100-SXM2-16GB 14.2 GiB 1e:00.0)    
        core 3  PU 3    CPU 0   
        core 4  PU 4    CPU 1   
        core 5  PU 5    CPU 2   
        core 6  PU 6    CPU 3   
        core 7  PU 7    
        core 8  PU 8    
        core 9  PU 9    
        core 10 PU 10   
        core 11 PU 11   
        core 12 PU 12   
        core 13 PU 13   
        core 14 PU 14   
        core 15 PU 15   
        core 16 PU 16   
        core 17 PU 17   
numa  1 pack  1 core 18 PU 18   CUDA 0.0 (Tesla V100-SXM2-16GB 14.2 GiB 1a:00.0)    
        core 19 PU 19   
        core 20 PU 20   
        core 21 PU 21   
        core 22 PU 22   
        core 23 PU 23   
        core 24 PU 24   
        core 25 PU 25   
        core 26 PU 26   
        core 27 PU 27   
        core 28 PU 28   
        core 29 PU 29   
        core 30 PU 30   
        core 31 PU 31   
        core 32 PU 32   
        core 33 PU 33   
        core 34 PU 34   
        core 35 PU 35   

bandwidth (MB/s) and latency (us)...
from/to NUMA 0  CUDA 0  CUDA 1  CUDA 2  CUDA 3  
NUMA 0  0   12309   12329   12332   12332   
CUDA 0  13103   0   47507   47698   47698   
CUDA 1  13102   47687   0   47705   47696   
CUDA 2  13101   47691   47698   0   47688   
CUDA 3  13102   47704   47700   47684   0   

NUMA 0  0   9   9   9   9   
CUDA 0  9   0   11  11  11  
CUDA 1  9   11  0   11  11  
CUDA 2  8   11  11  0   11  
CUDA 3  9   11  11  11  0   

GPU NUMA in preference order (logical index), host-to-device, device-to-host
CUDA_0   0 12309 13103   1 12327 13098  
CUDA_1   0 12329 13102   1 12325 13099  
CUDA_2   0 12332 13101   1 12327 13098  
CUDA_3   0 12332 13102   1 12329 13098

CUDA 0 is on NUMA 1 for some reason again.

Muxas commented 5 months ago

Tried commit https://github.com/starpu-runtime/starpu/commit/ed1956801806d6bf51bbf859f0908872b902ec04 of the GitHub repo on a server with 4x V100 and got an upsetting surprise for my application:

  1. starpu-1.3.11 with STARPU_WORKERS_NOBIND=1 performs at 25 Tflops/s.
  2. starpu-1.3.11 with STARPU_WORKERS_NOBIND=0 performs at 12 Tflops/s.
  3. starpu-1.3 performs at 1.8 Tflops/s.

Everything else seems to be the same; only the version of StarPU is different.

Muxas commented 5 months ago

Here is the performance model file for cublasSgemm for the latest commit of the starpu-1.3 branch:

##################
# Performance Model Version
45

####################
# COMBs
# number of combinations
4
####################
# COMB_3
# number of types devices
1
####################
# DEV_0
# device type (CPU - 0, CUDA - 1, OPENCL - 2, MIC - 3, MPI_MS - 5)
1
####################
# DEV_0
# device id 
0
####################
# DEV_0
# number of cores 
1
##########
# number of implementations
1
#####
# Model for cuda0_impl0 (Comb3)
# number of entries
2
# sumlnx    sumlnx2     sumlny      sumlnxlny   alpha       beta        n   minx        maxx
0.000000e+00    0.000000e+00    0.000000e+00    0.000000e+00    nan             nan             0   0               0              
# a     b       c
nan             nan             nan            
# not multiple-regression-base
0
# hash      size        flops       mean (us)   dev (us)    sum     sum2        n
6cac4676    423624704       1.073742e+11    7.524503e+03    8.031120e+02    1.023332e+07    7.787787e+10    1360
9a5c4e6a    12582912        2.147484e+09    4.700963e+02    2.476145e+01    5.017902e+07    2.365442e+10    106742

####################
# COMB_1
# number of types devices
1
####################
# DEV_0
# device type (CPU - 0, CUDA - 1, OPENCL - 2, MIC - 3, MPI_MS - 5)
1
####################
# DEV_0
# device id 
1
####################
# DEV_0
# number of cores 
1
##########
# number of implementations
1
#####
# Model for cuda1_impl0 (Comb1)
# number of entries
2
# sumlnx    sumlnx2     sumlny      sumlnxlny   alpha       beta        n   minx        maxx
0.000000e+00    0.000000e+00    0.000000e+00    0.000000e+00    nan             nan             0   0               0              
# a     b       c
nan             nan             nan            
# not multiple-regression-base
0
# hash      size        flops       mean (us)   dev (us)    sum     sum2        n
6cac4676    423624704       1.073742e+11    9.384886e+03    1.617014e+03    1.745589e+06    1.686849e+10    186
9a5c4e6a    12582912        2.147484e+09    2.045799e+03    4.604698e+02    1.378459e+07    2.962918e+10    6738

####################
# COMB_2
# number of types devices
1
####################
# DEV_0
# device type (CPU - 0, CUDA - 1, OPENCL - 2, MIC - 3, MPI_MS - 5)
1
####################
# DEV_0
# device id 
2
####################
# DEV_0
# number of cores 
1
##########
# number of implementations
1
#####
# Model for cuda2_impl0 (Comb2)
# number of entries
2
# sumlnx    sumlnx2     sumlny      sumlnxlny   alpha       beta        n   minx        maxx
0.000000e+00    0.000000e+00    0.000000e+00    0.000000e+00    nan             nan             0   0               0              
# a     b       c
nan             nan             nan            
# not multiple-regression-base
0
# hash      size        flops       mean (us)   dev (us)    sum     sum2        n
6cac4676    423624704       1.073742e+11    1.011996e+04    1.480795e+03    1.629313e+06    1.684161e+10    161
9a5c4e6a    12582912        2.147484e+09    2.054684e+03    4.624059e+02    1.328148e+07    2.867137e+10    6464

####################
# COMB_0
# number of types devices
1
####################
# DEV_0
# device type (CPU - 0, CUDA - 1, OPENCL - 2, MIC - 3, MPI_MS - 5)
1
####################
# DEV_0
# device id 
3
####################
# DEV_0
# number of cores 
1
##########
# number of implementations
1
#####
# Model for cuda3_impl0 (Comb0)
# number of entries
2
# sumlnx    sumlnx2     sumlny      sumlnxlny   alpha       beta        n   minx        maxx
0.000000e+00    0.000000e+00    0.000000e+00    0.000000e+00    nan             nan             0   0               0              
# a     b       c
nan             nan             nan            
# not multiple-regression-base
0
# hash      size        flops       mean (us)   dev (us)    sum     sum2        n
6cac4676    423624704       1.073742e+11    9.142048e+03    1.613087e+03    3.144865e+06    2.964561e+10    344
9a5c4e6a    12582912        2.147484e+09    2.102459e+03    4.772954e+02    1.144789e+07    2.530914e+10    5445

The cuBLAS gemm performance is around 5 Tflops/s for CUDA 0 and around 1 Tflops/s for the other CUDA devices; it should be around 14 Tflops/s. For reference, here is the performance model for STARPU_NCUDA=1 with starpu-1.3.11:

##################
# Performance Model Version
45

####################
# COMBs
# number of combinations
1
####################
# COMB_0
# number of types devices
1
####################
# DEV_0
# device type (CPU - 0, CUDA - 1, OPENCL - 2, MIC - 3, MPI_MS - 5)
1
####################
# DEV_0
# device id 
0
####################
# DEV_0
# number of cores 
1
##########
# number of implementations
1
#####
# Model for cuda0_impl0 (Comb0)
# number of entries
3
# sumlnx    sumlnx2     sumlny      sumlnxlny   alpha       beta        n   minx        maxx
0.000000e+00    0.000000e+00    0.000000e+00    0.000000e+00    nan             nan             0   0               0              
# a     b       c
nan             nan             nan            
# not multiple-regression-base
0
# hash      size        flops       mean (us)   dev (us)    sum     sum2        n
d295bde2    1065353216      4.294967e+11    2.963925e+04    8.908781e+02    2.934285e+06    8.704859e+10    99
ca98c721    100663296       3.435974e+10    2.801510e+03    4.595160e+02    1.834989e+06    5.279048e+09    655
2465b5fe    37748736        8.589935e+09    5.997161e+02    3.460896e+01    1.583850e+06    9.530240e+08    2641

Muxas commented 5 months ago

Sorry for another round of messages. I tried recompiling the starpu-1.3 tag from scratch just to double-check. This time the output of starpu_machine_display is much better:

StarPU has found :
32 STARPU_CPU_WORKER workers:
    CPU 0
    CPU 1
    CPU 2
    CPU 3
    CPU 4
    CPU 5
    CPU 6
    CPU 7
    CPU 8
    CPU 9
    CPU 10
    CPU 11
    CPU 12
    CPU 13
    CPU 14
    CPU 15
    CPU 16
    CPU 17
    CPU 18
    CPU 19
    CPU 20
    CPU 21
    CPU 22
    CPU 23
    CPU 24
    CPU 25
    CPU 26
    CPU 27
    CPU 28
    CPU 29
    CPU 30
    CPU 31
4 STARPU_CUDA_WORKER workers:
    CUDA 0.0 (Tesla V100-SXM2-16GB 14.2 GiB 1a:00.0)
    CUDA 1.0 (Tesla V100-SXM2-16GB 14.2 GiB 1c:00.0)
    CUDA 2.0 (Tesla V100-SXM2-16GB 14.2 GiB 1d:00.0)
    CUDA 3.0 (Tesla V100-SXM2-16GB 14.2 GiB 1e:00.0)
No STARPU_OPENCL_WORKER worker

topology ... (hwloc logical indexes)
numa  0 pack  0 core 0  PU 0    CUDA 1.0 (Tesla V100-SXM2-16GB 14.2 GiB 1c:00.0)    
        core 1  PU 1    CUDA 3.0 (Tesla V100-SXM2-16GB 14.2 GiB 1e:00.0)    
        core 2  PU 2    CPU 0   
        core 3  PU 3    CPU 1   
        core 4  PU 4    CPU 2   
        core 5  PU 5    CPU 3   
        core 6  PU 6    CPU 4   
        core 7  PU 7    CPU 5   
        core 8  PU 8    CPU 6   
        core 9  PU 9    CPU 7   
        core 10 PU 10   CPU 8   
        core 11 PU 11   CPU 9   
        core 12 PU 12   CPU 10  
        core 13 PU 13   CPU 11  
        core 14 PU 14   CPU 12  
        core 15 PU 15   CPU 13  
        core 16 PU 16   CPU 14  
        core 17 PU 17   CPU 15  
numa  1 pack  1 core 18 PU 18   CUDA 0.0 (Tesla V100-SXM2-16GB 14.2 GiB 1a:00.0)    
        core 19 PU 19   CUDA 2.0 (Tesla V100-SXM2-16GB 14.2 GiB 1d:00.0)    
        core 20 PU 20   CPU 16  
        core 21 PU 21   CPU 17  
        core 22 PU 22   CPU 18  
        core 23 PU 23   CPU 19  
        core 24 PU 24   CPU 20  
        core 25 PU 25   CPU 21  
        core 26 PU 26   CPU 22  
        core 27 PU 27   CPU 23  
        core 28 PU 28   CPU 24  
        core 29 PU 29   CPU 25  
        core 30 PU 30   CPU 26  
        core 31 PU 31   CPU 27  
        core 32 PU 32   CPU 28  
        core 33 PU 33   CPU 29  
        core 34 PU 34   CPU 30  
        core 35 PU 35   CPU 31  

bandwidth (MB/s) and latency (us)...
from/to NUMA 0  CUDA 0  CUDA 1  CUDA 2  CUDA 3  
NUMA 0  0   12312   12329   12333   12334   
CUDA 0  13094   0   47503   47708   47711   
CUDA 1  13103   47714   0   47713   47721   
CUDA 2  13097   47720   47705   0   47704   
CUDA 3  13103   47715   47715   47700   0   

NUMA 0  0   9   9   9   9   
CUDA 0  8   0   11  11  11  
CUDA 1  8   11  0   11  11  
CUDA 2  8   11  11  0   11  
CUDA 3  8   11  11  11  0   

GPU NUMA in preference order (logical index), host-to-device, device-to-host
CUDA_0   1   0  
CUDA_1   0   1  
CUDA_2   1   0  
CUDA_3   0   1

It is still strange to me why CUDA 0 is attached to NUMA 1, but it seems like it is just an enumeration issue. Not a big deal. Running my application with the newly recompiled StarPU-1.3 gives me 25 Tflops/s, as it did before. This is still far from the ideal value (40 Tflops/s is what I am aiming for, as a single V100 gives me 10 Tflops/s), but at least cublasSgemm works fast now.

Muxas commented 5 months ago

Comparison of data transfers for starpu-1.3 and starpu-1.4 for the same application with dm scheduler:

starpu-1.3:

Training performance: 23.6482709767857 Tflops/s
Loss on the last batch: 9.039822578430176
Shutdown cuBLAS

#---------------------
Data transfer stats:
    NUMA 0 -> CUDA 0    3.4629 GB   16.9099 MB/s    (transfers : 556 - avg 6.3778 MB)
    CUDA 0 -> NUMA 0    18.0936 GB  88.3529 MB/s    (transfers : 870 - avg 21.2964 MB)
    NUMA 0 -> CUDA 1    1.9955 GB   9.7442 MB/s (transfers : 514 - avg 3.9755 MB)
    CUDA 1 -> NUMA 0    13.7536 GB  67.1600 MB/s    (transfers : 598 - avg 23.5513 MB)
    CUDA 0 -> CUDA 1    974.8757 GB 4760.4133 MB/s  (transfers : 48106 - avg 20.7515 MB)
    CUDA 1 -> CUDA 0    939.0652 GB 4585.5472 MB/s  (transfers : 48247 - avg 19.9308 MB)
    NUMA 0 -> CUDA 2    1.7876 GB   8.7289 MB/s (transfers : 413 - avg 4.4321 MB)
    CUDA 2 -> NUMA 0    11.6667 GB  56.9697 MB/s    (transfers : 420 - avg 28.4445 MB)
    CUDA 0 -> CUDA 2    1263.5773 GB    6170.1702 MB/s  (transfers : 68762 - avg 18.8171 MB)
    CUDA 2 -> CUDA 0    1001.8527 GB    4892.1436 MB/s  (transfers : 54590 - avg 18.7928 MB)
    CUDA 1 -> CUDA 2    1141.3624 GB    5573.3827 MB/s  (transfers : 49602 - avg 23.5627 MB)
    CUDA 2 -> CUDA 1    1128.6001 GB    5511.0627 MB/s  (transfers : 51073 - avg 22.6281 MB)
    NUMA 0 -> CUDA 3    1.5092 GB   7.3698 MB/s (transfers : 505 - avg 3.0603 MB)
    CUDA 3 -> NUMA 0    4.6699 GB   22.8037 MB/s    (transfers : 220 - avg 21.7364 MB)
    CUDA 0 -> CUDA 3    882.4919 GB 4309.2919 MB/s  (transfers : 49213 - avg 18.3625 MB)
    CUDA 3 -> CUDA 0    1185.3740 GB    5788.2939 MB/s  (transfers : 63144 - avg 19.2231 MB)
    CUDA 1 -> CUDA 3    1354.6561 GB    6614.9140 MB/s  (transfers : 63505 - avg 21.8434 MB)
    CUDA 3 -> CUDA 1    1395.1045 GB    6812.4267 MB/s  (transfers : 60684 - avg 23.5414 MB)
    CUDA 2 -> CUDA 3    1008.7847 GB    4925.9904 MB/s  (transfers : 52911 - avg 19.5233 MB)
    CUDA 3 -> CUDA 2    890.3217 GB 4347.5241 MB/s  (transfers : 47680 - avg 19.1210 MB)
Total transfers: 13223.0059 GB
#---------------------

starpu-1.4:

Training performance: 18.760980099682563 Tflops/s
Loss on the last batch: 9.000470161437988
Shutdown cuBLAS

#---------------------
Data transfer stats:
    NUMA 0 -> CUDA 0    137.9982 GB 547.6573 MB/s   (transfers : 30791 - avg 4.5893 MB)
    CUDA 0 -> NUMA 0    13.4095 GB  53.2167 MB/s    (transfers : 2869 - avg 4.7861 MB)
    NUMA 0 -> CUDA 1    145.6027 GB 577.8361 MB/s   (transfers : 22479 - avg 6.6327 MB)
    CUDA 1 -> NUMA 0    12.5233 GB  49.6997 MB/s    (transfers : 1732 - avg 7.4041 MB)
    CUDA 0 -> CUDA 1    977.9714 GB 3881.1594 MB/s  (transfers : 56959 - avg 17.5818 MB)
    CUDA 1 -> CUDA 0    956.9282 GB 3797.6476 MB/s  (transfers : 56093 - avg 17.4691 MB)
    NUMA 0 -> CUDA 2    145.2166 GB 576.3039 MB/s   (transfers : 19100 - avg 7.7854 MB)
    CUDA 2 -> NUMA 0    7.8284 GB   31.0676 MB/s    (transfers : 1835 - avg 4.3685 MB)
    CUDA 0 -> CUDA 2    1205.8030 GB    4785.3267 MB/s  (transfers : 62539 - avg 19.7436 MB)
    CUDA 2 -> CUDA 0    1158.9956 GB    4599.5677 MB/s  (transfers : 58094 - avg 20.4292 MB)
    CUDA 1 -> CUDA 2    914.6765 GB 3629.9675 MB/s  (transfers : 51454 - avg 18.2032 MB)
    CUDA 2 -> CUDA 1    1001.9981 GB    3976.5101 MB/s  (transfers : 48166 - avg 21.3023 MB)
    NUMA 0 -> CUDA 3    150.3286 GB 596.5910 MB/s   (transfers : 22012 - avg 6.9933 MB)
    CUDA 3 -> NUMA 0    12.7629 GB  50.6505 MB/s    (transfers : 2163 - avg 6.0422 MB)
    CUDA 0 -> CUDA 3    987.5371 GB 3919.1200 MB/s  (transfers : 56623 - avg 17.8591 MB)
    CUDA 3 -> CUDA 0    981.4284 GB 3894.8770 MB/s  (transfers : 52884 - avg 19.0035 MB)
    CUDA 1 -> CUDA 3    1232.1252 GB    4889.7872 MB/s  (transfers : 70597 - avg 17.8718 MB)
    CUDA 3 -> CUDA 1    1272.3649 GB    5049.4811 MB/s  (transfers : 68597 - avg 18.9936 MB)
    CUDA 2 -> CUDA 3    1021.7541 GB    4054.9122 MB/s  (transfers : 47532 - avg 22.0120 MB)
    CUDA 3 -> CUDA 2    1045.3053 GB    4148.3769 MB/s  (transfers : 51153 - avg 20.9253 MB)
Total transfers: 13382.5576 GB
#---------------------

The total amounts of data transferred are nearly the same, but StarPU-1.4 clearly sends much more data from NUMA 0 to the CUDA devices compared to StarPU-1.3. Transfers from CUDA to NUMA are also increased. This influences overall application performance a lot.

sthibaul commented 5 months ago

Looking at the detail of the platform xml file, I see that the nvswitch is not detected, do you have libnvidia-ml detected? That shows up in the ./configure output as:

checking whether nvidia-ml should be used... yes

I however also need to add a small piece of code to make it known to the perfmodel. In the meanwhile, you can try to make _starpu_cuda_direct_link always return 1. Otherwise starpu 1.4 thinks the transfers go through the pci buses (starpu 1.3 doesn't care)

sthibaul commented 5 months ago

The "CUDA: Also detect NVSwitch when checking the number of gpus sharing a bus" commit should be doing it.

sthibaul commented 5 months ago

Comparison of data transfers for starpu-1.3 and starpu-1.4 for the same application with dm scheduler:

The fix mentioned above can also fix that case, because we use the performance prediction for selecting the source node for transfers in _starpu_select_src_node, not only in the scheduler for task placement

sthibaul commented 5 months ago

It is still strange for me why CUDA 0 is attached to NUMA 1

Before starpu 1.4, we were just using the observed bandwidth to decide where to place the thread driving the gpu, so it might happen that with (mis-)luck, CUDA0 happens to get just a bit more bandwidth from NUMA1.

Starting from starpu 1.4 we use the hwloc information, which is much more stable :)

sthibaul commented 5 months ago

Nvidia A100 data sheet states 600GB/s

Do you know if there is a programmatic way to get this figure? (other than just measuring by starting transfers from all ends)

sthibaul commented 5 months ago

Nvidia A100 data sheet states 600GB/s

Do you know if there is a programmatic way to get this figure? (other than just measuring by starting transfers from all ends)

Ah, sorry, you meant the GPU bandwidth itself. I was thinking about the NVSwitch:

In a case of 8x Nvidia A100 we get less than 100GB/s instead of reported 250GB/s.

Do you mean that the total internal bandwidth of the NVSwitch doesn't allow a full 250GB/s for each GPU? Ideally that's the bandwidth I'd like to get access to. Possibly we'll just resort to measuring it.

Muxas commented 5 months ago

Looking at the detail of the platform xml file, I see that the nvswitch is not detected, do you have libnvidia-ml detected? That shows up in the ./configure output as:

Turning off STARPU_SILENT showed me:

[starpu][_starpu_init_cuda_config] Warning: could not find location of CUDA0, do you have the hwloc CUDA plugin installed?
[starpu][_starpu_init_cuda_config] Warning: could not find location of CUDA1, do you have the hwloc CUDA plugin installed?
[starpu][_starpu_init_cuda_config] Warning: could not find location of CUDA2, do you have the hwloc CUDA plugin installed?
[starpu][_starpu_init_cuda_config] Warning: could not find location of CUDA3, do you have the hwloc CUDA plugin installed?

And during configuration:

NVML found and can be compiled, but compiled application can not be run, you are probably on a machine without the CUDA driver
configure: WARNING: nvidia-ml could not be found. This will prevent from correct understanding of the machine topology.
checking whether nvidia-ml should be used... no

I clearly see that the library is present in /usr/lib64, but somehow it is not used.

sthibaul commented 5 months ago

Could you post the whole config.log?

Muxas commented 5 months ago

Surely! config.log

I am using a cluster with SLURM, so I configure and compile on an access node, which lacks CUDA devices. This is probably the reason why nvidia-ml is marked as not found: it is found at first, and it can even be used for compilation, but, according to config.log, no CUDA device is found and therefore libnvidia-ml is discarded.

Muxas commented 5 months ago

I am using a cluster with SLURM

This explains why recompiling the same code on an access node, after it was previously compiled on a compute node, gave totally different results (in one of the posts above).

Muxas commented 5 months ago

It seems I have to compile all the prerequisites (FxT and hwloc) on the compute nodes to get everything to work correctly.

Another issue is that I have the conda Python package manager installed, and the configure script finds hwloc-topo among the conda files, which is incorrect. I compiled hwloc with hwloc-calc, and somehow configure does not find it. Is there a way to point configure to the correct hwloc-calc?