Muxas opened 6 months ago
Dear StarPU team,
I think I figured out the reason for the 10x performance drop of my application. I disabled the kernels and printed bus stats for the 1.3.11 and 1.4.4 versions of StarPU.
StarPU-1.3.11:
Total transfers: 29525.0176 GB
Real time including initialization: 4:58.20
Training performance: 1039.465649305438 Tflops/s
StarPU-1.4.4:
Total transfers: 44654.4844 GB
Real time: 17:05.05
Training performance: 58.44558141509535 Tflops/s
For some reason DMDAR and the other DM** schedulers in StarPU-1.4.4 send nearly twice as much data. And if I look specifically at the slowest part, namely the PCI-Express connection between the CPU and the GPUs, the new 1.4.4 version sends 65 times more data than the old 1.3.11 version.
Could you please advise whether there is a way in StarPU-1.4.4 to bring this data transfer overhead back to where it was with the 1.3.11 version?
I believe there is something wrong with the Memory Manager.
P.S. Enabling CUDA memory map leads to the following error: ../../src/datawizard/copy_driver.c:312: _starpu_driver_copy_data_1_to_1: Assertion `0 && "src_replicate->memory_node == dst_replicate->mapped"' failed.
P.P.S. Using STARPU_REDUX leads to another but similar error. It seems like the memory manager is buggy in StarPU-1.4.4.
This is unexpected of course :)
Particularly since the 1.3 series introduces heuristics which are precisely meant to improve the overall flow of data.
AIUI, the involved matrices can completely fit in even just one GPU?
Could you also post results with starpu 1.3.0, to make sure whether it's the 1.2->1.3 development that introduced the first regression, or possibly some backports from the 1.4.x series to the 1.3.x series?
Could you also post the output of starpu_machine_display obtained with the different versions, to make sure dmda gets the correct understanding of available PCI bandwidths, gpu placement, etc.
Ideally, if you could provide your testcase with an LGPL-2.1+ licence, we could integrate it in our testsuite, and with simulation support we could add non-regression check-up.
Using STARPU_REDUX leads to another but similar error
That's not precise enough for us to be able to act :)
Output for the StarPU-1.3.0.
M=N=K=1024, mode=STARPU_RW|STARPU_COMMUTE
M=N=K=1024, mode=STARPU_REDUX
M=256, N=K=1532, mode=STARPU_RW|STARPU_COMMUTE
M=256, N=K=1532, mode=STARPU_REDUX
It seems that 1.3.0 performs similarly to 1.2.10.
Using STARPU_REDUX leads to another but similar error
That's not precise enough for us to be able to act :)

As soon as we find out the source of the increased timing with the 1.3.11 version, I will prepare some minimal example where CUDA map and STARPU_REDUX lead to internal StarPU assertion failures.
I believe I misjudged the plots of StarPU-1.3.0. Take a look at the rightmost graphs.
StarPU-1.2.10:
StarPU-1.3.0:
There is a gap between Eager and the other, smarter schedulers. And for StarPU-1.3.5 (below) the gap becomes even larger.
M=N=K=1024, mode=STARPU_RW|STARPU_COMMUTE
M=N=K=1024, mode=STARPU_REDUX
M=256, N=K=1532, mode=STARPU_RW|STARPU_COMMUTE
M=256, N=K=1532, mode=STARPU_REDUX
And another update
I took a look at the data transfers and the total execution time (reported by the /usr/bin/time utility) for different versions. Total amount of transferred data, sorted in descending order:
As one can see, 1.3.x indeed saves a lot of data transfers. However, the 1.4.4 version brings all those transfers back. It seems like some old behaviour was brought back by the 1.4.x release series. The main problem comes from CPU<->GPU transfers, as the 1.4.4 version transfers around 65 times more data through the slow PCI-e bus than the 1.3.11 version.
Here are the more detailed transfer reports provided by the STARPU_BUS_STATS=1 environment variable.
transfers_starpu_1.2.10_dmdar.txt
transfers_starpu_1.3.0_dmdar.txt
Using STARPU_REDUX leads to another but similar error
That's not precise enough for us to be able to act :) As soon as we find out the source of increased timing with 1.3.11 version, I will prepare some minimal example where CUDA map and STARPU_REDUX lead to internal StarPU assertion failures.
I mean: please provide the error message. "similar error" doesn't allow us to have any idea what this is about.
Also, again: Could you also post the output of starpu_machine_display obtained with the different versions, to make sure dmda gets the correct understanding of available PCI bandwidths, gpu placement etc.
Otherwise it's really not surprising that dm* etc. get everything wrong.
I don't have easy access to an 8-gpu machine, so I tried with simulation, and got results where 1.4.4 actually performs better than 1.3.11 and 1.2.10... So I really need details on how things are going on the machine where you can reproduce the issue.
Also, providing us with the .starpu/sampling/bus/ and codelet/45/ files corresponding to the machine would allow me to simulate the exact same architecture, rather than simulating some 8-gpu machine I happened to have access to at some point.
Also, again: Could you also post the output of starpu_machine_display obtained with the different versions, to make sure dmda gets the correct understanding of available PCI bandwidths, gpu placement etc.
Here are the files: starpu_machine_display_1.2.10.txt starpu_machine_display_1.3.0.txt starpu_machine_display_1.3.11.txt starpu_machine_display_1.4.4.txt
P.S. How can I help you simulate my runs? I compiled StarPU without SimGrid support. The FXT traces weigh more than 1 GB. I do not know whether giving you the contents of codelets/45/ or codelets/44/ will help. However, here are the contents of the bus/ samplings.
Here are the files:
Ok, so you have an nvswitch, which wasn't the case for the machine I was simulating; that can explain why I wasn't seeing the problem.
How can I help you simulate my runs?
By providing the information I'm asking for :)
I compiled StarPU without SimGrid support.
Simgrid is only needed for the replay part, not for the calibration part.
Traces by FXT weight more than 1 GB
We don't need traces :)
Do not know if giving you contents of codelets/45/ or codelets/44 will help
Yes, please, to be sure to have the same timings as on your machine.
there's one odd thing here compared to the others: CUDA 0 has very low bandwidth, whatever the peer. Is this reproducible when you force bus re-calibration with STARPU_BUS_CALIBRATE=1?
I double-checked. It remains the same. CUDA 0 has an 11 GB/s connection to the CPU, the others have 13-15 GB/s. With StarPU-1.3.11 the speeds are around 25 GB/s.
StarPU-1.4.4:
bandwidth (MB/s) and latency (us)...
from/to NUMA 0 CUDA 0 CUDA 1 CUDA 2 CUDA 3 CUDA 4 CUDA 5 CUDA 6 CUDA 7
NUMA 0 0 11625 14683 14671 14659 14661 14598 14587 14588
CUDA 0 11744 0 14721 14612 14629 14620 14727 14711 14707
CUDA 1 14661 13621 0 236193 241212 241259 241637 241287 241874
CUDA 2 14661 13722 243595 0 241024 243733 244475 244209 243717
CUDA 3 14684 13867 244122 243544 0 241130 243455 244064 244585
CUDA 4 14607 13908 240379 241467 246133 0 243570 243885 243641
CUDA 5 13484 15234 241671 241864 243550 247702 0 244909 244375
CUDA 6 13229 15071 241887 242582 244249 245052 247115 0 244958
CUDA 7 13528 15133 241470 241637 244368 244738 244376 247771 0
StarPU-1.3.11:
bandwidth (MB/s) and latency (us)...
from/to NUMA 0 CUDA 0 CUDA 1 CUDA 2 CUDA 3 CUDA 4 CUDA 5 CUDA 6 CUDA 7
NUMA 0 0 25081 25169 25150 25160 25094 25097 25086 25091
CUDA 0 23837 0 237628 245022 244849 243492 244425 244068 244064
CUDA 1 23837 244489 0 244506 244464 244747 244650 244639 244372
CUDA 2 23837 242112 248106 0 244695 243829 244250 245212 244295
CUDA 3 23836 241816 243238 247892 0 244343 244691 244443 244240
CUDA 4 23829 241359 243141 243036 247535 0 244676 244164 244173
CUDA 5 23908 241918 241900 243932 244365 247550 0 243878 244382
CUDA 6 23829 241531 241140 244337 244161 244080 246877 0 244022
CUDA 7 23830 242094 241616 244295 244201 243430 243710 244042 0
Looking at latencies of StarPU-1.4.4:
NUMA 0 0 0 10 9 9 10 9 9 9
CUDA 0 0 0 10 9 9 10 9 9 9
CUDA 1 12 12 0 14 14 14 14 14 14
CUDA 2 12 12 14 0 14 13 13 13 13
CUDA 3 11 12 14 13 0 13 13 13 13
CUDA 4 12 12 14 14 13 0 14 13 13
CUDA 5 12 12 13 13 12 12 0 12 12
CUDA 6 12 11 13 13 13 13 12 0 12
CUDA 7 12 11 13 13 13 13 12 12 0
StarPU thinks that CUDA 0 uses the same memory as NUMA 0... Surprise!
CUDA 0 has 11 GB/s connection to CPU, other have 13-15 GB/s
Not only that, but the gpu-gpu connections are also not getting the nvswitch speed, which is really odd.
StarPU thinks that CUDA 0 uses the same memory, as NUMA 0... Surprise!
The duplicates in the rows and columns and the 0 values in numa0/cuda0 are suspicious indeed.
It might be useful to see the config.log output in the 1.4.4 case.
StarPU thinks that CUDA 0 uses the same memory, as NUMA 0... Surprise!
The duplicates in the rows and columns and the 0 values in numa0/cuda0 are suspicious indeed.
and I can easily reproduce that here, good
(will work on it later next week, though, but at least we have a clear culprit here)
It might be useful to see the config.log output in the 1.4.4 case.
(will work on it later next week, though, but at least we have a clear culprit here)
Thank you! I will be on vacation next week, but after that I will prepare backtraces for the initially described failed assertions in StarPU-1.4.4:
P.S. Enabling CUDA memory map leads to the following error: ../../src/datawizard/copy_driver.c:312: _starpu_driver_copy_data_1_to_1: Assertion `0 && "src_replicate->memory_node == dst_replicate->mapped"' failed.
P.P.S Using STARPU_REDUX leads to another but similar error. Seems like memory manager is bugged in 1.4.4 StarPU.
By the way, StarPU-1.3.11 gave me a CUDA out-of-memory error with the STARPU_REDUX access mode. Setting STARPU_LIMIT_CUDA_MEM=60000 solved the issue. I will also try to find the situation in which it happened and create a backtrace.
Compiling StarPU-1.4.4 with the flag --enable-maxnumanodes=1 solves the issue with latencies and bandwidths, bringing the output of starpu_machine_display in line with version 1.3.11. However, the performance of the actual computations is the same as without the flag. The amount of transferred data is still large, as reported in one of the messages above.
Ok, I have pushed a fix for the bandwidth/latency management to gitlab, will appear on github by tomorrow. Now that the scheduler will have the proper values, I'll investigate the large amounts.
Ok, I have pushed a fix for the bandwidth/latency management to gitlab, will appear on github by tomorrow.
Thank you! I tried the new commit. It fixes the output of starpu_machine_display, but only partially. The throughput between CPU and GPUs remains low: it is around 14 GB/s, as it was with StarPU-1.4.4. The StarPU-1.3.11 version reaches 25 GB/s. Outputs of starpu_machine_display are attached for the starpu-1.4 and starpu-1.3.11 tags.
It looks like there is an interference between the numa memory pinning and the nvidia memory pinning. I indeed see a small difference on my testbox, that might be emphasized on your box.
Another update. This time the hardware server is different (4x Nvidia V100 SXM2). For some strange reason the CUDA workers require around 500 microseconds for any (even empty) task. Setting the environment variable STARPU_WORKERS_NOBIND=1 brings this time down to around 5 microseconds for an empty task (which is still large in my opinion), but it improves the overall performance of my application by 2x. Attached is the starpu_machine_display output for this new server (StarPU-1.3.11). Since servers with PCI-Express CUDA GPUs do not suffer from this problem, I believe the problem is within hwloc-2.9.3 around the Nvidia SXM bus.
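For context, the per-task overhead figure was obtained with a micro-benchmark along the following lines (a sketch under my assumptions, not the actual application code): submit many tasks whose CUDA implementation does nothing and divide the elapsed time by the task count.

```c
#include <starpu.h>

/* CUDA implementation that does nothing, to expose the pure runtime overhead. */
static void empty_cuda_func(void *buffers[], void *cl_arg)
{
    (void) buffers;
    (void) cl_arg;
}

static struct starpu_codelet empty_cl =
{
    .cuda_funcs = { empty_cuda_func },
    .nbuffers = 0,
};

/* Average submission+execution cost of one empty CUDA task, in microseconds. */
static double empty_task_overhead_us(int ntasks)
{
    double start = starpu_timing_now();   /* timestamp in microseconds */
    for (int i = 0; i < ntasks; i++)
        starpu_task_insert(&empty_cl, 0);
    starpu_task_wait_for_all();
    return (starpu_timing_now() - start) / ntasks;
}
```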
For some strange reason CUDA workers require around 500 microseconds for any (even empty) task. Setting environment variable STARPU_WORKERS_NOBIND=1 brings this time down to around 5 microseconds for an empty task
It seems that thread binding got broken in the 1.3 series indeed. I backported some fixes from 1.4, which should fix it (by looking at the pci bus numbers in your v100 case the gpus should be driven from numa0, not 1)
5 microseconds for an empty task (which is still large in my opinion)
The CUDA cost itself is already of that order of magnitude, unfortunately.
Since servers with PCI-express CUDA GPUs do not suffer such a problem
They probably have the same binding issue, just with much lower overhead.
It looks like there is an interference between the numa memory pinning and the nvidia memory pinning. I indeed see a small difference on my testbox, that might be emphasized on your box.
Ok, there was a typo in starpu-1.3 which didn't pose a problem there, but ended up posing a problem in 1.4, which is why it went unnoticed. This should now be fixed by "Fix missing pinning memory when benchmarking bus with numa" in 1.4 (and it is not broken in 1.3), so the bandwidth numbers should now be fine; could you please check?
Then I'll check the scheduling part
I tried the new commit in the starpu-1.3 branch and it got even worse, just like in the starpu-1.4.4 case. Take a look at starpu_1.3.11_v100_machine_display.txt
Ok, with the update it gained the need for the same fix as in 1.4 ("Fix bus performance models selection"); it should now be fixed.
New output of starpu_machine_display:
StarPU has found :
32 STARPU_CPU_WORKER workers:
CPU 0
CPU 1
CPU 2
CPU 3
CPU 4
CPU 5
CPU 6
CPU 7
CPU 8
CPU 9
CPU 10
CPU 11
CPU 12
CPU 13
CPU 14
CPU 15
CPU 16
CPU 17
CPU 18
CPU 19
CPU 20
CPU 21
CPU 22
CPU 23
CPU 24
CPU 25
CPU 26
CPU 27
CPU 28
CPU 29
CPU 30
CPU 31
4 STARPU_CUDA_WORKER workers:
CUDA 0.0 (Tesla V100-SXM2-16GB 14.2 GiB 1a:00.0)
CUDA 1.0 (Tesla V100-SXM2-16GB 14.2 GiB 1c:00.0)
CUDA 2.0 (Tesla V100-SXM2-16GB 14.2 GiB 1d:00.0)
CUDA 3.0 (Tesla V100-SXM2-16GB 14.2 GiB 1e:00.0)
No STARPU_OPENCL_WORKER worker
topology ... (hwloc logical indexes)
numa 0 pack 0 core 0 PU 0 CUDA 1.0 (Tesla V100-SXM2-16GB 14.2 GiB 1c:00.0)
core 1 PU 1 CUDA 2.0 (Tesla V100-SXM2-16GB 14.2 GiB 1d:00.0)
core 2 PU 2 CUDA 3.0 (Tesla V100-SXM2-16GB 14.2 GiB 1e:00.0)
core 3 PU 3 CPU 0
core 4 PU 4 CPU 1
core 5 PU 5 CPU 2
core 6 PU 6 CPU 3
core 7 PU 7 CPU 4
core 8 PU 8 CPU 5
core 9 PU 9 CPU 6
core 10 PU 10 CPU 7
core 11 PU 11 CPU 8
core 12 PU 12 CPU 9
core 13 PU 13 CPU 10
core 14 PU 14 CPU 11
core 15 PU 15 CPU 12
core 16 PU 16 CPU 13
core 17 PU 17 CPU 14
numa 1 pack 1 core 18 PU 18 CUDA 0.0 (Tesla V100-SXM2-16GB 14.2 GiB 1a:00.0)
core 19 PU 19 CPU 15
core 20 PU 20 CPU 16
core 21 PU 21 CPU 17
core 22 PU 22 CPU 18
core 23 PU 23 CPU 19
core 24 PU 24 CPU 20
core 25 PU 25 CPU 21
core 26 PU 26 CPU 22
core 27 PU 27 CPU 23
core 28 PU 28 CPU 24
core 29 PU 29 CPU 25
core 30 PU 30 CPU 26
core 31 PU 31 CPU 27
core 32 PU 32 CPU 28
core 33 PU 33 CPU 29
core 34 PU 34 CPU 30
core 35 PU 35 CPU 31
bandwidth (MB/s) and latency (us)...
from/to NUMA 0 CUDA 0 CUDA 1 CUDA 2 CUDA 3
NUMA 0 0 12309 12329 12334 12331
CUDA 0 13092 0 47517 47695 47688
CUDA 1 13101 47693 0 47699 47704
CUDA 2 13102 47691 47689 0 47689
CUDA 3 13101 47694 47698 47694 0
NUMA 0 0 9 9 9 9
CUDA 0 9 0 11 11 11
CUDA 1 9 11 0 11 11
CUDA 2 9 11 11 0 11
CUDA 3 9 11 11 11 0
GPU NUMA in preference order (logical index), host-to-device, device-to-host
CUDA_0 1 0
CUDA_1 0 1
CUDA_2 0 1
CUDA_3 0 1
Seems like CUDA 0.0 being on NUMA 1 is kind of strange.
I tried recompiling the starpu-1.3 and starpu-1.4 versions and found that the performance of my app on 8x Nvidia A100 with starpu-1.3 has increased, but the performance with starpu-1.4 is still very low (a 10x difference). So there is still some trouble with the starpu-1.4 tag.
The output of starpu_machine_display seems to be correct for the starpu-1.4 tag after all your fixes.
starpu_machine_display_1.4.txt
However, the dmdar scheduler sends different amounts of data (STARPU_BUS_STATS=1 output):
starpu-1.3: Total transfers: 30162.5039 GB
starpu-1.4: Total transfers: 56256.7500 GB
These additional transfers of starpu-1.4 overwhelm all the computations. How can I help you dig into this issue?
StarPU-1.3 bus stats:
#---------------------
Data transfer stats:
NUMA 0 -> CUDA 0 24.4528 GB 84.9925 MB/s (transfers : 552 - avg 45.3616 MB)
CUDA 0 -> NUMA 0 138.6356 GB 481.8670 MB/s (transfers : 3052 - avg 46.5147 MB)
NUMA 0 -> CUDA 1 39.5344 GB 137.4128 MB/s (transfers : 970 - avg 41.7353 MB)
CUDA 1 -> NUMA 0 113.3956 GB 394.1384 MB/s (transfers : 2347 - avg 49.4747 MB)
CUDA 0 -> CUDA 1 324.7757 GB 1128.8490 MB/s (transfers : 6451 - avg 51.5533 MB)
CUDA 1 -> CUDA 0 555.4949 GB 1930.7786 MB/s (transfers : 9414 - avg 60.4235 MB)
NUMA 0 -> CUDA 2 40.8971 GB 142.1495 MB/s (transfers : 738 - avg 56.7462 MB)
CUDA 2 -> NUMA 0 135.2596 GB 470.1326 MB/s (transfers : 2260 - avg 61.2858 MB)
CUDA 0 -> CUDA 2 450.5016 GB 1565.8445 MB/s (transfers : 7163 - avg 64.4023 MB)
CUDA 2 -> CUDA 0 467.1385 GB 1623.6707 MB/s (transfers : 7189 - avg 66.5391 MB)
CUDA 1 -> CUDA 2 471.9199 GB 1640.2899 MB/s (transfers : 9051 - avg 53.3914 MB)
CUDA 2 -> CUDA 1 700.8458 GB 2435.9860 MB/s (transfers : 11589 - avg 61.9265 MB)
NUMA 0 -> CUDA 3 48.2185 GB 167.5969 MB/s (transfers : 726 - avg 68.0107 MB)
CUDA 3 -> NUMA 0 202.8596 GB 705.0954 MB/s (transfers : 3059 - avg 67.9072 MB)
CUDA 0 -> CUDA 3 395.4815 GB 1374.6067 MB/s (transfers : 6455 - avg 62.7379 MB)
CUDA 3 -> CUDA 0 486.9690 GB 1692.5970 MB/s (transfers : 7590 - avg 65.6991 MB)
CUDA 1 -> CUDA 3 524.3406 GB 1822.4922 MB/s (transfers : 9740 - avg 55.1257 MB)
CUDA 3 -> CUDA 1 353.8773 GB 1229.9993 MB/s (transfers : 6892 - avg 52.5784 MB)
CUDA 2 -> CUDA 3 337.0722 GB 1171.5887 MB/s (transfers : 4886 - avg 70.6431 MB)
CUDA 3 -> CUDA 2 724.7443 GB 2519.0510 MB/s (transfers : 10971 - avg 67.6454 MB)
NUMA 0 -> CUDA 4 39.5301 GB 137.3977 MB/s (transfers : 704 - avg 57.4983 MB)
CUDA 4 -> NUMA 0 51.4072 GB 178.6799 MB/s (transfers : 1050 - avg 50.1342 MB)
CUDA 0 -> CUDA 4 378.5577 GB 1315.7827 MB/s (transfers : 6424 - avg 60.3429 MB)
CUDA 4 -> CUDA 0 313.5901 GB 1089.9698 MB/s (transfers : 5567 - avg 57.6821 MB)
CUDA 1 -> CUDA 4 484.5900 GB 1684.3274 MB/s (transfers : 9445 - avg 52.5379 MB)
CUDA 4 -> CUDA 1 296.4626 GB 1030.4381 MB/s (transfers : 6054 - avg 50.1450 MB)
CUDA 2 -> CUDA 4 525.8330 GB 1827.6790 MB/s (transfers : 7021 - avg 76.6918 MB)
CUDA 4 -> CUDA 2 307.7303 GB 1069.6022 MB/s (transfers : 5506 - avg 57.2313 MB)
CUDA 3 -> CUDA 4 522.4628 GB 1815.9648 MB/s (transfers : 7565 - avg 70.7207 MB)
CUDA 4 -> CUDA 3 709.5829 GB 2466.3525 MB/s (transfers : 10193 - avg 71.2855 MB)
NUMA 0 -> CUDA 5 36.7464 GB 127.7223 MB/s (transfers : 623 - avg 60.3986 MB)
CUDA 5 -> NUMA 0 133.4323 GB 463.7810 MB/s (transfers : 2392 - avg 57.1215 MB)
CUDA 0 -> CUDA 5 492.1053 GB 1710.4486 MB/s (transfers : 7620 - avg 66.1307 MB)
CUDA 5 -> CUDA 0 447.4872 GB 1555.3660 MB/s (transfers : 6856 - avg 66.8359 MB)
CUDA 1 -> CUDA 5 716.0729 GB 2488.9098 MB/s (transfers : 10350 - avg 70.8462 MB)
CUDA 5 -> CUDA 1 402.6089 GB 1399.3788 MB/s (transfers : 7345 - avg 56.1295 MB)
CUDA 2 -> CUDA 5 459.1772 GB 1595.9974 MB/s (transfers : 7275 - avg 64.6319 MB)
CUDA 5 -> CUDA 2 512.6324 GB 1781.7958 MB/s (transfers : 7877 - avg 66.6416 MB)
CUDA 3 -> CUDA 5 559.6148 GB 1945.0961 MB/s (transfers : 7573 - avg 75.6696 MB)
CUDA 5 -> CUDA 3 551.3154 GB 1916.2491 MB/s (transfers : 8004 - avg 70.5331 MB)
CUDA 4 -> CUDA 5 403.6525 GB 1403.0059 MB/s (transfers : 7141 - avg 57.8827 MB)
CUDA 5 -> CUDA 4 667.2792 GB 2319.3132 MB/s (transfers : 10388 - avg 65.7772 MB)
NUMA 0 -> CUDA 6 36.5536 GB 127.0522 MB/s (transfers : 550 - avg 68.0562 MB)
CUDA 6 -> NUMA 0 126.1244 GB 438.3802 MB/s (transfers : 2126 - avg 60.7485 MB)
CUDA 0 -> CUDA 6 499.6056 GB 1736.5172 MB/s (transfers : 7533 - avg 67.9140 MB)
CUDA 6 -> CUDA 0 503.7355 GB 1750.8716 MB/s (transfers : 7365 - avg 70.0374 MB)
CUDA 1 -> CUDA 6 536.4423 GB 1864.5532 MB/s (transfers : 8588 - avg 63.9633 MB)
CUDA 6 -> CUDA 1 429.3033 GB 1492.1621 MB/s (transfers : 6934 - avg 63.3987 MB)
CUDA 2 -> CUDA 6 534.6896 GB 1858.4609 MB/s (transfers : 8164 - avg 67.0654 MB)
CUDA 6 -> CUDA 2 343.1891 GB 1192.8483 MB/s (transfers : 5441 - avg 64.5884 MB)
CUDA 3 -> CUDA 6 431.0230 GB 1498.1391 MB/s (transfers : 5536 - avg 79.7268 MB)
CUDA 6 -> CUDA 3 535.2645 GB 1860.4591 MB/s (transfers : 6578 - avg 83.3249 MB)
CUDA 4 -> CUDA 6 397.4898 GB 1381.5853 MB/s (transfers : 5752 - avg 70.7631 MB)
CUDA 6 -> CUDA 4 347.6010 GB 1208.1827 MB/s (transfers : 5720 - avg 62.2279 MB)
CUDA 5 -> CUDA 6 571.9941 GB 1988.1228 MB/s (transfers : 7757 - avg 75.5088 MB)
CUDA 6 -> CUDA 5 773.4525 GB 2688.3467 MB/s (transfers : 10802 - avg 73.3212 MB)
NUMA 0 -> CUDA 7 32.4117 GB 112.6559 MB/s (transfers : 716 - avg 46.3542 MB)
CUDA 7 -> NUMA 0 101.7807 GB 353.7668 MB/s (transfers : 2028 - avg 51.3922 MB)
CUDA 0 -> CUDA 7 565.1070 GB 1964.1844 MB/s (transfers : 9144 - avg 63.2841 MB)
CUDA 7 -> CUDA 0 539.1749 GB 1874.0503 MB/s (transfers : 8681 - avg 63.6004 MB)
CUDA 1 -> CUDA 7 665.1335 GB 2311.8539 MB/s (transfers : 12870 - avg 52.9213 MB)
CUDA 7 -> CUDA 1 541.7370 GB 1882.9554 MB/s (transfers : 10455 - avg 53.0597 MB)
CUDA 2 -> CUDA 7 705.2172 GB 2451.1754 MB/s (transfers : 11433 - avg 63.1630 MB)
CUDA 7 -> CUDA 2 652.1891 GB 2266.8617 MB/s (transfers : 10446 - avg 63.9328 MB)
CUDA 3 -> CUDA 7 704.0863 GB 2447.2446 MB/s (transfers : 11128 - avg 64.7901 MB)
CUDA 7 -> CUDA 3 573.6588 GB 1993.9081 MB/s (transfers : 8631 - avg 68.0601 MB)
CUDA 4 -> CUDA 7 600.3521 GB 2086.6877 MB/s (transfers : 10220 - avg 60.1527 MB)
CUDA 7 -> CUDA 4 494.1706 GB 1717.6250 MB/s (transfers : 8540 - avg 59.2542 MB)
CUDA 5 -> CUDA 7 354.7237 GB 1232.9392 MB/s (transfers : 6475 - avg 56.0984 MB)
CUDA 7 -> CUDA 5 547.3857 GB 1902.5886 MB/s (transfers : 8801 - avg 63.6886 MB)
CUDA 6 -> CUDA 7 639.4545 GB 2222.5987 MB/s (transfers : 9652 - avg 67.8410 MB)
CUDA 7 -> CUDA 6 831.1680 GB 2888.9510 MB/s (transfers : 12008 - avg 70.8791 MB)
Total transfers: 30162.5039 GB
StarPU-1.4 bus stats:
#---------------------
Data transfer stats:
NUMA 0 -> CUDA 0 1338.7977 GB 823.5640 MB/s (transfers : 20459 - avg 67.0086 MB)
CUDA 0 -> NUMA 0 953.6840 GB 586.6606 MB/s (transfers : 16224 - avg 60.1931 MB)
NUMA 0 -> CUDA 1 2220.5159 GB 1365.9547 MB/s (transfers : 27024 - avg 84.1403 MB)
CUDA 1 -> NUMA 0 680.4276 GB 418.5663 MB/s (transfers : 10676 - avg 65.2639 MB)
CUDA 0 -> CUDA 1 384.6295 GB 236.6056 MB/s (transfers : 5589 - avg 70.4707 MB)
CUDA 1 -> CUDA 0 632.2552 GB 388.9330 MB/s (transfers : 9657 - avg 67.0425 MB)
NUMA 0 -> CUDA 2 1974.4674 GB 1214.5974 MB/s (transfers : 24295 - avg 83.2210 MB)
CUDA 2 -> NUMA 0 736.1757 GB 452.8599 MB/s (transfers : 11703 - avg 64.4146 MB)
CUDA 0 -> CUDA 2 464.1732 GB 285.5370 MB/s (transfers : 7218 - avg 65.8511 MB)
CUDA 2 -> CUDA 0 460.1991 GB 283.0924 MB/s (transfers : 6748 - avg 69.8346 MB)
CUDA 1 -> CUDA 2 378.5329 GB 232.8552 MB/s (transfers : 5487 - avg 70.6429 MB)
CUDA 2 -> CUDA 1 847.5342 GB 521.3623 MB/s (transfers : 10883 - avg 79.7459 MB)
NUMA 0 -> CUDA 3 2348.7388 GB 1444.8311 MB/s (transfers : 29448 - avg 81.6731 MB)
CUDA 3 -> NUMA 0 713.3685 GB 438.8300 MB/s (transfers : 10214 - avg 71.5184 MB)
CUDA 0 -> CUDA 3 558.1124 GB 343.3239 MB/s (transfers : 8283 - avg 68.9976 MB)
CUDA 3 -> CUDA 0 480.9552 GB 295.8605 MB/s (transfers : 6561 - avg 75.0645 MB)
CUDA 1 -> CUDA 3 386.2551 GB 237.6056 MB/s (transfers : 5257 - avg 75.2378 MB)
CUDA 3 -> CUDA 1 508.9437 GB 313.0777 MB/s (transfers : 6956 - avg 74.9221 MB)
CUDA 2 -> CUDA 3 400.0684 GB 246.1028 MB/s (transfers : 7030 - avg 58.2745 MB)
CUDA 3 -> CUDA 2 789.4544 GB 485.6344 MB/s (transfers : 10949 - avg 73.8333 MB)
NUMA 0 -> CUDA 4 2626.2900 GB 1615.5672 MB/s (transfers : 31490 - avg 85.4024 MB)
CUDA 4 -> NUMA 0 750.1584 GB 461.4614 MB/s (transfers : 8682 - avg 88.4776 MB)
CUDA 0 -> CUDA 4 658.1709 GB 404.8751 MB/s (transfers : 9663 - avg 69.7472 MB)
CUDA 4 -> CUDA 0 618.8331 GB 380.6763 MB/s (transfers : 8840 - avg 71.6838 MB)
CUDA 1 -> CUDA 4 418.9629 GB 257.7258 MB/s (transfers : 6602 - avg 64.9830 MB)
CUDA 4 -> CUDA 1 566.4614 GB 348.4598 MB/s (transfers : 8225 - avg 70.5236 MB)
CUDA 2 -> CUDA 4 494.4859 GB 304.1839 MB/s (transfers : 8537 - avg 59.3128 MB)
CUDA 4 -> CUDA 2 445.8077 GB 274.2394 MB/s (transfers : 6456 - avg 70.7105 MB)
CUDA 3 -> CUDA 4 460.8046 GB 283.4648 MB/s (transfers : 7233 - avg 65.2376 MB)
CUDA 4 -> CUDA 3 943.5418 GB 580.4215 MB/s (transfers : 13386 - avg 72.1789 MB)
NUMA 0 -> CUDA 5 122.0536 GB 75.0815 MB/s (transfers : 10087 - avg 12.3905 MB)
CUDA 5 -> NUMA 0 1185.4202 GB 729.2134 MB/s (transfers : 19643 - avg 61.7966 MB)
CUDA 0 -> CUDA 5 968.1405 GB 595.5534 MB/s (transfers : 16301 - avg 60.8169 MB)
CUDA 5 -> CUDA 0 523.2678 GB 321.8892 MB/s (transfers : 9131 - avg 58.6821 MB)
CUDA 1 -> CUDA 5 842.0186 GB 517.9693 MB/s (transfers : 13309 - avg 64.7853 MB)
CUDA 5 -> CUDA 1 472.1972 GB 290.4729 MB/s (transfers : 7634 - avg 63.3390 MB)
CUDA 2 -> CUDA 5 698.3336 GB 429.5812 MB/s (transfers : 13098 - avg 54.5956 MB)
CUDA 5 -> CUDA 2 496.5089 GB 305.4283 MB/s (transfers : 8367 - avg 60.7655 MB)
CUDA 3 -> CUDA 5 680.8994 GB 418.8565 MB/s (transfers : 10590 - avg 65.8396 MB)
CUDA 5 -> CUDA 3 735.2136 GB 452.2679 MB/s (transfers : 11432 - avg 65.8554 MB)
CUDA 4 -> CUDA 5 881.4655 GB 542.2351 MB/s (transfers : 13900 - avg 64.9367 MB)
CUDA 5 -> CUDA 4 1018.6028 GB 626.5954 MB/s (transfers : 15072 - avg 69.2044 MB)
NUMA 0 -> CUDA 6 1623.2544 GB 998.5478 MB/s (transfers : 23629 - avg 70.3463 MB)
CUDA 6 -> NUMA 0 935.3520 GB 575.3834 MB/s (transfers : 13628 - avg 70.2818 MB)
CUDA 0 -> CUDA 6 585.4160 GB 360.1197 MB/s (transfers : 9800 - avg 61.1700 MB)
CUDA 6 -> CUDA 0 512.5333 GB 315.2857 MB/s (transfers : 7867 - avg 66.7134 MB)
CUDA 1 -> CUDA 6 705.4360 GB 433.9502 MB/s (transfers : 11261 - avg 64.1476 MB)
CUDA 6 -> CUDA 1 596.2983 GB 366.8140 MB/s (transfers : 8717 - avg 70.0481 MB)
CUDA 2 -> CUDA 6 533.6901 GB 328.3004 MB/s (transfers : 7509 - avg 72.7791 MB)
CUDA 6 -> CUDA 2 576.7213 GB 354.7711 MB/s (transfers : 8479 - avg 69.6500 MB)
CUDA 3 -> CUDA 6 604.8471 GB 372.0727 MB/s (transfers : 8328 - avg 74.3712 MB)
CUDA 6 -> CUDA 3 517.1010 GB 318.0955 MB/s (transfers : 7923 - avg 66.8322 MB)
CUDA 4 -> CUDA 6 757.3788 GB 465.9029 MB/s (transfers : 11494 - avg 67.4749 MB)
CUDA 6 -> CUDA 4 561.9124 GB 345.6614 MB/s (transfers : 8608 - avg 66.8446 MB)
CUDA 5 -> CUDA 6 499.9295 GB 307.5325 MB/s (transfers : 9079 - avg 56.3859 MB)
CUDA 6 -> CUDA 5 1239.6105 GB 762.5485 MB/s (transfers : 19311 - avg 65.7325 MB)
NUMA 0 -> CUDA 7 1688.1376 GB 1038.4607 MB/s (transfers : 24379 - avg 70.9075 MB)
CUDA 7 -> NUMA 0 1012.5004 GB 622.8414 MB/s (transfers : 17842 - avg 58.1101 MB)
CUDA 0 -> CUDA 7 642.5500 GB 395.2658 MB/s (transfers : 11577 - avg 56.8343 MB)
CUDA 7 -> CUDA 0 473.7164 GB 291.4075 MB/s (transfers : 8139 - avg 59.6002 MB)
CUDA 1 -> CUDA 7 616.4796 GB 379.2284 MB/s (transfers : 10562 - avg 59.7685 MB)
CUDA 7 -> CUDA 1 514.6750 GB 316.6032 MB/s (transfers : 7356 - avg 71.6459 MB)
CUDA 2 -> CUDA 7 636.5130 GB 391.5520 MB/s (transfers : 9612 - avg 67.8100 MB)
CUDA 7 -> CUDA 2 732.0572 GB 450.3262 MB/s (transfers : 10375 - avg 72.2532 MB)
CUDA 3 -> CUDA 7 615.6310 GB 378.7064 MB/s (transfers : 9228 - avg 68.3145 MB)
CUDA 7 -> CUDA 3 608.7626 GB 374.4814 MB/s (transfers : 9402 - avg 66.3022 MB)
CUDA 4 -> CUDA 7 948.1803 GB 583.2747 MB/s (transfers : 14035 - avg 69.1797 MB)
CUDA 7 -> CUDA 4 744.2115 GB 457.8030 MB/s (transfers : 11214 - avg 67.9573 MB)
CUDA 5 -> CUDA 7 657.0583 GB 404.1905 MB/s (transfers : 11256 - avg 59.7750 MB)
CUDA 7 -> CUDA 5 880.2009 GB 541.4570 MB/s (transfers : 14233 - avg 63.3265 MB)
CUDA 6 -> CUDA 7 674.8632 GB 415.1432 MB/s (transfers : 10820 - avg 63.8687 MB)
CUDA 7 -> CUDA 6 696.8048 GB 428.6406 MB/s (transfers : 10635 - avg 67.0924 MB)
Total transfers: 56256.7500 GB
One can easily see that starpu-1.4 sends much more data between CPU and GPU through the PCI-Express bus. It seems like communication through the PCI-e bus limits the performance.
How can I help you dig into this issue?
I still need to take the time to reproduce the case in simulation. Perhaps you can post the updated sampling/bus/ directory with all the now-proper bandwidths?
Perhaps you can post the updated sampling/bus/ directory with all the now-proper bandwidths?
Sure, here it is: bus-starpu-1.4.tar.gz
Another update (sorry for the spam): setting STARPU_NCUDA=4 for starpu-1.4 solves the problem. I am getting half of the starpu-1.3 performance with 8x Nvidia A100 GPUs. I believe the reason for the bad performance is the NVSwitch. The bandwidth between two GPUs, measured when no other GPU is transferring data, is much higher than in a situation when all 8 GPUs transfer data to each other. For example: the GPU-GPU bandwidth measured by starpu_machine_display shows around 250 GB/s, but the Nvidia A100 data sheet states 600 GB/s. I believe 600 GB/s is the maximal throughput from a single GPU to the other GPUs. In the case of 8x Nvidia A100 we get less than 100 GB/s instead of the reported 250 GB/s.
Hi again! I just tested the updated starpu-1.3 branch (commit ed1956801806d6bf51bbf859f0908872b902ec04 of the GitHub repo) on a server with 4x Nvidia V100. Here is the starpu_machine_display output:
StarPU has found :
4 STARPU_CPU_WORKER workers:
CPU 0
CPU 1
CPU 2
CPU 3
4 STARPU_CUDA_WORKER workers:
CUDA 0.0 (Tesla V100-SXM2-16GB 14.2 GiB 1a:00.0)
CUDA 1.0 (Tesla V100-SXM2-16GB 14.2 GiB 1c:00.0)
CUDA 2.0 (Tesla V100-SXM2-16GB 14.2 GiB 1d:00.0)
CUDA 3.0 (Tesla V100-SXM2-16GB 14.2 GiB 1e:00.0)
No STARPU_OPENCL_WORKER worker
topology ... (hwloc logical indexes)
numa 0 pack 0 core 0 PU 0 CUDA 1.0 (Tesla V100-SXM2-16GB 14.2 GiB 1c:00.0)
core 1 PU 1 CUDA 2.0 (Tesla V100-SXM2-16GB 14.2 GiB 1d:00.0)
core 2 PU 2 CUDA 3.0 (Tesla V100-SXM2-16GB 14.2 GiB 1e:00.0)
core 3 PU 3 CPU 0
core 4 PU 4 CPU 1
core 5 PU 5 CPU 2
core 6 PU 6 CPU 3
core 7 PU 7
core 8 PU 8
core 9 PU 9
core 10 PU 10
core 11 PU 11
core 12 PU 12
core 13 PU 13
core 14 PU 14
core 15 PU 15
core 16 PU 16
core 17 PU 17
numa 1 pack 1 core 18 PU 18 CUDA 0.0 (Tesla V100-SXM2-16GB 14.2 GiB 1a:00.0)
core 19 PU 19
core 20 PU 20
core 21 PU 21
core 22 PU 22
core 23 PU 23
core 24 PU 24
core 25 PU 25
core 26 PU 26
core 27 PU 27
core 28 PU 28
core 29 PU 29
core 30 PU 30
core 31 PU 31
core 32 PU 32
core 33 PU 33
core 34 PU 34
core 35 PU 35
bandwidth (MB/s) and latency (us)...
from/to NUMA 0 CUDA 0 CUDA 1 CUDA 2 CUDA 3
NUMA 0 0 12309 12329 12332 12332
CUDA 0 13103 0 47507 47698 47698
CUDA 1 13102 47687 0 47705 47696
CUDA 2 13101 47691 47698 0 47688
CUDA 3 13102 47704 47700 47684 0
NUMA 0 0 9 9 9 9
CUDA 0 9 0 11 11 11
CUDA 1 9 11 0 11 11
CUDA 2 8 11 11 0 11
CUDA 3 9 11 11 11 0
GPU NUMA in preference order (logical index), host-to-device, device-to-host
CUDA_0 0 12309 13103 1 12327 13098
CUDA_1 0 12329 13102 1 12325 13099
CUDA_2 0 12332 13101 1 12327 13098
CUDA_3 0 12332 13102 1 12329 13098
CUDA 0 is on NUMA 1 for some reason again.
Tried commit https://github.com/starpu-runtime/starpu/commit/ed1956801806d6bf51bbf859f0908872b902ec04 of the GitHub repo on a server with 4x V100 and got an upsetting surprise for my application:
1) starpu-1.3.11 with STARPU_WORKERS_NOBIND=1 performs at 25 Tflops/s.
2) starpu-1.3.11 with STARPU_WORKERS_NOBIND=0 performs at 12 Tflops/s.
3) starpu-1.3 performs at 1.8 Tflops/s.
Everything seems to be the same; only the version of StarPU is different.
Here is the performance model file for cublasSgemm for the latest commit of the starpu-1.3 branch:
##################
# Performance Model Version
45
####################
# COMBs
# number of combinations
4
####################
# COMB_3
# number of types devices
1
####################
# DEV_0
# device type (CPU - 0, CUDA - 1, OPENCL - 2, MIC - 3, MPI_MS - 5)
1
####################
# DEV_0
# device id
0
####################
# DEV_0
# number of cores
1
##########
# number of implementations
1
#####
# Model for cuda0_impl0 (Comb3)
# number of entries
2
# sumlnx sumlnx2 sumlny sumlnxlny alpha beta n minx maxx
0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 nan nan 0 0 0
# a b c
nan nan nan
# not multiple-regression-base
0
# hash size flops mean (us) dev (us) sum sum2 n
6cac4676 423624704 1.073742e+11 7.524503e+03 8.031120e+02 1.023332e+07 7.787787e+10 1360
9a5c4e6a 12582912 2.147484e+09 4.700963e+02 2.476145e+01 5.017902e+07 2.365442e+10 106742
####################
# COMB_1
# number of types devices
1
####################
# DEV_0
# device type (CPU - 0, CUDA - 1, OPENCL - 2, MIC - 3, MPI_MS - 5)
1
####################
# DEV_0
# device id
1
####################
# DEV_0
# number of cores
1
##########
# number of implementations
1
#####
# Model for cuda1_impl0 (Comb1)
# number of entries
2
# sumlnx sumlnx2 sumlny sumlnxlny alpha beta n minx maxx
0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 nan nan 0 0 0
# a b c
nan nan nan
# not multiple-regression-base
0
# hash size flops mean (us) dev (us) sum sum2 n
6cac4676 423624704 1.073742e+11 9.384886e+03 1.617014e+03 1.745589e+06 1.686849e+10 186
9a5c4e6a 12582912 2.147484e+09 2.045799e+03 4.604698e+02 1.378459e+07 2.962918e+10 6738
####################
# COMB_2
# number of types devices
1
####################
# DEV_0
# device type (CPU - 0, CUDA - 1, OPENCL - 2, MIC - 3, MPI_MS - 5)
1
####################
# DEV_0
# device id
2
####################
# DEV_0
# number of cores
1
##########
# number of implementations
1
#####
# Model for cuda2_impl0 (Comb2)
# number of entries
2
# sumlnx sumlnx2 sumlny sumlnxlny alpha beta n minx maxx
0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 nan nan 0 0 0
# a b c
nan nan nan
# not multiple-regression-base
0
# hash size flops mean (us) dev (us) sum sum2 n
6cac4676 423624704 1.073742e+11 1.011996e+04 1.480795e+03 1.629313e+06 1.684161e+10 161
9a5c4e6a 12582912 2.147484e+09 2.054684e+03 4.624059e+02 1.328148e+07 2.867137e+10 6464
####################
# COMB_0
# number of types devices
1
####################
# DEV_0
# device type (CPU - 0, CUDA - 1, OPENCL - 2, MIC - 3, MPI_MS - 5)
1
####################
# DEV_0
# device id
3
####################
# DEV_0
# number of cores
1
##########
# number of implementations
1
#####
# Model for cuda3_impl0 (Comb0)
# number of entries
2
# sumlnx sumlnx2 sumlny sumlnxlny alpha beta n minx maxx
0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 nan nan 0 0 0
# a b c
nan nan nan
# not multiple-regression-base
0
# hash size flops mean (us) dev (us) sum sum2 n
6cac4676 423624704 1.073742e+11 9.142048e+03 1.613087e+03 3.144865e+06 2.964561e+10 344
9a5c4e6a 12582912 2.147484e+09 2.102459e+03 4.772954e+02 1.144789e+07 2.530914e+10 5445
The Gemm performance of cuBLAS is around 5 Tflops/s for CUDA 0 and around 1 Tflops/s for the other CUDA devices. It should be around 14 Tflops/s.
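For reference, these figures follow directly from the entries above (the mean column is in microseconds): on cuda0 the 9a5c4e6a entry gives 2.147484e+09 flops / 4.700963e+02 us ≈ 4.6 Tflop/s, while the same entry on cuda1 gives 2.147484e+09 flops / 2.045799e+03 us ≈ 1.0 Tflop/s.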
For reference, the performance model for STARPU_NCUDA=1 with starpu-1.3.11:
##################
# Performance Model Version
45
####################
# COMBs
# number of combinations
1
####################
# COMB_0
# number of types devices
1
####################
# DEV_0
# device type (CPU - 0, CUDA - 1, OPENCL - 2, MIC - 3, MPI_MS - 5)
1
####################
# DEV_0
# device id
0
####################
# DEV_0
# number of cores
1
##########
# number of implementations
1
#####
# Model for cuda0_impl0 (Comb0)
# number of entries
3
# sumlnx sumlnx2 sumlny sumlnxlny alpha beta n minx maxx
0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 nan nan 0 0 0
# a b c
nan nan nan
# not multiple-regression-base
0
# hash size flops mean (us) dev (us) sum sum2 n
d295bde2 1065353216 4.294967e+11 2.963925e+04 8.908781e+02 2.934285e+06 8.704859e+10 99
ca98c721 100663296 3.435974e+10 2.801510e+03 4.595160e+02 1.834989e+06 5.279048e+09 655
2465b5fe 37748736 8.589935e+09 5.997161e+02 3.460896e+01 1.583850e+06 9.530240e+08 2641
Sorry for another round of messages. I tried recompiling the starpu-1.3 tag from scratch just to double-check. This time the output of starpu_machine_display is much better:
StarPU has found :
32 STARPU_CPU_WORKER workers:
CPU 0
CPU 1
CPU 2
CPU 3
CPU 4
CPU 5
CPU 6
CPU 7
CPU 8
CPU 9
CPU 10
CPU 11
CPU 12
CPU 13
CPU 14
CPU 15
CPU 16
CPU 17
CPU 18
CPU 19
CPU 20
CPU 21
CPU 22
CPU 23
CPU 24
CPU 25
CPU 26
CPU 27
CPU 28
CPU 29
CPU 30
CPU 31
4 STARPU_CUDA_WORKER workers:
CUDA 0.0 (Tesla V100-SXM2-16GB 14.2 GiB 1a:00.0)
CUDA 1.0 (Tesla V100-SXM2-16GB 14.2 GiB 1c:00.0)
CUDA 2.0 (Tesla V100-SXM2-16GB 14.2 GiB 1d:00.0)
CUDA 3.0 (Tesla V100-SXM2-16GB 14.2 GiB 1e:00.0)
No STARPU_OPENCL_WORKER worker
topology ... (hwloc logical indexes)
numa 0 pack 0 core 0 PU 0 CUDA 1.0 (Tesla V100-SXM2-16GB 14.2 GiB 1c:00.0)
core 1 PU 1 CUDA 3.0 (Tesla V100-SXM2-16GB 14.2 GiB 1e:00.0)
core 2 PU 2 CPU 0
core 3 PU 3 CPU 1
core 4 PU 4 CPU 2
core 5 PU 5 CPU 3
core 6 PU 6 CPU 4
core 7 PU 7 CPU 5
core 8 PU 8 CPU 6
core 9 PU 9 CPU 7
core 10 PU 10 CPU 8
core 11 PU 11 CPU 9
core 12 PU 12 CPU 10
core 13 PU 13 CPU 11
core 14 PU 14 CPU 12
core 15 PU 15 CPU 13
core 16 PU 16 CPU 14
core 17 PU 17 CPU 15
numa 1 pack 1 core 18 PU 18 CUDA 0.0 (Tesla V100-SXM2-16GB 14.2 GiB 1a:00.0)
core 19 PU 19 CUDA 2.0 (Tesla V100-SXM2-16GB 14.2 GiB 1d:00.0)
core 20 PU 20 CPU 16
core 21 PU 21 CPU 17
core 22 PU 22 CPU 18
core 23 PU 23 CPU 19
core 24 PU 24 CPU 20
core 25 PU 25 CPU 21
core 26 PU 26 CPU 22
core 27 PU 27 CPU 23
core 28 PU 28 CPU 24
core 29 PU 29 CPU 25
core 30 PU 30 CPU 26
core 31 PU 31 CPU 27
core 32 PU 32 CPU 28
core 33 PU 33 CPU 29
core 34 PU 34 CPU 30
core 35 PU 35 CPU 31
bandwidth (MB/s) and latency (us)...
from/to NUMA 0 CUDA 0 CUDA 1 CUDA 2 CUDA 3
NUMA 0 0 12312 12329 12333 12334
CUDA 0 13094 0 47503 47708 47711
CUDA 1 13103 47714 0 47713 47721
CUDA 2 13097 47720 47705 0 47704
CUDA 3 13103 47715 47715 47700 0
NUMA 0 0 9 9 9 9
CUDA 0 8 0 11 11 11
CUDA 1 8 11 0 11 11
CUDA 2 8 11 11 0 11
CUDA 3 8 11 11 11 0
GPU NUMA in preference order (logical index), host-to-device, device-to-host
CUDA_0 1 0
CUDA_1 0 1
CUDA_2 1 0
CUDA_3 0 1
It is still strange to me why CUDA 0 is attached to NUMA 1, but it seems like it is just an enumeration problem. Not a big deal. Running my application with the newly recompiled StarPU-1.3 gives me 25 Tflops/s, as before. That is still far from a perfect value (40 Tflops/s is what I aim for, as a single V100 gives me 10 Tflops/s), but at least cublasSgemm works fast now.
Comparison of data transfers between starpu-1.3 and starpu-1.4 for the same application with the dm scheduler:
starpu-1.3:
Training performance: 23.6482709767857 Tflops/s
Loss on the last batch: 9.039822578430176
Shutdown cuBLAS
#---------------------
Data transfer stats:
NUMA 0 -> CUDA 0 3.4629 GB 16.9099 MB/s (transfers : 556 - avg 6.3778 MB)
CUDA 0 -> NUMA 0 18.0936 GB 88.3529 MB/s (transfers : 870 - avg 21.2964 MB)
NUMA 0 -> CUDA 1 1.9955 GB 9.7442 MB/s (transfers : 514 - avg 3.9755 MB)
CUDA 1 -> NUMA 0 13.7536 GB 67.1600 MB/s (transfers : 598 - avg 23.5513 MB)
CUDA 0 -> CUDA 1 974.8757 GB 4760.4133 MB/s (transfers : 48106 - avg 20.7515 MB)
CUDA 1 -> CUDA 0 939.0652 GB 4585.5472 MB/s (transfers : 48247 - avg 19.9308 MB)
NUMA 0 -> CUDA 2 1.7876 GB 8.7289 MB/s (transfers : 413 - avg 4.4321 MB)
CUDA 2 -> NUMA 0 11.6667 GB 56.9697 MB/s (transfers : 420 - avg 28.4445 MB)
CUDA 0 -> CUDA 2 1263.5773 GB 6170.1702 MB/s (transfers : 68762 - avg 18.8171 MB)
CUDA 2 -> CUDA 0 1001.8527 GB 4892.1436 MB/s (transfers : 54590 - avg 18.7928 MB)
CUDA 1 -> CUDA 2 1141.3624 GB 5573.3827 MB/s (transfers : 49602 - avg 23.5627 MB)
CUDA 2 -> CUDA 1 1128.6001 GB 5511.0627 MB/s (transfers : 51073 - avg 22.6281 MB)
NUMA 0 -> CUDA 3 1.5092 GB 7.3698 MB/s (transfers : 505 - avg 3.0603 MB)
CUDA 3 -> NUMA 0 4.6699 GB 22.8037 MB/s (transfers : 220 - avg 21.7364 MB)
CUDA 0 -> CUDA 3 882.4919 GB 4309.2919 MB/s (transfers : 49213 - avg 18.3625 MB)
CUDA 3 -> CUDA 0 1185.3740 GB 5788.2939 MB/s (transfers : 63144 - avg 19.2231 MB)
CUDA 1 -> CUDA 3 1354.6561 GB 6614.9140 MB/s (transfers : 63505 - avg 21.8434 MB)
CUDA 3 -> CUDA 1 1395.1045 GB 6812.4267 MB/s (transfers : 60684 - avg 23.5414 MB)
CUDA 2 -> CUDA 3 1008.7847 GB 4925.9904 MB/s (transfers : 52911 - avg 19.5233 MB)
CUDA 3 -> CUDA 2 890.3217 GB 4347.5241 MB/s (transfers : 47680 - avg 19.1210 MB)
Total transfers: 13223.0059 GB
#---------------------
starpu-1.4:
Training performance: 18.760980099682563 Tflops/s
Loss on the last batch: 9.000470161437988
Shutdown cuBLAS
#---------------------
Data transfer stats:
NUMA 0 -> CUDA 0 137.9982 GB 547.6573 MB/s (transfers : 30791 - avg 4.5893 MB)
CUDA 0 -> NUMA 0 13.4095 GB 53.2167 MB/s (transfers : 2869 - avg 4.7861 MB)
NUMA 0 -> CUDA 1 145.6027 GB 577.8361 MB/s (transfers : 22479 - avg 6.6327 MB)
CUDA 1 -> NUMA 0 12.5233 GB 49.6997 MB/s (transfers : 1732 - avg 7.4041 MB)
CUDA 0 -> CUDA 1 977.9714 GB 3881.1594 MB/s (transfers : 56959 - avg 17.5818 MB)
CUDA 1 -> CUDA 0 956.9282 GB 3797.6476 MB/s (transfers : 56093 - avg 17.4691 MB)
NUMA 0 -> CUDA 2 145.2166 GB 576.3039 MB/s (transfers : 19100 - avg 7.7854 MB)
CUDA 2 -> NUMA 0 7.8284 GB 31.0676 MB/s (transfers : 1835 - avg 4.3685 MB)
CUDA 0 -> CUDA 2 1205.8030 GB 4785.3267 MB/s (transfers : 62539 - avg 19.7436 MB)
CUDA 2 -> CUDA 0 1158.9956 GB 4599.5677 MB/s (transfers : 58094 - avg 20.4292 MB)
CUDA 1 -> CUDA 2 914.6765 GB 3629.9675 MB/s (transfers : 51454 - avg 18.2032 MB)
CUDA 2 -> CUDA 1 1001.9981 GB 3976.5101 MB/s (transfers : 48166 - avg 21.3023 MB)
NUMA 0 -> CUDA 3 150.3286 GB 596.5910 MB/s (transfers : 22012 - avg 6.9933 MB)
CUDA 3 -> NUMA 0 12.7629 GB 50.6505 MB/s (transfers : 2163 - avg 6.0422 MB)
CUDA 0 -> CUDA 3 987.5371 GB 3919.1200 MB/s (transfers : 56623 - avg 17.8591 MB)
CUDA 3 -> CUDA 0 981.4284 GB 3894.8770 MB/s (transfers : 52884 - avg 19.0035 MB)
CUDA 1 -> CUDA 3 1232.1252 GB 4889.7872 MB/s (transfers : 70597 - avg 17.8718 MB)
CUDA 3 -> CUDA 1 1272.3649 GB 5049.4811 MB/s (transfers : 68597 - avg 18.9936 MB)
CUDA 2 -> CUDA 3 1021.7541 GB 4054.9122 MB/s (transfers : 47532 - avg 22.0120 MB)
CUDA 3 -> CUDA 2 1045.3053 GB 4148.3769 MB/s (transfers : 51153 - avg 20.9253 MB)
Total transfers: 13382.5576 GB
#---------------------
The total amounts of data transferred are nearly the same, but StarPU-1.4 clearly sends much more data from NUMA 0 to the CUDA devices compared to StarPU-1.3. Transfers from CUDA to NUMA are also increased. This influences the overall application performance a lot.
Looking at the details of the platform xml file, I see that the nvswitch is not detected. Do you have libnvidia-ml detected? That shows up in the ./configure output as:
checking whether nvidia-ml should be used... yes
I however also need to add a small piece of code to make it known to the perfmodel. In the meantime, you can try to make _starpu_cuda_direct_link always return 1. Otherwise starpu 1.4 thinks the transfers go through the PCI buses (starpu 1.3 doesn't care).
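A sketch of that workaround, purely as an illustration: the exact signature and location of the function in the StarPU 1.4 sources may differ, so the parameter names below are assumptions.

```c
/* Illustrative workaround sketch (not the actual StarPU source): make the
 * internal helper unconditionally report a direct GPU-GPU link, so that
 * StarPU 1.4 does not assume transfers go through the PCI buses. */
static int _starpu_cuda_direct_link(unsigned src_devid, unsigned dst_devid)
{
    (void) src_devid;
    (void) dst_devid;
    return 1;
}
```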
The "CUDA: Also detect NVSwitch when checking the number of gpus sharing a bus" commit should be doing it.
Comparison of data transfers for starpu-1.3 and starpu-1.4 for the same application with the dm scheduler:
The fix mentioned above can also fix that case, because we use the performance prediction for selecting the source node for transfers in _starpu_select_src_node, not only in the scheduler for task placement.
It is still strange for me why CUDA 0 is attached to NUMA 1
Before starpu 1.4, we were just using the observed bandwidth to decide where to place the thread driving the gpu, so it might happen that with (mis-)luck, CUDA0 happens to get just a bit more bandwidth from NUMA1.
Starting from starpu 1.4 we use the hwloc information, which is much more stable :)
Nvidia A100 data sheet states 600GB/s
Do you know if there is a programmatic way to get this figure? (other than just measuring by starting transfers from all ends)
Nvidia A100 data sheet states 600GB/s
Do you know if there is a programmatic way to get this figure? (other than just measuring by starting transfers from all ends)
Ah, sorry, you meant the GPU bandwidth itself. I was thinking about the NVSwitch:
In a case of 8x Nvidia A100 we get less than 100GB/s instead of reported 250GB/s.
Do you mean that the total internal bandwidth of the NVSwitch doesn't allow a full 250GB/s for each GPU? Ideally that's the bandwidth I'd like to get access to. Possibly we'll just resort to measuring it.
Looking at the detail of the platform xml file, I see that the nvswitch is not detected, do you have libnvidia-ml detected? That shows up in the ./configure output as:
Turning off STARPU_SILENT showed me
[starpu][_starpu_init_cuda_config] Warning: could not find location of CUDA0, do you have the hwloc CUDA plugin installed?
[starpu][_starpu_init_cuda_config] Warning: could not find location of CUDA1, do you have the hwloc CUDA plugin installed?
[starpu][_starpu_init_cuda_config] Warning: could not find location of CUDA2, do you have the hwloc CUDA plugin installed?
[starpu][_starpu_init_cuda_config] Warning: could not find location of CUDA3, do you have the hwloc CUDA plugin installed?
And during configuration:
NVML found and can be compiled, but compiled application can not be run, you are probably on a machine without the CUDA driver
configure: WARNING: nvidia-ml could not be found. This will prevent from correct understanding of the machine topology.
checking whether nvidia-ml should be used... no
I can clearly see that the library is present at /usr/lib64, but somehow it is not used.
Could you post the whole config.log?
Surely! config.log
I am using a cluster with SLURM. So I configure and compile on an access node, which lacks CUDA devices. That is probably the reason why nvidia-ml is marked as not found: it is found at first, and can even be used for compilation, but, according to config.log, no CUDA device is found and therefore libnvidia-ml is discarded.
I am using a cluster with SLURM
It explains why recompiling, on an access node, the same code that was previously compiled on a compute node gave totally different results (in one of the posts above).
Seems like I have to compile all the prerequisites (fxt and hwloc) on compute nodes to get it to work correctly.
Another issue is that I have the conda python package manager installed and the configure script finds hwloc-topo among the conda files, which is incorrect. I compiled hwloc with hwloc-calc and somehow configure does not find it. Is there a way to point it to the correct hwloc-calc?
The Issue
On a GPU node, when switching from StarPU version 1.3.11 to the 1.4 versions, we experience a strange performance drop. For our new software NNTile it results in a 10x performance drop. Yes, it goes from 100% down to only 10%.
An attempt to switch to the master branch (commit 50cf74508 in the Inria gitlab repository) leads to different errors, related to data transfers between CPU and GPU. We tried some other commits from the master branch and realized that they only work with CPUs, and something strange happens with the memory manager when it comes to GPU nodes. The DARTS scheduler always fails, while the DM and DMDA schedulers fail for some commits (e.g., 50cf74508) and work correctly for other commits (e.g., 2b8a91fe). I cannot present the output of the master branch experiments right now, as this current issue is about the performance degradation of the 1.4 series of StarPU releases.
Although the 10x performance drop happens with our new software, I prepared a simple example that shows the performance for versions 1.2.10, 1.3.11 and 1.4.4. Most of the performance drop for the simple example happened when switching from version 1.2.10 to 1.3.11.
Steps to reproduce
I have implemented a simple test https://github.com/Muxas/starpu_gemm_redux to reproduce the issue. The repo simply implements several chains of matrix multiplications:
for i in range from 0 to D-1, which can be simply described with the following C code (the first order of task submissions):
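The exact code did not survive here, so below is a minimal sketch of what the first submission order presumably looks like, assuming a gemm_cl codelet wrapping cublasSgemm and pre-registered starpu_data_handle_t arrays A, B and C (all names are illustrative; the actual code is in the linked repository):

```c
/* Sketch of the first submission order: finish chain i before starting
 * chain i+1. gemm_cl, A, B, C, D and NB are illustrative names, not
 * necessarily those used in https://github.com/Muxas/starpu_gemm_redux.
 * Depending on the setup, C[i] is accessed either with STARPU_REDUX or
 * with STARPU_RW | STARPU_COMMUTE. */
for (int i = 0; i < D; i++)
    for (int j = 0; j < NB; j++)
        starpu_task_insert(&gemm_cl,
                           STARPU_R, A[i][j],
                           STARPU_R, B[i][j],
                           STARPU_REDUX, C[i],
                           0);
```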
or with the following C code (the other order of task submissions):
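A sketch of the other submission order, under the same assumptions: the chains are interleaved so that consecutive tasks accumulate into different C[i] handles.

```c
/* Sketch of the other submission order: advance every chain by one
 * accumulation step before moving on to the next step. */
for (int j = 0; j < NB; j++)
    for (int i = 0; i < D; i++)
        starpu_task_insert(&gemm_cl,
                           STARPU_R, A[i][j],
                           STARPU_R, B[i][j],
                           STARPU_REDUX, C[i],
                           0);
```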
Matrices A are of size M-by-K, matrices B are of size K-by-N, and matrices C are of size M-by-N. No transpositions in the matrix multiplications.

Our results are produced on an HGX node with 8 (eight) Nvidia A100 80GB SXM GPUs. We compiled the code and ran two experimental setups:
M=N=K=1024, D=32, NB=100, R=50, with and without the STARPU_REDUX access mode for the matrices C.
M=256, N=K=1532, D=32, NB=100, R=50, with and without the STARPU_REDUX access mode for the matrices C.
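For the STARPU_REDUX setups, each C[i] handle additionally needs reduction methods registered before submission; a minimal sketch, where redux_cl (accumulation of two per-worker copies) and init_cl (zero-initialization of a fresh copy) are illustrative codelet names rather than the ones used in the repository:

```c
/* Hypothetical sketch: register the reduction methods required by
 * STARPU_REDUX accesses on every C[i] handle. */
for (int i = 0; i < D; i++)
    starpu_data_set_reduction_methods(C[i], &redux_cl, &init_cl);
```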
StarPU-1.4.4 behavior
This section presents plots for the StarPU-1.4.4 version. The first plot shows the warmup time (done with the first order of task submission), the time for the first order of task submission, and the time for the other order of task submission, with the STARPU_RW|STARPU_COMMUTE access mode for the matrices C and M=N=K=1024:

The second plot shows the same timings but for the STARPU_REDUX access mode for the matrices C:

The third plot shows timings for M=256 and N=K=1532 with the STARPU_RW|STARPU_COMMUTE access mode:

And the last plot in this section (for the STARPU_REDUX access mode):

We see that the most dumb scheduling algorithm, namely eager, outperforms the smarter ones.

StarPU-1.3.11 behavior
This section presents plots for StarPU of version 1.3.11 in the same order as above.
We see that the most dumb scheduling algorithm, namely eager, outperforms the smarter ones.

StarPU-1.2.10 behavior
This section presents plots for StarPU of version 1.2.10 in the same order as above.
Here we see that, in the case of the STARPU_RW|STARPU_COMMUTE access mode, the smart schedulers DMDA and DMDAR perform nearly perfectly, just like EAGER. The problem with DMDA and DMDAR appears when switching to StarPU version 1.3.11 or 1.4.4.
Configuration
The configure line we used is within the config.log files in the section below.
Configuration result
This is a config file for StarPU-1.2.10: config-1.2.10.log
This is a config file for StarPU-1.3.11: config-1.3.11.log
This is a config file for StarPU-1.4.4: config-1.4.4.log
Distribution
Inria Gitlab repository
Version of StarPU
We used the starpu-1.3.11 and starpu-1.4.4 tags of the Inria GitLab repository.
Version of GPU drivers
We use CUDA 12.3, hwloc 2.9.3