[Closed] Ruberik closed this issue 2 years ago
Since I can't figure out how to see the files I tried to attach, I've pasted the output here.
Note that anything that says "Buffers:" can be ignored: it's only there to consume the output data and make sure our various calls don't somehow get optimized out.
10/26/2021 12:35:50 PM Time to allocate and initialize buffers on accelerator #0: 2.1168s
10/26/2021 12:35:50 PM Time to allocate and initialize buffers on accelerator #1: 0.0433s
10/26/2021 12:35:50 PM Time to allocate and initialize buffers on accelerator #2: 0.0440s
10/26/2021 12:35:50 PM Time to allocate and initialize buffers on accelerator #3: 0.0465s
10/26/2021 12:35:50 PM Time to allocate and initialize buffers on accelerator #4: 0.0480s
10/26/2021 12:35:50 PM Time to allocate and initialize buffers on accelerator #5: 0.0482s
10/26/2021 12:35:50 PM Time to allocate and initialize buffers on accelerator #6: 0.0471s
10/26/2021 12:35:50 PM Time to allocate and initialize buffers on accelerator #7: 0.0464s
Buffers:
0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 1 1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 2 2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 3 3 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 4 4 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 5 5 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 6 6 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 7 7
Reporting Access:
False True False False True True True False
True False True True False False False True
False True False True False True False True
False True True False True False False True
True False False True False True True False
True False True False True False True False
True False False False True True False True
False True True True False False True False
Enabling Access:
0 0 True
0 1 True
0 2 True
0 3 True
0 4 True
0 5 True
0 6 True
0 7 True
1 0 True
1 1 True
1 2 True
1 3 True
1 4 True
1 5 True
1 6 True
1 7 True
2 0 True
2 1 True
2 2 True
2 3 True
2 4 True
2 5 True
2 6 True
2 7 True
3 0 True
3 1 True
3 2 True
3 3 True
3 4 True
3 5 True
3 6 True
3 7 True
4 0 True
4 1 True
4 2 True
4 3 True
4 4 True
4 5 True
4 6 True
4 7 True
5 0 True
5 1 True
5 2 True
5 3 True
5 4 True
5 5 True
5 6 True
5 7 True
6 0 True
6 1 True
6 2 True
6 3 True
6 4 True
6 5 True
6 6 True
6 7 True
7 0 True
7 1 True
7 2 True
7 3 True
7 4 True
7 5 True
7 6 True
7 7 True
Reporting Access:
False True False False True True True False
True False True True False False False True
False True False True False True False True
False True True False True False False True
True False False True False True True False
True False True False True False True False
True False False False True True False True
False True True True False False True False
testing copy:
Rate 0->1: 3,941,520,148 bytes / second
Rate 0->2: 9,902,384,789 bytes / second
Rate 0->3: 9,884,859,033 bytes / second
Rate 0->4: 10,641,553,803 bytes / second
Rate 0->5: 10,639,075,942 bytes / second
Rate 0->6: 8,126,724,309 bytes / second
Rate 0->7: 10,641,189,959 bytes / second
Rate 1->0: 12,568,195,919 bytes / second
Rate 1->2: 10,648,851,740 bytes / second
Rate 1->3: 10,645,267,482 bytes / second
Rate 1->4: 10,636,978,570 bytes / second
Rate 1->5: 10,623,786,223 bytes / second
Rate 1->6: 10,665,590,819 bytes / second
Rate 1->7: 8,131,401,603 bytes / second
Rate 2->0: 12,600,503,484 bytes / second
Rate 2->1: 10,639,523,980 bytes / second
Rate 2->3: 8,128,499,201 bytes / second
Rate 2->4: 10,612,010,802 bytes / second
Rate 2->5: 10,643,990,612 bytes / second
Rate 2->6: 10,642,782,617 bytes / second
Rate 2->7: 10,648,460,997 bytes / second
Rate 3->0: 10,652,248,190 bytes / second
Rate 3->1: 10,636,056,620 bytes / second
Rate 3->2: 8,106,094,765 bytes / second
Rate 3->4: 10,658,746,036 bytes / second
Rate 3->5: 10,638,385,509 bytes / second
Rate 3->6: 10,632,175,649 bytes / second
Rate 3->7: 10,649,691,407 bytes / second
Rate 4->0: 12,602,736,696 bytes / second
Rate 4->1: 10,633,491,808 bytes / second
Rate 4->2: 10,641,986,228 bytes / second
Rate 4->3: 10,646,871,918 bytes / second
Rate 4->5: 8,123,173,777 bytes / second
Rate 4->6: 9,891,456,379 bytes / second
Rate 4->7: 10,615,819,337 bytes / second
Rate 5->0: 10,665,908,655 bytes / second
Rate 5->1: 10,631,391,374 bytes / second
Rate 5->2: 10,647,114,736 bytes / second
Rate 5->3: 10,634,655,563 bytes / second
Rate 5->4: 8,127,311,752 bytes / second
Rate 5->6: 10,626,362,129 bytes / second
Rate 5->7: 10,641,042,320 bytes / second
Rate 6->0: 9,226,817,516 bytes / second
Rate 6->1: 10,625,100,305 bytes / second
Rate 6->2: 10,642,260,468 bytes / second
Rate 6->3: 10,647,262,544 bytes / second
Rate 6->4: 10,640,341,088 bytes / second
Rate 6->5: 10,644,169,989 bytes / second
Rate 6->7: 10,650,018,860 bytes / second
Rate 7->0: 11,536,673,190 bytes / second
Rate 7->1: 8,133,267,868 bytes / second
Rate 7->2: 10,628,144,968 bytes / second
Rate 7->3: 10,639,740,106 bytes / second
Rate 7->4: 10,634,892,558 bytes / second
Rate 7->5: 10,644,497,103 bytes / second
Rate 7->6: 10,649,147,457 bytes / second
Buffers:
0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7
0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7
0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7
0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7
0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7
0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7
0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7
0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7
[Y:\test\simpleP2P.exe] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 8
Checking GPU(s) for support of peer to peer memory access...
> Peer access from Tesla V100-SXM2-32GB (GPU0) -> Tesla V100-SXM2-32GB (GPU1) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU0) -> Tesla V100-SXM2-32GB (GPU2) : No
> Peer access from Tesla V100-SXM2-32GB (GPU0) -> Tesla V100-SXM2-32GB (GPU3) : No
> Peer access from Tesla V100-SXM2-32GB (GPU0) -> Tesla V100-SXM2-32GB (GPU4) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU0) -> Tesla V100-SXM2-32GB (GPU5) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU0) -> Tesla V100-SXM2-32GB (GPU6) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU0) -> Tesla V100-SXM2-32GB (GPU7) : No
> Peer access from Tesla V100-SXM2-32GB (GPU1) -> Tesla V100-SXM2-32GB (GPU0) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU1) -> Tesla V100-SXM2-32GB (GPU2) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU1) -> Tesla V100-SXM2-32GB (GPU3) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU1) -> Tesla V100-SXM2-32GB (GPU4) : No
> Peer access from Tesla V100-SXM2-32GB (GPU1) -> Tesla V100-SXM2-32GB (GPU5) : No
> Peer access from Tesla V100-SXM2-32GB (GPU1) -> Tesla V100-SXM2-32GB (GPU6) : No
> Peer access from Tesla V100-SXM2-32GB (GPU1) -> Tesla V100-SXM2-32GB (GPU7) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU2) -> Tesla V100-SXM2-32GB (GPU0) : No
> Peer access from Tesla V100-SXM2-32GB (GPU2) -> Tesla V100-SXM2-32GB (GPU1) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU2) -> Tesla V100-SXM2-32GB (GPU3) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU2) -> Tesla V100-SXM2-32GB (GPU4) : No
> Peer access from Tesla V100-SXM2-32GB (GPU2) -> Tesla V100-SXM2-32GB (GPU5) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU2) -> Tesla V100-SXM2-32GB (GPU6) : No
> Peer access from Tesla V100-SXM2-32GB (GPU2) -> Tesla V100-SXM2-32GB (GPU7) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU3) -> Tesla V100-SXM2-32GB (GPU0) : No
> Peer access from Tesla V100-SXM2-32GB (GPU3) -> Tesla V100-SXM2-32GB (GPU1) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU3) -> Tesla V100-SXM2-32GB (GPU2) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU3) -> Tesla V100-SXM2-32GB (GPU4) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU3) -> Tesla V100-SXM2-32GB (GPU5) : No
> Peer access from Tesla V100-SXM2-32GB (GPU3) -> Tesla V100-SXM2-32GB (GPU6) : No
> Peer access from Tesla V100-SXM2-32GB (GPU3) -> Tesla V100-SXM2-32GB (GPU7) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU4) -> Tesla V100-SXM2-32GB (GPU0) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU4) -> Tesla V100-SXM2-32GB (GPU1) : No
> Peer access from Tesla V100-SXM2-32GB (GPU4) -> Tesla V100-SXM2-32GB (GPU2) : No
> Peer access from Tesla V100-SXM2-32GB (GPU4) -> Tesla V100-SXM2-32GB (GPU3) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU4) -> Tesla V100-SXM2-32GB (GPU5) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU4) -> Tesla V100-SXM2-32GB (GPU6) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU4) -> Tesla V100-SXM2-32GB (GPU7) : No
> Peer access from Tesla V100-SXM2-32GB (GPU5) -> Tesla V100-SXM2-32GB (GPU0) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU5) -> Tesla V100-SXM2-32GB (GPU1) : No
> Peer access from Tesla V100-SXM2-32GB (GPU5) -> Tesla V100-SXM2-32GB (GPU2) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU5) -> Tesla V100-SXM2-32GB (GPU3) : No
> Peer access from Tesla V100-SXM2-32GB (GPU5) -> Tesla V100-SXM2-32GB (GPU4) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU5) -> Tesla V100-SXM2-32GB (GPU6) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU5) -> Tesla V100-SXM2-32GB (GPU7) : No
> Peer access from Tesla V100-SXM2-32GB (GPU6) -> Tesla V100-SXM2-32GB (GPU0) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU6) -> Tesla V100-SXM2-32GB (GPU1) : No
> Peer access from Tesla V100-SXM2-32GB (GPU6) -> Tesla V100-SXM2-32GB (GPU2) : No
> Peer access from Tesla V100-SXM2-32GB (GPU6) -> Tesla V100-SXM2-32GB (GPU3) : No
> Peer access from Tesla V100-SXM2-32GB (GPU6) -> Tesla V100-SXM2-32GB (GPU4) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU6) -> Tesla V100-SXM2-32GB (GPU5) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU6) -> Tesla V100-SXM2-32GB (GPU7) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU7) -> Tesla V100-SXM2-32GB (GPU0) : No
> Peer access from Tesla V100-SXM2-32GB (GPU7) -> Tesla V100-SXM2-32GB (GPU1) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU7) -> Tesla V100-SXM2-32GB (GPU2) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU7) -> Tesla V100-SXM2-32GB (GPU3) : Yes
> Peer access from Tesla V100-SXM2-32GB (GPU7) -> Tesla V100-SXM2-32GB (GPU4) : No
> Peer access from Tesla V100-SXM2-32GB (GPU7) -> Tesla V100-SXM2-32GB (GPU5) : No
> Peer access from Tesla V100-SXM2-32GB (GPU7) -> Tesla V100-SXM2-32GB (GPU6) : Yes
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 22.52GB/s
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU4: 22.52GB/s
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU5: 44.87GB/s
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU6: 44.87GB/s
cudaMemcpyPeer / cudaMemcpy between GPU1 and GPU0: 22.53GB/s
cudaMemcpyPeer / cudaMemcpy between GPU1 and GPU2: 44.88GB/s
cudaMemcpyPeer / cudaMemcpy between GPU1 and GPU3: 22.53GB/s
cudaMemcpyPeer / cudaMemcpy between GPU1 and GPU7: 44.88GB/s
cudaMemcpyPeer / cudaMemcpy between GPU2 and GPU1: 44.90GB/s
cudaMemcpyPeer / cudaMemcpy between GPU2 and GPU3: 22.53GB/s
cudaMemcpyPeer / cudaMemcpy between GPU2 and GPU5: 44.88GB/s
cudaMemcpyPeer / cudaMemcpy between GPU2 and GPU7: 22.53GB/s
cudaMemcpyPeer / cudaMemcpy between GPU3 and GPU1: 22.53GB/s
cudaMemcpyPeer / cudaMemcpy between GPU3 and GPU2: 22.53GB/s
cudaMemcpyPeer / cudaMemcpy between GPU3 and GPU4: 44.90GB/s
cudaMemcpyPeer / cudaMemcpy between GPU3 and GPU7: 44.91GB/s
cudaMemcpyPeer / cudaMemcpy between GPU4 and GPU0: 22.53GB/s
cudaMemcpyPeer / cudaMemcpy between GPU4 and GPU3: 44.90GB/s
cudaMemcpyPeer / cudaMemcpy between GPU4 and GPU5: 22.53GB/s
cudaMemcpyPeer / cudaMemcpy between GPU4 and GPU6: 44.88GB/s
cudaMemcpyPeer / cudaMemcpy between GPU5 and GPU0: 44.90GB/s
cudaMemcpyPeer / cudaMemcpy between GPU5 and GPU2: 44.91GB/s
cudaMemcpyPeer / cudaMemcpy between GPU5 and GPU4: 22.53GB/s
cudaMemcpyPeer / cudaMemcpy between GPU5 and GPU6: 22.53GB/s
cudaMemcpyPeer / cudaMemcpy between GPU6 and GPU0: 44.88GB/s
cudaMemcpyPeer / cudaMemcpy between GPU6 and GPU4: 44.88GB/s
cudaMemcpyPeer / cudaMemcpy between GPU6 and GPU5: 22.53GB/s
cudaMemcpyPeer / cudaMemcpy between GPU6 and GPU7: 22.53GB/s
cudaMemcpyPeer / cudaMemcpy between GPU7 and GPU1: 44.90GB/s
cudaMemcpyPeer / cudaMemcpy between GPU7 and GPU2: 22.53GB/s
cudaMemcpyPeer / cudaMemcpy between GPU7 and GPU3: 44.90GB/s
cudaMemcpyPeer / cudaMemcpy between GPU7 and GPU6: 22.53GB/s
Test passed
@Ruberik Thank you very much for analyzing this issue in depth. I will try to reproduce it on an NVLink-capable machine next week and let you know the details. I currently believe the problem might be related to one of the memcpy functions we leverage from the CUDA API...
hi @Ruberik, I've put up a PR that attempts to fix the copy performance. Are you able to try it out? I don't have NVLink devices to test the behaviour.
Thanks for working on this! It doesn't seem to have worked:
Rate 0->1: 12,172,984,374 bytes / second
Rate 0->2: 11,491,944,864 bytes / second
Rate 0->3: 12,466,995,724 bytes / second
Rate 0->4: 11,477,854,643 bytes / second
Rate 0->5: 11,473,243,221 bytes / second
Rate 0->6: 8,593,898,098 bytes / second
Rate 0->7: 11,452,348,259 bytes / second
Rate 1->0: 13,735,394,786 bytes / second
Rate 1->2: 11,488,551,206 bytes / second
Rate 1->3: 12,470,572,118 bytes / second
Rate 1->4: 11,476,903,847 bytes / second
Rate 1->5: 11,477,511,111 bytes / second
Rate 1->6: 12,468,675,071 bytes / second
Rate 1->7: 8,606,724,590 bytes / second
Rate 2->0: 15,213,968,818 bytes / second
Rate 2->1: 11,486,935,006 bytes / second
Rate 2->3: 8,603,762,559 bytes / second
Rate 2->4: 11,450,552,953 bytes / second
Rate 2->5: 12,471,716,417 bytes / second
Rate 2->6: 11,480,106,511 bytes / second
Rate 2->7: 12,475,215,801 bytes / second
Rate 3->0: 11,488,723,300 bytes / second
Rate 3->1: 12,474,940,416 bytes / second
Rate 3->2: 8,606,145,127 bytes / second
Rate 3->4: 11,451,462,748 bytes / second
Rate 3->5: 11,483,145,171 bytes / second
Rate 3->6: 12,472,339,352 bytes / second
Rate 3->7: 11,482,340,843 bytes / second
Rate 4->0: 13,763,547,491 bytes / second
Rate 4->1: 12,471,694,688 bytes / second
Rate 4->2: 11,482,181,219 bytes / second
Rate 4->3: 12,468,023,546 bytes / second
Rate 4->5: 8,607,863,049 bytes / second
Rate 4->6: 11,454,296,863 bytes / second
Rate 4->7: 11,485,718,546 bytes / second
Rate 5->0: 12,480,305,364 bytes / second
Rate 5->1: 11,480,720,252 bytes / second
Rate 5->2: 12,467,553,043 bytes / second
Rate 5->3: 11,476,357,977 bytes / second
Rate 5->4: 8,597,521,045 bytes / second
Rate 5->6: 11,456,258,355 bytes / second
Rate 5->7: 12,475,926,061 bytes / second
Rate 6->0: 9,830,523,002 bytes / second
Rate 6->1: 11,459,327,206 bytes / second
Rate 6->2: 11,479,363,972 bytes / second
Rate 6->3: 11,474,285,372 bytes / second
Rate 6->4: 12,468,660,591 bytes / second
Rate 6->5: 11,475,548,469 bytes / second
Rate 6->7: 12,473,295,606 bytes / second
Rate 7->0: 12,519,128,021 bytes / second
Rate 7->1: 8,606,083,046 bytes / second
Rate 7->2: 11,454,663,446 bytes / second
Rate 7->3: 11,470,859,245 bytes / second
Rate 7->4: 12,472,266,915 bytes / second
Rate 7->5: 11,476,867,045 bytes / second
Rate 7->6: 12,473,614,390 bytes / second
Clearer results, with less crap. From ILGPU, with #664 incorporated:
Table is in GB/s. Copying one way.
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 11.94 11.30 11.30 11.30 11.30 7.67 11.01
GPU1 11.94 11.31 11.30 11.30 11.30 11.31 7.81
GPU2 12.27 11.30 7.68 11.01 11.31 11.31 11.30
GPU3 11.61 11.31 7.81 11.01 11.31 11.31 11.30
GPU4 12.27 11.30 11.03 11.30 7.81 10.74 11.30
GPU5 11.61 11.30 11.30 11.31 7.67 10.74 11.30
GPU6 8.10 10.74 11.30 11.30 11.30 11.30 11.31
GPU7 11.62 7.81 11.01 11.30 11.30 11.31 11.31
Table is in GB/s. Copying both ways simultaneously. The speed is the one-way speed of the one that finishes last.
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 8.77 8.77 8.43 8.59 8.76 5.30 8.26
GPU1 8.43 8.95 8.59 8.43 8.94 8.77 5.37
GPU2 9.35 8.94 5.31 8.42 9.35 9.54 9.54
GPU3 8.76 8.43 5.37 8.10 8.76 8.77 8.77
GPU4 9.15 8.43 8.94 8.43 5.37 8.26 8.77
GPU5 8.95 8.77 9.53 8.77 5.37 8.77 9.54
GPU6 5.58 8.42 9.53 8.78 8.77 9.34 9.54
GPU7 8.77 5.37 8.94 8.77 8.95 9.34 9.54
From C, where I was improvising more than I was comfortable with, so the both-way numbers might not be optimal:
Table is in GB/s. Copying one way.
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 22.42 10.47 10.49 22.52 44.86 44.85 10.46
GPU1 22.52 44.89 22.52 10.50 10.54 10.60 44.89
GPU2 10.55 44.89 22.52 10.63 44.89 10.55 22.52
GPU3 10.55 22.52 22.52 44.87 10.54 10.60 44.90
GPU4 22.51 10.47 10.47 44.88 22.52 44.88 10.44
GPU5 44.88 10.50 44.89 10.48 22.52 22.52 10.43
GPU6 44.88 10.49 10.49 10.44 44.88 22.52 22.53
GPU7 10.53 44.89 22.52 44.90 10.54 10.58 22.52
Table is in GB/s. Copying both ways simultaneously. The speed is the one-way speed of the one that finishes last.
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 22.50 7.94 7.68 22.49 44.77 44.76 7.97
GPU1 22.50 44.80 22.49 7.71 7.97 7.96 44.79
GPU2 8.02 44.80 22.50 8.06 44.79 8.60 22.50
GPU3 7.71 22.50 22.50 44.79 7.97 7.96 44.78
GPU4 22.49 7.69 7.95 44.79 22.49 44.77 7.95
GPU5 44.77 8.02 44.80 8.02 22.49 22.49 8.56
GPU6 44.79 8.04 8.55 8.05 44.77 22.49 22.49
GPU7 8.05 44.80 22.50 44.79 8.06 8.56 22.50
The most relevant code bits follow.
ILGPU:
for (int k = 0; k < 100; k++) {
    Parallel.Invoke(
        () => { buffers[acc1][acc1].CopyTo(streams[acc1], buffers[acc2][acc1]); },
        () => { if (copyBothWaysSimultaneously) buffers[acc2][acc2].CopyTo(streams[acc2], buffers[acc1][acc2]); });
}
C:
for (int k = 0; k < 100; k++) {
    cudaMemcpyAsync(g0, g1, buf_size, cudaMemcpyDefault, stream0);
    if (copy_both_ways_simultaneously) {
        cudaMemcpyAsync(g3, g2, buf_size, cudaMemcpyDefault, stream1);
    }
    checkCudaErrors(cudaSetDevice(gpuid[0]));
    checkCudaErrors(cudaStreamSynchronize(stream0));
    checkCudaErrors(cudaSetDevice(gpuid[1]));
    checkCudaErrors(cudaStreamSynchronize(stream1));
}
@Ruberik Thank you for your efforts and the time you put into investigating this issue 👍 I am currently trying to reproduce your problem on a set of A100 cards. It turns out that (at least...) one problem is related to the fact that "PeerAccess" between the different accelerators is not properly enabled. However, fixing this still results in "strange" numbers being reported by the application you provided... I'll continue my investigation, so stay tuned 🚀.
Thanks, @m4rs-mt! I really appreciate the hard work you put into this project, and @MoFtZ as well. Please let me know if you want the full code that spits out a table like the one in my latest message.
@Ruberik I have analyzed the problem in detail and found a solution that fixes this performance issue 🤞. The problem was related to an invalid peer-access accelerator registration (see #675 for more information). First, I reproduced the problem with 2x A100 devices with NVLink capabilities. The CUDA example gives the following output:
// Checking GPU(s) for support of peer to peer memory access...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 243.65GB/s
cudaMemcpyPeer / cudaMemcpy between GPU1 and GPU0: 243.73GB/s
My sample program (see below), written in ILGPU, outputs the following with and without peer access:
// Without peer access :(
-> GPU0 => GPU1 = 12.644GB/s
-> GPU1 => GPU0 = 12.898GB/s
// With peer access :)
-> GPU0 => GPU1 = 241.002GB/s
-> GPU1 => GPU0 = 241.226GB/s
The program used for benchmarking:
static void Main(string[] args)
{
    const long Length = 1024L * 1024L * 16L * sizeof(float);
    const int NumRuns = 100;

    using var context = Context.Create(builder => builder.Cuda(
        device => device.DeviceId < 2));
    var accls = new List<CudaAccelerator>(context.Devices.Length);
    foreach (var device in context.Devices)
        accls.Add(device.CreateAccelerator(context) as CudaAccelerator);

    // Enable peer access
    for (int i = 0; i < accls.Count; ++i)
    {
        for (int j = 0; j < accls.Count; ++j)
        {
            // Skip invalid peer access on the same device
            if (i == j)
                continue;
            bool canAccess = accls[i].CanAccessPeer(accls[j]);
            if (!canAccess)
                throw new NotSupportedException("Not supported peer config");
            // Enable the actual access in both directions
            if (!accls[i].EnableBidirectionalPeerAccess(accls[j]))
                throw new NotSupportedException("Not supported peer access");
        }
    }

    // Allocate memory on all devices
    var buffers = new List<MemoryBuffer1D<byte, Stride1D.Dense>>(accls.Count);
    foreach (var accl in accls)
        buffers.Add(accl.Allocate1D<byte>(Length));

    // Perform the measurements
    var stream = accls[0].CreateStream();
    var watch = new Stopwatch();
    for (int i = 0; i < accls.Count; ++i)
    {
        for (int j = 0; j < accls.Count; ++j)
        {
            if (i == j)
                continue;
            var source = buffers[i];
            var target = buffers[j];
            watch.Restart();
            for (int r = 0; r < NumRuns; ++r)
            {
                source.CopyTo(stream, target);
            }
            stream.Synchronize();
            watch.Stop();
            // NumRuns copies of Length bytes each, converted to GiB/s
            double gbS = (1.0 / (watch.Elapsed.TotalMilliseconds / 1000.0)) *
                ((double)NumRuns * Length) / (1024.0 * 1024.0 * 1024.0);
            Console.WriteLine($"-> GPU{i} => GPU{j} = {Math.Round(gbS, 3)}GB/s");
        }
    }

    stream.Dispose();
    foreach (var buf in buffers)
        buf.Dispose();
    foreach (var accl in accls)
        accl.Dispose();
}
Awesome! I'm ready to test as soon as I can get a machine with NVLink on Azure... (Update 6 days later: Still trying several times a day, but we'll run your code exactly, and my own code, when we get one.)
Update 34 days later: I finally have access to a machine with NVSwitch, and this appears to have worked!
I'll preface this by saying that it's probably working sometimes, since otherwise you wouldn't have closed @yurygotham's #378. But I can say fairly confidently that it isn't working on an Azure ND40rs_v2 machine in ILGPU, though it does work from C code.
Here's a snippet of the output of my C# program using ILGPU:
Here's a snippet of the output of my adapted version of NVIDIA's simpleP2P sample:
Code follows. Output is attached.
C# Program
C program
Note that I took NVIDIA's sample code, modified simpleP2P.cu to contain the following, and built it.