ut-parla / Parla.py

A Python based programming system for heterogeneous computing

Slowdown of GPU Data Transfers in Python Threads #75

Open insertinterestingnamehere opened 3 years ago

insertinterestingnamehere commented 3 years ago

Creating this as a placeholder to track progress while we figure out where to even submit this upstream.

Currently, GPU transfers performed from Python threads exhibit unexplained, erratic slowdowns. We originally thought these overheads were caused by VECs; however, @dialecticDolt did some additional investigation and found that they were caused entirely by calls to cudaMemcpy from within threads created by Python. He has verified that the issue does not affect OpenMP's thread pool. We haven't yet checked whether it affects threads created through the pthreads interface or C++'s std::thread, so it is possible that OpenMP is simply doing something special rather than Python conflicting with CUDA.
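For anyone trying to reproduce this pattern, a minimal sketch of the per-thread timing harness looks roughly like the following. This is illustrative only: the real benchmark times a host-to-device transfer (e.g. cupy's `cp.asarray`, which issues a cudaMemcpy); a host-side numpy copy stands in here so the sketch runs without a GPU.

```python
import threading
import time

import numpy as np

N = 1 << 22  # ~32 MB of float64 per thread

def timed_copy(results, idx):
    """Each thread times one large copy of its own buffer.

    In the actual benchmark the workload is an HtoD transfer
    (cudaMemcpy via cupy); a numpy copy stands in here so this
    sketch runs anywhere.
    """
    src = np.ones(N)
    start = time.perf_counter()
    dst = src.copy()  # stand-in for the HtoD memcpy
    results[idx] = time.perf_counter() - start
    assert dst[0] == 1.0

results = [None] * 4
threads = [threading.Thread(target=timed_copy, args=(results, i))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

for i, dt in enumerate(results):
    print(f"thread {i}: {dt:.4f}s")
```

The point of the harness is that each thread owns its buffer and its timer, so any slowdown shows up per thread rather than being hidden in an aggregate wall-clock time.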

insertinterestingnamehere commented 3 years ago

@dialecticDolt please feel free to add more info here. Where do we have example code to reproduce this?

wlruys commented 3 years ago

I've added the examples to reproduce this with/without VECs in https://github.com/ut-parla/Parla.py/tree/master/benchmarks/gpu_threading, as well as the MPI and C++ OpenMP comparisons.

For the record, I'm also copying the performance numbers here (from the Slack discussion):

The reported times for the memcpy (timed with nvprof) are, per record: start time, duration, size of transfer, transfer speed, memory type, and device details.

On Zemaitis (OpenMP in C++, allocations and deallocations done ahead of time, only timing the memcpy). Before warmup:

```
1.20125s  2.87918s  -  7.4506GB  2.5877GB/s  Pageable  Device  Tesla P100-SXM2  4  17  [CUDA memcpy HtoD]
1.47087s  2.79998s  -  7.4506GB  2.6609GB/s  Pageable  Device  Tesla P100-SXM2  1   7  [CUDA memcpy HtoD]
1.47157s  2.86431s  -  7.4506GB  2.6012GB/s  Pageable  Device  Tesla P100-SXM2  2  35  [CUDA memcpy HtoD]
1.47182s  2.75349s  -  7.4506GB  2.7059GB/s  Pageable  Device  Tesla P100-SXM2  3  28  [CUDA memcpy HtoD]
```

Warmed up:

```
19.0741s  1.56790s  -  7.4506GB  4.7520GB/s  Pageable  Device  Tesla P100-SXM2  3  27  [CUDA memcpy HtoD]
19.0742s  1.50439s  -  7.4506GB  4.9525GB/s  Pageable  Device  Tesla P100-SXM2  1   7  [CUDA memcpy HtoD]
19.0747s  1.46896s  -  7.4506GB  5.0720GB/s  Pageable  Device  Tesla P100-SXM2  4  17  [CUDA memcpy HtoD]
19.0767s  1.52812s  -  7.4506GB  4.8756GB/s  Pageable  Device  Tesla P100-SXM2  2  37  [CUDA memcpy HtoD]
```

On Zemaitis (multithreading in Python, allocations and deallocations done with cupy, only timing the memcpy):

```
14.5530s  2.51998s  -  7.4506GB  2.9566GB/s  Pageable  Device  Tesla P100-SXM2  1   7  [CUDA memcpy HtoD]
14.5531s  1.89952s  -  7.4506GB  3.9223GB/s  Pageable  Device  Tesla P100-SXM2  2  17  [CUDA memcpy HtoD]
14.5535s  2.10205s  -  7.4506GB  3.5444GB/s  Pageable  Device  Tesla P100-SXM2  3  27  [CUDA memcpy HtoD]
14.5537s  1.79802s  -  7.4506GB  4.1438GB/s  Pageable  Device  Tesla P100-SXM2  4  37  [CUDA memcpy HtoD]
```

Just another sample/trial of the same setup (the first one above is on the low end of the variance):

```
33.5803s  2.30019s  -  7.4506GB  3.2391GB/s  Pageable  Device  Tesla P100-SXM2  2  17  [CUDA memcpy HtoD]
33.5807s  2.17479s  -  7.4506GB  3.4259GB/s  Pageable  Device  Tesla P100-SXM2  3  27  [CUDA memcpy HtoD]
33.5807s  2.39922s  -  7.4506GB  3.1054GB/s  Pageable  Device  Tesla P100-SXM2  1   7  [CUDA memcpy HtoD]
33.5808s  2.40678s  -  7.4506GB  3.0957GB/s  Pageable  Device  Tesla P100-SXM2  4  37  [CUDA memcpy HtoD]
```

On Zemaitis (MPI in Python, allocations and deallocations done with cupy, only timing the memcpy):

```
3.69050s  1.42256s  -  7.4506GB  5.2374GB/s  Pageable  Device  Tesla P100-SXM2  1  7  [CUDA memcpy HtoD]
3.78333s  1.43109s  -  7.4506GB  5.2062GB/s  Pageable  Device  Tesla P100-SXM2  1  7  [CUDA memcpy HtoD]
3.83678s  1.44673s  -  7.4506GB  5.1499GB/s  Pageable  Device  Tesla P100-SXM2  1  7  [CUDA memcpy HtoD]
3.88229s  1.44153s  -  7.4506GB  5.1685GB/s  Pageable  Device  Tesla P100-SXM2  1  7  [CUDA memcpy HtoD]
```

On Frontera (OpenMP in C++, allocations and deallocations done ahead of time, only timing the memcpy):

```
829.65ms  889.74ms  -  7.4506GB  8.3739GB/s  Pageable  Device  Quadro RTX 5000  1   7  [CUDA memcpy HtoD]
882.30ms  1.13772s  -  7.4506GB  6.5487GB/s  Pageable  Device  Quadro RTX 5000  4  17  [CUDA memcpy HtoD]
997.03ms  1.20776s  -  7.4506GB  6.1689GB/s  Pageable  Device  Quadro RTX 5000  2  37  [CUDA memcpy HtoD]
997.41ms  1.21396s  -  7.4506GB  6.1374GB/s  Pageable  Device  Quadro RTX 5000  3  27  [CUDA memcpy HtoD]
```

On Frontera (multiprocessing with MPI in Python, allocations and deallocations done with cupy, only timing the memcpy):

```
3.38638s  1.13888s  -  7.4506GB  6.5420GB/s  Pageable  Device  Quadro RTX 5000  1  7  [CUDA memcpy HtoD]
3.58987s  1.13825s  -  7.4506GB  6.5457GB/s  Pageable  Device  Quadro RTX 5000  1  7  [CUDA memcpy HtoD]
3.69940s  1.21641s  -  7.4506GB  6.1251GB/s  Pageable  Device  Quadro RTX 5000  1  7  [CUDA memcpy HtoD]
3.73599s  1.21368s  -  7.4506GB  6.1388GB/s  Pageable  Device  Quadro RTX 5000  1  7  [CUDA memcpy HtoD]
```

On Frontera (multithreading in Python, allocations and deallocations done with cupy, only timing the memcpy):

```
11.5841s  1.40261s  -  7.4506GB  5.3119GB/s  Pageable  Device  Quadro RTX 5000  1   7  [CUDA memcpy HtoD]
11.5843s  1.57895s  -  7.4506GB  4.7187GB/s  Pageable  Device  Quadro RTX 5000  2  18  [CUDA memcpy HtoD]
11.5845s  1.64461s  -  7.4506GB  4.5303GB/s  Pageable  Device  Quadro RTX 5000  3  29  [CUDA memcpy HtoD]
11.5846s  1.28430s  -  7.4506GB  5.8013GB/s  Pageable  Device  Quadro RTX 5000  4  40  [CUDA memcpy HtoD]
```
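To put a number on the gap on Zemaitis, a quick calculation over the throughputs logged above (warmed-up OpenMP C++ records vs. the first Python-threading trial) gives roughly a 1.35x slowdown for the threaded Python case:

```python
# Throughputs (GB/s) copied from the nvprof records above.
openmp_cpp = [4.7520, 4.9525, 5.0720, 4.8756]   # Zemaitis, C++ OpenMP, warmed up
py_threads = [2.9566, 3.9223, 3.5444, 4.1438]   # Zemaitis, Python threads, first trial

def mean(xs):
    return sum(xs) / len(xs)

slowdown = mean(openmp_cpp) / mean(py_threads)
print(f"OpenMP mean:        {mean(openmp_cpp):.4f} GB/s")
print(f"Python-thread mean: {mean(py_threads):.4f} GB/s")
print(f"slowdown factor:    {slowdown:.2f}x")
```

The Python MPI numbers (5.15-5.24 GB/s per rank) sit at or above the warmed-up OpenMP mean, which is what points the finger at Python threads specifically rather than Python in general.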