pokerfaceSad opened 12 months ago
@pokerfaceSad Hi, thanks for the feedback!
For the overhead of UVM in and of itself (i.e., when an app runs alone on the system), you can take a look at chapter 11.3 of my diploma thesis [1].
The overhead of UVM swapping when the GPU lock changes hands (which happens every TQ seconds, assuming more than one app wants to run GPU work) depends on the PCIe bandwidth and the working set size of the application.
Let's assume a GPU has 32 GB/s of PCIe bandwidth and the application that just got the GPU lock uses 32 GB of data; then the UVM swapping overhead is around (2 × 32 GB) / (32 GB/s) = 2 s. We multiply the 32 GB of data by a factor of two to account for the swap-out traffic (data of the previous app) in addition to the swap-in traffic (data of the current app).
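To make the arithmetic concrete, here is a minimal sketch; the helper name and the numbers are illustrative, not part of nvshare:

```python
# Back-of-the-envelope estimate of the UVM swap overhead at each
# GPU lock handoff, per the reasoning above. `swap_overhead_seconds`
# is a hypothetical helper, not an nvshare API.
def swap_overhead_seconds(working_set_gb: float, pcie_bw_gbps: float) -> float:
    # Factor of 2: swap out the previous app's data, swap in the new app's.
    return 2 * working_set_gb / pcie_bw_gbps

# 32 GB working set over a 32 GB/s PCIe link -> ~2 s per handoff.
print(swap_overhead_seconds(32, 32))  # 2.0
```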
You can measure the actual PCIe bandwidth of a GPU by using the bandwidthTest CUDA sample [2].
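As an illustration only (this sketch is my own assumption, not part of nvshare, and no substitute for bandwidthTest), a rough CuPy snippet can give a ballpark host-to-device number. It times a pageable host buffer, so it will typically report a lower figure than bandwidthTest's pinned-memory result:

```python
import time
import cupy as cp
import numpy as np

size = 1 << 28  # 256 MiB buffer
host = np.ones(size, dtype=np.uint8)
dev = cp.empty(size, dtype=cp.uint8)

# Warm-up copy so one-time initialization doesn't skew the timing.
dev.set(host)
cp.cuda.Stream.null.synchronize()

start = time.perf_counter()
dev.set(host)  # host -> device transfer
cp.cuda.Stream.null.synchronize()
elapsed = time.perf_counter() - start

print(f"Host-to-device bandwidth: {size / elapsed / 1e9:.2f} GB/s")
```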
[1] https://dspace.lib.ntua.gr/xmlui/handle/123456789/54290
[2] https://github.com/NVIDIA/cuda-samples/tree/master/Samples/1_Utilities/bandwidthTest
Thanks for your detailed reply!
Any ideas about GPU migration? I see it listed in your Future Improvements.
It seems that it is possible to achieve this with UVM, according to https://dl.acm.org/doi/10.1145/3357223.3362714. What do you think?
I haven't looked at migration thoroughly yet.
(Though a prerequisite for that is nvshare support for multiple GPUs per node, which is relatively simple but not yet implemented.)
Are you perhaps interested in taking a look?
If you want to talk about something in private, you can send me an e-mail :)
Sorry for the late reply.
I have sent you an email :)
I think nvshare is a nice approach for DL development scenarios!
Has there been any testing of the overhead introduced by UVM swapping in training scenarios?
BTW, I have published a solution that addresses long GPU idle times in dev scenarios by dynamically mounting the GPU: https://github.com/pokerfaceSad/GPUMounter