pokerfaceSad opened 12 months ago
@pokerfaceSad Hi, thanks for the feedback!
For the overhead of UVM in and of itself (i.e., when an app runs alone on the system), you can take a look at chapter 11.3 of my diploma thesis [1].
The overhead of UVM swapping when the GPU lock changes hands (which happens every TQ seconds, assuming more than one app wants to run GPU work) depends on the PCIe bandwidth and the working set size of the application.
Let's assume a GPU has 32 GB/s of PCIe bandwidth and the application that just got the GPU lock uses 32 GB of data; then the UVM swapping overhead is around (2 × 32 GB) / (32 GB/s) = 2 s. We multiply the 32 GB of data by a factor of two to account for the swap-out traffic (data of the previous app) in addition to the swap-in traffic (data of the current app).
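To make the arithmetic concrete, here is a minimal sketch; the helper name and the numbers are illustrative, not part of nvshare:

```python
# Back-of-the-envelope estimate of the UVM swap overhead at each
# GPU lock handoff, per the reasoning above. `swap_overhead_seconds`
# is a hypothetical helper, not an nvshare API.
def swap_overhead_seconds(working_set_gb: float, pcie_bw_gbps: float) -> float:
    # Factor of 2: swap out the previous app's data, swap in the new app's.
    return 2 * working_set_gb / pcie_bw_gbps

# 32 GB working set over a 32 GB/s PCIe link -> ~2 s per handoff.
print(swap_overhead_seconds(32, 32))  # 2.0
```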
You can measure the actual PCIe bandwidth of a GPU by using the bandwidthTest CUDA sample [2].
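As an illustration only (this sketch is my own assumption, not part of nvshare, and no substitute for bandwidthTest), a rough CuPy snippet can give a ballpark host-to-device number. It times a pageable host buffer, so it will typically report a lower figure than bandwidthTest's pinned-memory result:

```python
import time
import cupy as cp
import numpy as np

size = 1 << 28  # 256 MiB buffer
host = np.ones(size, dtype=np.uint8)
dev = cp.empty(size, dtype=cp.uint8)

# Warm-up copy so one-time initialization doesn't skew the timing.
dev.set(host)
cp.cuda.Stream.null.synchronize()

start = time.perf_counter()
dev.set(host)  # host -> device transfer
cp.cuda.Stream.null.synchronize()
elapsed = time.perf_counter() - start

print(f"Host-to-device bandwidth: {size / elapsed / 1e9:.2f} GB/s")
```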
[1] https://dspace.lib.ntua.gr/xmlui/handle/123456789/54290
[2] https://github.com/NVIDIA/cuda-samples/tree/master/Samples/1_Utilities/bandwidthTest
Thanks for your detailed reply!
Any ideas about GPU migration? I see it listed in your Future Improvements.
It seems that it is possible to achieve this with UVM, according to https://dl.acm.org/doi/10.1145/3357223.3362714. What do you think?
I haven't looked at migration thoroughly yet.
(Though a prerequisite for that is nvshare support for multiple GPUs per node, which is relatively simple but not yet implemented.)
Are you perhaps interested in taking a look?
If you want to talk about something in private, you can send me an e-mail :)
Sorry for the late reply.
I have sent you an email :)
I think nvshare is a nice approach for DL development scenarios!
Has there been any testing of the overhead introduced by UVM swapping in training scenarios?
BTW, I have published a solution that addresses long GPU idle times in dev scenarios by dynamically mounting the GPU: https://github.com/pokerfaceSad/GPUMounter