Questions regarding multi-GPU

LokiWager commented 1 year ago

I couldn't locate the specification for GPU.0 within the code. Where is this detail defined? Although I've searched in scheduler.c, it seems to be absent. The only instances I've noticed are within the initial client setup in client.c and resource assignment in k8s-plugin. Where else might this information be specified?
Given the existing architecture, what potential challenges might we face if we were to extend support for multi-GPU? I presume there might be a requirement for a multi-queue scheduler, an equitable scheduling algorithm for client assignments, and modifications to the k8s-plugin.
Does it only support glibc 2.2.5 and glibc 2.34?

I look forward to your response. Thank you!

grgalex commented 1 year ago

> I couldn't locate the specification for GPU.0 within the code. Where is this detail defined? Although I've searched in scheduler.c, it seems to be absent.

In the README, I mistakenly state that nvshare-scheduler uses the GPU with ID 0. The scheduler is actually GPU-agnostic: we could use the same program to schedule access to a phone booth without changing a single line.

The only place where GPU ID 0 is hardcoded is the following:

https://github.com/grgalex/nvshare/blob/9504cdcdcd21c6935f54877da677272e1493f081/src/client.c#L385

However, my (untested) understanding is that for a container that uses a single GPU, that GPU always has ID 0 w.r.t. NVML, so this is not a problem.
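
To illustrate what "GPU-agnostic" means here, below is a minimal sketch of a scheduler that arbitrates exclusive access to one resource among clients in FIFO order. It is not nvshare's actual scheduler code: the `struct sched` type, the `sched_*` functions, and the fixed-size queue are all hypothetical names chosen for this example. The point is only that nothing in such a scheduler needs to know what the resource is.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WAITERS 16
#define NO_CLIENT (-1)

/* Hypothetical scheduler state: the current holder of the resource plus a
 * FIFO ring buffer of waiting client IDs. Nothing here names a GPU. */
struct sched {
    int holder;
    int queue[MAX_WAITERS];
    size_t head;   /* index of the oldest waiter */
    size_t count;  /* number of waiters currently queued */
};

static void sched_init(struct sched *s) {
    assert(s != NULL);
    s->holder = NO_CLIENT;
    s->head = 0;
    s->count = 0;
}

/* Returns 1 if the client gets the resource immediately, 0 if it was
 * queued, and -1 if the wait queue is full. */
static int sched_request(struct sched *s, int client) {
    assert(s != NULL);
    if (s->holder == NO_CLIENT) {
        s->holder = client;
        return 1;
    }
    if (s->count == MAX_WAITERS)
        return -1;
    s->queue[(s->head + s->count) % MAX_WAITERS] = client;
    s->count++;
    return 0;
}

/* The holder gives up the resource; the oldest waiter (if any) becomes the
 * new holder. A release from a non-holder is ignored. Returns the holder
 * after the call, or NO_CLIENT if the resource is now free. */
static int sched_release(struct sched *s, int client) {
    assert(s != NULL);
    if (s->holder != client)
        return s->holder;
    if (s->count == 0) {
        s->holder = NO_CLIENT;
    } else {
        s->holder = s->queue[s->head];
        s->head = (s->head + 1) % MAX_WAITERS;
        s->count--;
    }
    return s->holder;
}
```

In a design like this, any knowledge of which device is being shared lives entirely on the client side, which is consistent with the single hardcoded ID being in client.c.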

grgalex commented 1 year ago

> Does it only support glibc 2.2.5 & glibc 2.34

It supports many versions of glibc and has worked seamlessly on each one I've tested. The GLIBC_{225, 234} shenanigans exist precisely to make it work across many glibc versions.

See the comment in https://github.com/grgalex/nvshare/blob/9504cdcdcd21c6935f54877da677272e1493f081/src/hook.c:

 * Since we're interposing dlsym() in libnvshare, we use dlvsym() to obtain the
 * address of the real dlsym function.
 *
 * Depending on glibc version, we look for the appropriate symbol.
 *
 * Some context on the implementation:
 *
 * glibc 2.34 removed the internal __libc_dlsym() symbol that NVIDIA uses in
 * their cuHook example:
 * https://github.com/phrb/intro-cuda/blob/d38323b81cd799dc09179e2ef27aa8f81b6dac40/src/cuda-samples/7_CUDALibraries/cuHook/libcuhook.cpp#L43
 *
 * One solution, discussed in apitrace's repo is to use dlvsym(), which also
 * takes a version string as a 3rd argument, in order to obtain the real
 * dlsym().
 * 
 * This is what user 'manisandro' suggested 8 years ago, when warning about
 * using the private __libc_dlsym():
 * https://github.com/apitrace/apitrace/issues/258
 * 
 * The maintainer of the repo didn't heed the warning back then, it came back
 * 8 years later and bit them.
 * 
 * This is also what user "derhass" suggests:
 * https://stackoverflow.com/a/18825060
 * (See section "UPDATE FOR 2021/glibc-2.34").
 * 
 * Given all the above, we obtain the real `dlsym()` as such:
 * real_dlsym=dlvsym(RTLD_NEXT, "dlsym", "GLIBC_2.2.5");
 *
 * Since we have to explicitly use a version argument in dlvsym(), we also have
 * to define and export two versions of dlsym (hence the linker script.), one
 * for each distinct glibc symbol version.
 *
 */
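
The lookup described in the comment can be sketched as follows. This is a standalone illustration, not the code in hook.c: the `dlsym_fn` typedef and the `find_real_dlsym()` and `check_real_dlsym()` helpers are hypothetical names, and trying both version strings in turn is an assumption made for this example. The version strings themselves are the two mentioned above and apply to x86-64 glibc; other architectures use different baseline version names.

```c
#define _GNU_SOURCE /* needed for dlvsym() and RTLD_NEXT */
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Function pointer type matching dlsym()'s signature. */
typedef void *(*dlsym_fn)(void *handle, const char *symbol);

/* Look up the real dlsym() by explicit symbol version, so the lookup never
 * goes through a dlsym() that an interposing library may have replaced.
 * Returns NULL if neither version is found. */
static dlsym_fn find_real_dlsym(void) {
    /* Version strings for x86-64 glibc (assumption: other architectures
     * would need their own baseline version here). */
    static const char *const versions[] = { "GLIBC_2.34", "GLIBC_2.2.5" };
    size_t i;

    for (i = 0; i < sizeof(versions) / sizeof(versions[0]); i++) {
        void *sym = dlvsym(RTLD_NEXT, "dlsym", versions[i]);
        if (sym != NULL)
            return (dlsym_fn)sym;
    }
    return NULL;
}

/* Sanity check: the resolved pointer should behave like dlsym(). */
static void check_real_dlsym(void) {
    dlsym_fn real = find_real_dlsym();
    assert(real != NULL);
    assert(real(RTLD_DEFAULT, "malloc") != NULL);
}
```

On glibc older than 2.34, a program using this would also need to link against libdl (`-ldl`), since `dlvsym()` lived there before being merged into libc.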

grgalex commented 1 year ago

@LokiWager

Feel free to open an issue with your suggested plan (it could be similar to what I proposed, it could be radically different) for implementing any of these features.

Then you can prepare a PR and we can take a look together and hopefully merge! :)
