Open LokiWager opened 1 year ago
I couldn't locate the specification for GPU.0 within the code. Where is this detail defined? Although I've searched in scheduler.c, it seems to be absent
I mistakenly state that nvshare-scheduler uses GPU with ID 0 in the README. The scheduler is actually GPU-agnostic. We could use the same program to schedule access to a phone booth and we wouldn't have to change a single line.
The only place where GPU ID 0 is hardcoded is the following:
However, my (untested) understanding is that for a container that uses a single GPU, that GPU always has ID 0 w.r.t. NVML, so this is not a problem.
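To make the "GPU-agnostic" point concrete, here is a minimal sketch of a scheduler that only hands out one exclusive lock to client IDs. It is not the code in scheduler.c: the struct, the function names, and the plain FIFO policy are all hypothetical and assumed for illustration. Note that nothing in it refers to a GPU, which is why the guarded resource could just as well be a phone booth.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WAITERS 16
#define NO_HOLDER (-1)

/* Hypothetical scheduler state: one exclusive lock plus a FIFO ring buffer
 * of waiting client IDs. The resource being guarded is never named. */
struct sched {
    int holder;              /* client currently holding the lock, or NO_HOLDER */
    int queue[MAX_WAITERS];  /* waiting client IDs, oldest at index `head` */
    size_t head;
    size_t count;
};

void sched_init(struct sched *s) {
    assert(s != NULL);
    s->holder = NO_HOLDER;
    s->head = 0;
    s->count = 0;
}

/* Returns 1 if the lock is granted immediately, 0 if the client is queued,
 * and -1 if the wait queue is full. */
int sched_request(struct sched *s, int client) {
    assert(s != NULL);
    if (s->holder == NO_HOLDER) {
        s->holder = client;
        return 1;
    }
    if (s->count == MAX_WAITERS)
        return -1;
    s->queue[(s->head + s->count) % MAX_WAITERS] = client;
    s->count++;
    return 0;
}

/* Releases the lock and passes it to the oldest waiter, if any.
 * Returns the new holder (or NO_HOLDER). A release from a client that
 * does not hold the lock is ignored. */
int sched_release(struct sched *s, int client) {
    assert(s != NULL);
    if (s->holder != client)
        return s->holder;
    if (s->count == 0) {
        s->holder = NO_HOLDER;
        return NO_HOLDER;
    }
    s->holder = s->queue[s->head];
    s->head = (s->head + 1) % MAX_WAITERS;
    s->count--;
    return s->holder;
}
```

In this sketch, a client calls sched_request() before touching the shared resource and sched_release() when it is done; which device the client then uses is entirely the client's business.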
Does it only support glibc 2.2.5 & glibc 2.34?
It supports many versions of glibc and works seamlessly for each one I've tested on. The GLIBC_{225, 234} shenanigans are to make it work seamlessly across many glibc versions.
See the comment in https://github.com/grgalex/nvshare/blob/9504cdcdcd21c6935f54877da677272e1493f081/src/hook.c:
/*
* Since we're interposing dlsym() in libnvshare, we use dlvsym() to obtain the
* address of the real dlsym function.
*
* Depending on glibc version, we look for the appropriate symbol.
*
* Some context on the implementation:
*
* glibc 2.34 remove the internal __libc_dlsym() symbol that NVIDIA uses in
* their cuHook example:
* https://github.com/phrb/intro-cuda/blob/d38323b81cd799dc09179e2ef27aa8f81b6dac40/src/cuda-samples/7_CUDALibraries/cuHook/libcuhook.cpp#L43
*
* One solution, discussed in apitrace's repo is to use dlvsym(), which also
* takes a version string as a 3rd argument, in order to obtain the real
* dlsym().
*
* This is what user 'manisandro' suggested 8 years ago, when warning about
* using the private __libc_dlsym():
* https://github.com/apitrace/apitrace/issues/258
*
* The maintainer of the repo didn't heed the warning back then, it came back
* 8 years later and bit them.
*
* This is also what user "derhass" suggests:
* https://stackoverflow.com/a/18825060
* (See section "UPDATE FOR 2021/glibc-2.34").
*
* Given all the above, we obtain the real `dlsym()` as such:
* real_dlsym=dlvsym(RTLD_NEXT, "dlsym", "GLIBC_2.2.5");
*
* Since we have to explicitly use a version argument in dlvsym(), we also have
* to define and export two versions of dlsym (hence the linker script.), one
* for each distinct glibc symbol version.
*
*/
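For anyone who wants to try the dlvsym() trick in isolation, here is a small sketch. It is not the code from hook.c: the helper name find_real_dlsym and the idea of looping over candidate version strings are my own assumptions for illustration. The two version strings are the ones mentioned in the comment above ("GLIBC_2.2.5" is the baseline symbol version on x86-64). On glibc older than 2.34 you will likely need to link with -ldl.

```c
#define _GNU_SOURCE  /* needed for dlvsym(), RTLD_NEXT and RTLD_DEFAULT */
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Signature of dlsym(), so we can call it through a pointer. */
typedef void *(*dlsym_fn)(void *, const char *);

/* Candidate symbol versions for dlsym, taken from the discussion above.
 * Trying them in a loop is an assumption of this sketch. */
static const char *const dlsym_versions[] = { "GLIBC_2.34", "GLIBC_2.2.5" };

/* Hypothetical helper: returns the address of glibc's dlsym without ever
 * calling dlsym() by name, or NULL if no candidate version matches. */
dlsym_fn find_real_dlsym(void) {
    size_t n = sizeof dlsym_versions / sizeof dlsym_versions[0];
    for (size_t i = 0; i < n; i++) {
        /* dlvsym() is like dlsym() but takes a version string as a
         * third argument. */
        void *sym = dlvsym(RTLD_NEXT, "dlsym", dlsym_versions[i]);
        if (sym != NULL)
            return (dlsym_fn)sym;
    }
    return NULL;
}
```

A library that interposes dlsym() would call a helper like this once, cache the result, and forward every lookup it does not want to intercept to the returned pointer, e.g. `find_real_dlsym()(RTLD_DEFAULT, "strlen")`.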
@LokiWager
Feel free to open an issue with your suggested plan (it could be similar to what I proposed, it could be radically different) for implementing any of these features.
Then you can prepare a PR and we can take a look together and hopefully merge! :)
I couldn't locate the specification for GPU.0 within the code. Where is this detail defined? Although I've searched in scheduler.c, it seems to be absent. The only instances I've noticed are within the initial client setup in client.c and resource assignment in k8s-plugin. Where else might this information be specified?
Given the existing architecture, what potential challenges might we face if we were to extend support for multi-GPU? I presume there might be a requirement for a multi-queue scheduler, an equitable scheduling algorithm for client assignments, and modifications to the k8s-plugin.
Does it only support glibc 2.2.5 & glibc 2.34?
I look forward to your response. Thank you!