CUPTI PC Sampling (see #294) can only be performed from within the process that executes the CUDA kernels.
This means that CUPTI support in lo2s can only be implemented as a separate CUPTI sampling support library that is injected into the application under measurement using LD_PRELOAD.
The injected library then needs some mechanism to communicate with lo2s itself, most likely a ring buffer in shared memory.
Since such a foreign interface might be useful beyond CUPTI, I think this inter-process interface warrants its own discussion.
There are two direct questions:
- What should the technical solution look like? `shm_open` + `mmap` + our own ring buffer implementation, or is there already a turnkey solution for it?
- How much genericity should we bake into the design?
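
To make the first option more concrete, here is a minimal sketch of what `shm_open` + `mmap` + a hand-written ring buffer could look like. It is not existing lo2s code: `RingHeader`, `Ring`, `ring_open`, `ring_write`, `ring_read` and `ring_close` are hypothetical names made up for this discussion. The sketch assumes exactly one producer (the injected library) and one consumer (lo2s), that both sides agree on the capacity out of band, and that 64-bit atomics are lock-free so they work across processes.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <new>
#include <system_error>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical segment layout: this header, followed by `capacity` data bytes.
struct RingHeader
{
    std::atomic<std::uint64_t> head{ 0 }; // total bytes written (producer only)
    std::atomic<std::uint64_t> tail{ 0 }; // total bytes read (consumer only)
    std::uint64_t capacity = 0;           // size of the data area in bytes
};
static_assert(std::atomic<std::uint64_t>::is_always_lock_free,
              "cross-process use assumes lock-free atomics");

struct Ring
{
    RingHeader* hdr;
    unsigned char* data;
    std::size_t map_size;
};

// create == true: the side that owns the segment (assumed to be lo2s).
// create == false: the side that attaches (assumed to be the injected library).
Ring ring_open(const char* name, std::size_t capacity, bool create)
{
    assert(capacity > 0);
    int fd = shm_open(name, create ? (O_CREAT | O_EXCL | O_RDWR) : O_RDWR, 0600);
    if (fd < 0)
        throw std::system_error(errno, std::generic_category(), "shm_open");
    const std::size_t size = sizeof(RingHeader) + capacity;
    if (create && ftruncate(fd, static_cast<off_t>(size)) != 0)
    {
        const int err = errno;
        close(fd);
        throw std::system_error(err, std::generic_category(), "ftruncate");
    }
    void* mem = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    const int err = errno;
    close(fd); // the mapping stays valid after the descriptor is closed
    if (mem == MAP_FAILED)
        throw std::system_error(err, std::generic_category(), "mmap");
    auto* hdr = create ? new (mem) RingHeader() : static_cast<RingHeader*>(mem);
    if (create)
        hdr->capacity = capacity;
    return Ring{ hdr, static_cast<unsigned char*>(mem) + sizeof(RingHeader), size };
}

// Producer side. Returns false if there is not enough free space.
bool ring_write(Ring& r, const void* src, std::size_t len)
{
    const std::uint64_t cap = r.hdr->capacity;
    const std::uint64_t head = r.hdr->head.load(std::memory_order_relaxed);
    const std::uint64_t tail = r.hdr->tail.load(std::memory_order_acquire);
    if (len > cap - (head - tail))
        return false;
    const std::size_t pos = head % cap;
    const std::size_t first = std::min<std::size_t>(len, cap - pos);
    std::memcpy(r.data + pos, src, first);
    std::memcpy(r.data, static_cast<const unsigned char*>(src) + first, len - first);
    r.hdr->head.store(head + len, std::memory_order_release); // publish the data
    return true;
}

// Consumer side. Returns false if fewer than len bytes are available.
bool ring_read(Ring& r, void* dst, std::size_t len)
{
    const std::uint64_t cap = r.hdr->capacity;
    const std::uint64_t tail = r.hdr->tail.load(std::memory_order_relaxed);
    const std::uint64_t head = r.hdr->head.load(std::memory_order_acquire);
    if (head - tail < len)
        return false;
    const std::size_t pos = tail % cap;
    const std::size_t first = std::min<std::size_t>(len, cap - pos);
    std::memcpy(dst, r.data + pos, first);
    std::memcpy(static_cast<unsigned char*>(dst) + first, r.data, len - first);
    r.hdr->tail.store(tail + len, std::memory_order_release); // free the space
    return true;
}

void ring_close(Ring& r)
{
    munmap(r.hdr, r.map_size);
}
```

In this sketch lo2s would call `ring_open(name, capacity, true)` before starting the application and pass the segment name to the injected library, for example through an environment variable. A real design would additionally need to settle on a record format, an overflow policy (drop or block) and a way to wake up the consumer, all of which are left open here.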