pyxis-roc / gpu-api-interposer

GPU API interposer generation tools plus libcuda (CUDA Device API) and libcudart (CUDA Runtime API) interposers. Also includes harmonv, for now.

Support cudaMemcpyToSymbol and cudaGetSymbolAddress in libcuda #11

Closed. sree314 closed this issue 3 years ago.

sree314 commented 3 years ago

These are CUDA Runtime API functions, not CUDA Device API functions, so they'll never be supported in libcuda.

Turns out, neither of these has an equivalent in the CUDA Device API. So I suspect that the CUDA runtime converts these functions to cuMemcpy under the hood, along with some sort of ELF symbol lookup and linker magic.
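If that guess is right, the lowering would look roughly like the sketch below, driving libcuda through ctypes: resolve the global's device address with cuModuleGetGlobal_v2, then do an ordinary host-to-device copy. This is a minimal sketch of the presumed lowering, not anything confirmed about the runtime's internals; the module handle, symbol name, and error handling are illustrative.

```python
import ctypes

# Minimal sketch, assuming the runtime lowers cudaMemcpyToSymbol to a
# symbol lookup plus a plain HtoD copy. Assumes cuInit has been called
# and `module` is a CUmodule handle from a prior cuModuleLoad.
libcuda = ctypes.CDLL("libcuda.so.1")

def memcpy_to_symbol(module: ctypes.c_void_p, name: bytes, src: bytes) -> None:
    dptr = ctypes.c_uint64()    # CUdeviceptr
    size = ctypes.c_size_t()
    # The "ELF symbol lookup" step: resolve the global's device
    # address and size from the module that defines it.
    err = libcuda.cuModuleGetGlobal_v2(ctypes.byref(dptr), ctypes.byref(size),
                                       module, ctypes.c_char_p(name))
    assert err == 0, f"cuModuleGetGlobal_v2 failed: {err}"
    assert len(src) <= size.value, "copy larger than the symbol"
    # The copy itself is just a host-to-device memcpy to that address.
    err = libcuda.cuMemcpyHtoD_v2(dptr, src, ctypes.c_size_t(len(src)))
    assert err == 0, f"cuMemcpyHtoD_v2 failed: {err}"
```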

[This is needed for ld_const_s16.cu PTX testcase (among others)]

sree314 commented 3 years ago

Just to clarify: I think these functions should already be supported using cuMemcpy.
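By the same reasoning, cudaGetSymbolAddress would presumably be just the lookup step with no copy at all, since cuModuleGetGlobal_v2 already returns the device pointer. Continuing the hedged ctypes sketch from above (again, illustrative names only):

```python
def get_symbol_address(module: ctypes.c_void_p, name: bytes) -> int:
    # Presumed equivalent of cudaGetSymbolAddress: the lookup alone
    # yields the device pointer; no copy is involved.
    dptr = ctypes.c_uint64()
    size = ctypes.c_size_t()
    err = libcuda.cuModuleGetGlobal_v2(ctypes.byref(dptr), ctypes.byref(size),
                                       module, ctypes.c_char_p(name))
    assert err == 0, f"cuModuleGetGlobal_v2 failed: {err}"
    return dptr.value
```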

bavalpey commented 3 years ago

Trying to replay the trace for the ld_const_s16.cu PTX testcase results in an AssertionError:

```
trace libcudareplay.cuda_device_runtime: INFO: cuInit called from thread 10966
libcudareplay.cuda_device_runtime: WARNING: Function call cuCtxGetDevice failed in trace with error code 201, not calling handler
libcudareplay.cuda_device_runtime: INFO: cuModuleGetFunction _Z23ptx_inline_ld_const_b16PtS_ for compute capability (6, 1)
harmonv.loader: INFO: Found exact match for arch 61 (type 2)
harmonv.loader: INFO: Found exact match for arch 61 (type 2)
harmonv.loader: INFO: Found exact PTX match for arch 61 (type 1)
libcudareplay.cuda_devices: INFO: Registering 944 bytes as PTX/ELF image 140250303228280
libcudareplay.cuda_devices: INFO: Registering 2408 bytes as PTX/ELF image 140250303287192
libcudareplay.cuda_devices: INFO: Registering 576 bytes as PTX/ELF image 140250303349312
libcudareplay.cuda_device_runtime: INFO: Using ELF: and PTX: for _Z23ptx_inline_ld_const_b16PtS_
libcudareplay.libcuda_replay: WARNING: No handler for cuModuleGetGlobal_v2_post found
libcudareplay.cuda_device_runtime: INFO: cuMemAlloc on device 0: 2 bytes at 0x7f29f0a00000
[cuMemAlloc] 0x7f29f0a00000,2
[cuMemcpyHtoD] 0x7f29f0a00000,0x557792138de0,2,0x65 0x65
libcudareplay.libcuda_replay: WARNING: No handler for cuModuleGetGlobal_v2_post found
Traceback (most recent call last):
  File "./demorunner.py", line 92, in <module>
    tr.replay()
  File "/localdisk/bvalpey/projects/ROCetta/gpu-api-interposer/libcuda-replay/libcudareplay/tracerunner.py", line 157, in replay
    self.replayer.replay(self.trace_handler)
  File "/localdisk/bvalpey/projects/ROCetta/gpu-api-interposer/libcuda-replay/libcudareplay/libcuda_replay.py", line 299, in replay
    hf(event, blobstore_data)
  File "/localdisk/bvalpey/projects/ROCetta/gpu-api-interposer/libcuda-replay/libcudareplay/libcuda_replay.py", line 177, in cuMemcpyHtoD_v2_pre
    ev['ByteCount'], data)
  File "/localdisk/bvalpey/projects/ROCetta/gpu-api-interposer/libcuda-replay/libcudareplay/cuda_device_runtime.py", line 44, in checker
    return f(self, *args, **kwargs)
  File "/localdisk/bvalpey/projects/ROCetta/gpu-api-interposer/libcuda-replay/libcudareplay/cuda_device_runtime.py", line 266, in cuMemcpyHtoD
    assert gpu.has_dptr(dstDevice, ByteCount)
AssertionError
```
sree314 commented 3 years ago

What value does dstDevice contain?

The assert is triggered because the destination address of the cuMemcpyHtoD is not one that the library has previously seen in the trace (e.g. through a cuMemAlloc).
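To make the failure concrete, here is a toy model of that bookkeeping; it is illustrative only, not libcudareplay's actual implementation. An address seen in a traced cuMemAlloc passes the check, while an address that came back from cuModuleGetGlobal_v2 (which had no handler) does not:

```python
# Toy model of the replayer's device-pointer tracking.
class DeviceMemModel:
    def __init__(self):
        self._ranges = []          # (base, size) from cuMemAlloc events

    def cuMemAlloc(self, base: int, size: int) -> None:
        self._ranges.append((base, size))

    def has_dptr(self, dptr: int, nbytes: int) -> bool:
        return any(base <= dptr and dptr + nbytes <= base + size
                   for base, size in self._ranges)

gpu = DeviceMemModel()
gpu.cuMemAlloc(0x7f29f0a00000, 2)              # from the trace above
assert gpu.has_dptr(0x7f29f0a00000, 2)         # traced alloc: known
# A module global's address never went through cuMemAlloc, so the
# cuMemcpyHtoD targeting it trips the assertion seen in the trace.
assert not gpu.has_dptr(0x7f29f0b00000, 2)     # illustrative address
```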

sree314 commented 3 years ago

Turns out we may need to implement cuModuleGetGlobal_v2.
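A rough sketch of what such a handler might do, following the *_pre/*_post handler pattern visible in the traceback: record the device range the driver returned for the module global, so that a later cuMemcpyHtoD targeting it passes has_dptr. The event field names and the registration helper below are hypothetical, not the actual libcudareplay API.

```python
def cuModuleGetGlobal_v2_post(self, event, blobstore_data):
    # Hypothetical post-handler: register the (device pointer, size)
    # range returned by cuModuleGetGlobal_v2 with the device's memory
    # tracker. 'dptr', 'bytes', and register_range are assumptions.
    dptr = event['dptr']
    nbytes = event['bytes']
    self.gpu_memory.register_range(dptr, nbytes)
```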

sree314 commented 3 years ago

I've pushed an implementation of cuModuleGetGlobal_v2 to the getsymbol branch.