Jimmy9507 opened 5 months ago
After diving deep into the code, it looks like this line https://github.com/stanford-futuredata/ColBERT/blob/862edcf5ec35fd377ecb8575d753bbefdda463e6/colbert/indexing/codecs/decompress_residuals.cu#L42-L50 restricts the C++ method decompress_residuals_cuda to GPU device 0 only, so decompress_residuals_cuda crashes when running on any other GPU. After updating it to .device(torch::kCUDA, residuals.device().index()), the crash is resolved.
Should we update it to .device(torch::kCUDA, residuals.device().index())? This should also significantly improve inference efficiency by making model inference possible across multiple GPUs.
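For context, here is a minimal sketch of the proposed change (the function signature and surrounding code are simplified assumptions for illustration, not the exact upstream source): derive the output TensorOptions from the input tensor's device index instead of hardcoding device 0.

```cpp
// Sketch only: names and signature are illustrative, not the exact upstream code.
#include <torch/extension.h>

torch::Tensor decompress_residuals_cuda(const torch::Tensor residuals /* , ... */) {
  // Before: the output options were pinned to GPU 0, which crashes when the
  // input tensors live on another device:
  //   .device(torch::kCUDA, 0)
  //
  // After: allocate the output on the same device as the input tensor:
  auto options = torch::TensorOptions()
                     .dtype(torch::kFloat16)
                     .device(torch::kCUDA, residuals.device().index())
                     .requires_grad(false);

  // The real function would launch the decompression kernel into this buffer;
  // a zero-filled placeholder stands in for that here.
  return torch::zeros({residuals.size(0)}, options);
}
```

With this change, the output buffer lands on whichever GPU the inputs already occupy, so the kernel no longer assumes device 0.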
Wondering if this is a bug or designed intentionally.
Hey,
I tried to run ColBERT model inference via a Triton server on a multi-GPU instance.
GPU 0 works fine; however, the other GPU devices (1, 2, 3, etc.) crash when they reach this line
D_packed @ Q.to(dtype=D_packed.dtype).T
with no error message.
Has anyone seen the same error before?