stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License

GPU crashes when running "D_packed @ Q.to(dtype=D_packed.dtype).T" with no error message #348

Open Jimmy9507 opened 5 months ago

Jimmy9507 commented 5 months ago

Hey,

I tried to run ColBERT model inference via Triton server on a multi-GPU instance.

GPU 0 works fine. However, the other GPU devices (1, 2, 3, etc.) crash when execution reaches this line:

D_packed @ Q.to(dtype=D_packed.dtype).T

with no error message.

Has anyone seen the same error before?

Jimmy9507 commented 5 months ago

After digging deeper into the code, it looks like these lines https://github.com/stanford-futuredata/ColBERT/blob/862edcf5ec35fd377ecb8575d753bbefdda463e6/colbert/indexing/codecs/decompress_residuals.cu#L42-L50

restrict the C++ function decompress_residuals_cuda to GPU device 0 only. As a result, decompress_residuals_cuda crashes when it runs on any other GPU.

After I updated the device setting to .device(torch::kCUDA, residuals.device().index()), the crash was resolved.

Should we update it to .device(torch::kCUDA, residuals.device().index())? This should also significantly improve inference throughput by allowing the model to run on multiple GPUs.
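
To illustrate the idea behind the proposed change, here is a minimal Python-level sketch of the same principle: allocate the output buffer on whatever device the input tensor lives on, rather than on a hard-coded device 0. This is not the actual ColBERT extension code. The helper name alloc_output_like, the tensor shapes, and the float16 dtype are assumptions made for illustration only.

```python
import torch


def alloc_output_like(residuals: torch.Tensor, packed_dim: int) -> torch.Tensor:
    """Hypothetical helper: allocate an output buffer on the SAME device
    as `residuals`, instead of hard-coding device index 0."""
    return torch.zeros(
        (residuals.size(0), residuals.size(1), packed_dim),
        dtype=torch.float16,          # assumed dtype, for illustration
        device=residuals.device,      # follow the input's device
    )


# Runs on CPU as written; on a multi-GPU machine the same call would
# follow a tensor placed on "cuda:1", "cuda:2", and so on.
residuals = torch.zeros((4, 16), dtype=torch.uint8)
out = alloc_output_like(residuals, packed_dim=8)
assert out.device == residuals.device
```

The design point is that a kernel reading from one device and writing to a buffer on another can fail in ways that do not surface as a Python exception, so deriving the output device from the input avoids the mismatch.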

I'm wondering whether this is a bug or an intentional design choice.