vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Use 64-bit integers as indices in cuda kernels #5781

Open courage17340 opened 3 weeks ago

courage17340 commented 3 weeks ago

🚀 The feature, motivation and pitch

I found that some kernels use 32-bit integers as indices, which can easily overflow. I think changing them to int64_t (or other 64-bit types) would be safer and should have little impact on performance.

For example, if a tensor's numel >= 2^31, the fp8 quantization will fail: https://github.com/vllm-project/vllm/blob/edd5fe5fa29b8f9cc5fa37a30cc7211e0ff37067/csrc/quantization/fp8/common.cu#L43

Alternatives

No response

Additional context

No response

simon-mo commented 3 weeks ago

@mgoin