vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Use 64-bit integers as indices in cuda kernels #5781

Open courage17340 opened 3 weeks ago

courage17340 commented 3 weeks ago

🚀 The feature, motivation and pitch

I found that some kernels use 32-bit integers as indices, which can easily overflow. I think changing them to int64_t (or other 64-bit types) would be safer and should have little impact on performance.

For example, if a tensor's numel >= 2^31, the fp8 quantization will fail: https://github.com/vllm-project/vllm/blob/edd5fe5fa29b8f9cc5fa37a30cc7211e0ff37067/csrc/quantization/fp8/common.cu#L43

Alternatives

No response

Additional context

No response

simon-mo commented 3 weeks ago

@mgoin