punica-ai / punica

Serving multiple LoRA finetuned LLM as one
https://arxiv.org/abs/2310.18547
Apache License 2.0
883 stars · 40 forks

Why should the CUDA arch be >= 8.0? #42

Closed · yyccli closed 4 months ago

yyccli commented 4 months ago

Thanks for this nice work on serving multiple LoRAs. My question is as the title says : )

yzh119 commented 4 months ago

Because the kernel uses some PTX instructions that are only available on sm80 or later architectures. Supporting earlier architectures such as sm70/sm75 is feasible (just replace these instructions with their slower equivalents), but it will take some time to implement.
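For illustration, here is a minimal sketch of the usual guard pattern (not punica's actual code; the instruction choice is an assumption): cp.async is one example of a PTX instruction whose Target ISA Notes require sm80+, and the fallback branch replaces it with a plain synchronous copy on older architectures.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Hedged sketch: copy 16 bytes from global to shared memory.
// On sm80+ this uses the cp.async PTX instruction (asynchronous copy);
// on sm70/sm75 it falls back to a slower synchronous vector load/store.
__device__ __forceinline__ void copy_16B(void* smem_dst, const void* gmem_src) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
  // cp.async takes a 32-bit shared-memory address as its destination.
  uint32_t dst = static_cast<uint32_t>(__cvta_generic_to_shared(smem_dst));
  asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n" ::"r"(dst),
               "l"(gmem_src));
  asm volatile("cp.async.commit_group;\n");
  asm volatile("cp.async.wait_group 0;\n");  // wait so behavior matches the fallback
#else
  // Slower equivalent for pre-sm80: one 128-bit load and store.
  *reinterpret_cast<uint4*>(smem_dst) = *reinterpret_cast<const uint4*>(gmem_src);
#endif
}
```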

yyccli commented 4 months ago

Thanks for your patient reply. I'm really new to CUDA, but I need to try to support the punica kernels on the sm70 and sm75 archs. I'm using the kernels from vLLM; it seems the kernels in vLLM are just bgmv with some minor modifications. In my own project, when I set TORCH_CUDA_ARCH_LIST to 8.0, everything works fine. When I set it to 7.0 or 7.5, I get lots of errors, but they fall into two categories:

[screenshot: compiler error]

One says that for the bfloat16 type there is no overloaded operator for the += operation.

[screenshot: compiler error]

The other says that the identifier make_bfloat162 is undefined. I checked that both should be provided by cuda_bf16.h, and I read that bfloat16 hardware acceleration only exists on sm80+ archs (am I right?). So should I just use float on sm70/75 and disable bfloat16? Or can you teach me how to find these PTX instructions and replace them?

yzh119 commented 4 months ago

First of all, we are working on unifying the LoRA kernels in punica into FlashInfer (it's on our v0.1.0 release checklist), where we plan to support sm70/sm75.

Regarding your question: native bfloat16 support is only available on sm80 or later architectures; otherwise you can only use software emulation.
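As a concrete illustration of that software path (a hedged sketch, not punica's or vLLM's actual code): the float/bfloat16 conversion intrinsics in cuda_bf16.h work on all architectures, so pre-sm80 code can round-trip through float, and make_bfloat162 can be replaced by filling the struct members directly.

```cuda
#include <cuda_bf16.h>

// Sketch: bfloat16 add that also works on sm70/sm75, where older toolkits
// do not define operator+ / operator+= for __nv_bfloat16 in device code.
__device__ __forceinline__ __nv_bfloat16 bf16_add(__nv_bfloat16 a, __nv_bfloat16 b) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
  return a + b;  // native hardware bfloat16 arithmetic
#else
  // Software emulation: convert to float, add, convert back.
  return __float2bfloat16(__bfloat162float(a) + __bfloat162float(b));
#endif
}

// make_bfloat162 is likewise missing pre-sm80 in older toolkits; building
// the pair from its two scalar members is a portable replacement.
__device__ __forceinline__ __nv_bfloat162 bf16_pair(__nv_bfloat16 x, __nv_bfloat16 y) {
  __nv_bfloat162 v;
  v.x = x;
  v.y = y;
  return v;
}
```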

> can you teach me how to find these PTX instructions and replace them?

You can check the PTX documentation and look at the Target ISA Notes for each instruction.

yyccli commented 4 months ago

Thanks again for your reply : )

zhochengbiao commented 3 months ago

> we are working on unifying the LoRA kernels in punica into FlashInfer (it's on our v0.1.0 release checklist), where we plan to support sm70/sm75

I have also encountered this issue. May I ask when this support will be available?