Closed zhyncs closed 1 month ago
This verification was completed on Google Colab. The T4 is sm75 and has 16 GB of VRAM; even in fp16 it can only run models below 7B, and cannot run Llama 3.1 8B at all. After disabling FlashInfer, the Triton backend still errors out, and trying the Triton nightly build did not help either.
cc @ispobock Perhaps you could help take a look at this issue.
@zhyncs Try to specify the `--dtype` as `float16` for T4.
ref: https://github.com/state-spaces/mamba/issues/361#issuecomment-2181263738
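For reference, a minimal launch command applying that workaround might look like the following (the model path is a placeholder, not from the thread):

```shell
# Force fp16 on sm75 hardware such as the T4, since bf16 is unsupported there.
python -m sglang.launch_server \
  --model-path <model-path> \
  --dtype float16
```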
> @zhyncs Try to specify the `--dtype` as `float16` for T4.
Interesting workaround. I'll add sm75 support to FlashInfer. For now it's just a test branch: https://github.com/flashinfer-ai/flashinfer/compare/main...sm75
It might be due to bf16. SM75 doesn't support bf16.
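The constraint above can be expressed as a quick check: hardware bf16 support begins with Ampere (sm80), so Turing (sm75) parts like the T4 must fall back to fp16. A minimal sketch (the helper name is hypothetical, not from any library):

```python
def supports_bf16(compute_capability: tuple[int, int]) -> bool:
    """Hardware bf16 support begins with Ampere (sm80)."""
    major, minor = compute_capability
    return (major, minor) >= (8, 0)

# T4 is Turing (sm75): no bf16, hence the --dtype float16 workaround.
print(supports_bf16((7, 5)))  # → False
print(supports_bf16((8, 0)))  # → True
```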
Fixed by https://github.com/sgl-project/sglang/pull/1233
Thanks for @yzh119 's support!
Describe the bug
T4 does not work without FlashInfer; ref https://github.com/flashinfer-ai/flashinfer/issues/421