Grey4sh closed this issue 4 weeks ago.
Thanks a lot for reopening with much more information; it helps us narrow down the issue much faster.
Okay. This is a won't-fix for us. Odd-sized dimensions are an issue in many kernels, and padding is costly and wastes precious GPU resources (you would essentially be computing 25% too much, not counting the padding op itself).
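To make the padding cost concrete, here is a minimal sketch of the arithmetic. The sizes are hypothetical (not taken from this model) and just illustrate how rounding a dimension up to a kernel-friendly multiple inflates compute:

```shell
#!/usr/bin/env bash
# Hypothetical sizes: extra compute from padding an awkward dimension
# up to the next kernel-friendly multiple.
dim=1536        # hypothetical per-shard dimension
multiple=2048   # hypothetical kernel tile size
# round dim up to the next multiple
padded=$(( (dim + multiple - 1) / multiple * multiple ))
# integer percentage of wasted compute
overhead=$(( (padded - dim) * 100 / dim ))
echo "padded to ${padded}, ~${overhead}% extra compute"
```

With these made-up numbers the waste is about a third; the exact percentage depends on how far the real dimension sits from the next multiple.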
Would any of these alternatives work?
Also using 2x 4A100 should be more efficient in general if it works (less communication overhead between shards).
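For reference, a 4-shard launch might look like the sketch below. This is an assumption-laden example, not the poster's actual script: the image tag, volume path, and model id are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical 4-shard TGI launch; adjust the model id and volume to your setup.
model=org/deepseek-v2-gptq   # placeholder, not a real repo id
volume=$PWD/data             # cached weights will be stored here

docker run --gpus all --shm-size 1g -p 8080:80 \
  -v "$volume:/data" \
  ghcr.io/huggingface/text-generation-inference:2.3.1 \
  --model-id "$model" \
  --num-shard 4 \
  --quantize gptq
```

`--num-shard` controls how many GPUs the model is tensor-parallel sharded across; fewer shards per replica means less inter-GPU communication per token.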
If you have trouble with your current settings on 4 shards, there are some new features on main which should fix everything (not in an official release yet; we're still ironing out a few things).
Got it. Thank you for the helpful suggestions.
System Info
TGI version
tgi-2.3.1 docker image
OS version
GPU info
Model being used
Deepseek-coder-V2-instruct-GPTQ quantized with GPTQModel https://github.com/ModelCloud/GPTQModel
quant script
Information
Tasks
Reproduction
docker run script
error message
Expected behavior
TGI now supports GPTQ-quantized MoE models via MoE Marlin, but I still ran into some problems when I tried to deploy DeepSeekV2-GPTQ.
The author of GPTQModel said