Before submitting a new issue...
[X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
Your current environment
First of all: fantastic project :-) Thank you for everything.
I would like to fix this bug myself, but I do not have the capacity right now, so I am trying to write a good bug report instead.
Model Input Dumps
No response
🐛 Describe the bug
If I run this model in v0.6.2: all works well and good :-)
If I run it in v0.6.3 with enforce-eager: all works well and good :-)
If I drop enforce-eager, I get random repetition on large prompts (6000+ tokens). Or, if I send multiple requests in parallel, I get CUDA: illegal memory access.
My guess is that there is something dynamic in the updated awq_marlin kernels.
My hunch (this is untested): #8973, but I do not fully understand how my non-MoE model would be affected by it.
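A repro along the following lines might help narrow this down. It is an untested sketch, not something that was run for this report: it assumes the model is served through vLLM's OpenAI-compatible server on `http://localhost:8000`, and `MODEL`, the prompt builder, and the `has_repetition` heuristic are all hypothetical placeholders. It sends several long prompts in parallel and flags outputs that contain the same run of words several times.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumptions (not from the report): server URL and model name.
BASE_URL = "http://localhost:8000/v1/completions"
MODEL = "your-awq-model"  # placeholder, replace with the served model name


def build_payload(model, n_words=6000, max_tokens=256):
    """Build a completion request whose prompt has at least n_words words.

    Each word is likely one or more tokens, so this is a crude way to get
    past ~6000 tokens; lower n_words if it exceeds the model's context.
    """
    prompt = " ".join(f"item{i}" for i in range(n_words))
    prompt += "\nSummarize the list above."
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }


def has_repetition(text, n=8, min_repeats=3):
    """Return True if any n-word window occurs at least min_repeats times."""
    words = text.split()
    counts = {}
    for i in range(len(words) - n + 1):
        key = tuple(words[i:i + n])
        counts[key] = counts.get(key, 0) + 1
        if counts[key] >= min_repeats:
            return True
    return False


def send(payload, url=BASE_URL):
    """POST one completion request and return the generated text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.load(resp)["choices"][0]["text"]


def run(parallel=8):
    """Send the same long prompt `parallel` times at once and report results."""
    payload = build_payload(MODEL)
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        outputs = list(pool.map(send, [payload] * parallel))
    for i, out in enumerate(outputs):
        print(i, "repetition" if has_repetition(out) else "ok")
```

Calling `run()` once against a server started with enforce-eager and once without it should show whether the repetition and the illegal memory access only appear in the second case.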