Closed: Jake36921 closed this issue 1 year ago
How was this model quantized? Was it quantized with act-order + group size, and are you trying to run it with the CUDA kernel?
Oh, I see, it used the 0cc4m GPTQ. It works on my fork but very, very slowly, and autograd fails with a half/float error :(
I get the same error with a Pygmalion model. It's also a safetensors file, if that matters.
I get a similar error with facebook/galactica-125m on an Intel Mac.
Maybe edit the config and try removing "torch_dtype": "float16". Also see if it helps to flip any of the boolean settings from false to true, or true to false.
After setting "use_cache": true, I finally get usable output from this model, but only with the https://github.com/johnsmith0031/alpaca_lora_4bit inference code. Regular GPTQ runs at about half that speed, under 1 it/s, even with no context.
Output generated in 11.76 seconds (2.21 tokens/s, 26 tokens, context 611, seed 492669332)
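In case it helps anyone else trying the config edits mentioned above, here is a minimal sketch of patching the model's config.json (dropping "torch_dtype" and setting "use_cache": true). The model path is a placeholder and the field choices are just what was suggested in this thread, not a verified fix:

```python
# Sketch: patch a model's config.json as suggested in the comments above.
# The path below is a placeholder; point it at your own model folder.
import json
from pathlib import Path

config_path = Path("models/your-model/config.json")
config = json.loads(config_path.read_text())

# Drop the dtype hint so the loader picks its own precision.
config.pop("torch_dtype", None)

# Enable the KV cache; without it, generation can be extremely slow.
config["use_cache"] = True

config_path.write_text(json.dumps(config, indent=2))
print("patched", config_path)
```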
This issue has been closed due to 6 weeks of inactivity. If you believe it is still relevant, please leave a comment below; you can tag a developer in your comment.
Describe the bug
Tried to generate a response, but no output was generated.
Is there an existing issue for this?
Reproduction
Arguments: call python server.py --chat --model-dir models --cpu --wbits 4 --groupsize 128. Run the .bat file, wait for the model to load, and then click Generate.
Screenshot
No response
Logs
System Info