I am getting the exact same problem as you with the CUDA extension not installed.
It is also saying to use the "tie_weights" method before using the "infer_auto_device" function.
GPTQ is much slower than GGML for me as well.
Have you checked your model is small enough to fit on your GPU and run efficiently? I did find it sped up, just not very much. The only time I found GPTQ slower was when I was running a 7GB (13B-parameter) model on a 12GB card, because the VRAM was being maxed out.
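For anyone sizing this up, here is a rough back-of-envelope of why a 4-bit 13B model is tight on a 12GB card (the layer count and hidden size are the standard LLaMA-13B figures; the 1GB overhead term is just a guess):

```python
# Back-of-envelope VRAM estimate for a 4-bit quantised 13B LLaMA-family model.
# Assumptions: 2048-token context, fp16 KV cache, ~1 GB of runtime overhead.
params = 13e9
weights_gb = params * 0.5 / 1e9                      # ~4 bits per weight -> ~6.5 GB
n_layers, hidden, ctx = 40, 5120, 2048               # LLaMA-13B dimensions
kv_cache_gb = 2 * n_layers * ctx * hidden * 2 / 1e9  # K+V cache, fp16 -> ~1.7 GB
overhead_gb = 1.0                                    # CUDA context, activations (rough guess)
print(f"{weights_gb + kv_cache_gb + overhead_gb:.1f} GB")  # ~9.2 GB on a 12 GB card
```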
@Ciaranwuk yup!
I am using wiz-vic 7B uncensored GGML with an RTX 3060 (12GB VRAM).
I tried the same model, wiz-vic 7B uncensored GPTQ, and it was probably around 4 times slower.
Maybe I don't have the correct settings for GPTQ. I know how to optimize GGML models with batch size, context length, etc., but I don't know how to tune GPTQ models for my card.
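For reference, a minimal sketch of how those GGML knobs can be passed through ctransformers directly (gpu_layers, context_length and batch_size are ctransformers config options; the model_file name here is just an assumed example, and the chatdocs.yml keys may map slightly differently):

```python
# Minimal sketch: loading a GGML model with ctransformers and the usual knobs.
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Wizard-Vicuna-7B-Uncensored-GGML",
    model_file="Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin",  # assumed file name
    model_type="llama",
    gpu_layers=50,         # offload as many layers as fit in 12 GB VRAM
    context_length=2048,
    batch_size=8,
)
print(llm("What does GPTQ quantization do?", max_new_tokens=64))
```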
I also have NOT figured out how to stream the text generation with GPTQ; it gives me the reply in one chunk!
Got suggestions?
My GGML prompting on wizvic7b is lightning fast; it responds in less than a second.
Eventually managed to get the speed where I was expecting. Turned out I had two versions of CUDA installed at the same time (still not sure from which packages). I had to update nvcc to match the PyTorch installation (11.8), which I got off the PyTorch website. The bottom of Issue #21 (of this repo) has a good step-by-step on the setup.
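In case it helps anyone else hitting the same mismatch, a quick way to check which CUDA version the installed PyTorch was actually built against (compare the output with nvcc --version):

```python
# Sanity-check the PyTorch CUDA build against the installed toolkit.
import torch

print(torch.__version__)          # e.g. 2.0.1+cu118
print(torch.version.cuda)         # should match the nvcc / toolkit version (11.8 here)
print(torch.cuda.is_available())  # False usually means a broken or mismatched install
```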
Very surprised you're getting GGML to run that fast. Have you checked that it is actually drawing from the database? I've found that if the database doesn't exist the models run waaaaaay faster, but obviously don't read the documents.
When I run chatdocs ui command it raises a message "CUDA extension not installed"
If you are seeing this message then it will run very slow. Try installing a prebuilt binary from their releases page:
pip install auto_gptq-0.2.2+cu118-cp310-cp310-win_amd64.whl
I also have NOT figured out how to stream the text generation with GPTQ; it gives me the reply in one chunk!
Only ggml (ctransformers) models support streaming.
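For completeness, streaming with a GGML model through ctransformers looks roughly like this (model id assumed; this is the library call, not the chatdocs UI path):

```python
# Sketch: streaming tokens from a GGML model via ctransformers.
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Wizard-Vicuna-7B-Uncensored-GGML",  # assumed model id
    model_type="llama",
)
for chunk in llm("Explain the difference between GGML and GPTQ:", stream=True):
    print(chunk, end="", flush=True)
```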
@Ciaranwuk it is drawing from the database. It's lightning fast with GGML.
@marella thanks for the clarification about the streaming! I will probably stick with GGML then! :) The 7B models are so fast. I am trying to find a way to make the 13B models as fast because I have 12GB of VRAM, which is why I have tried GPTQ.
@Ananderz 7B models are so fast... Can you please share your code? I am stuck on the question-answering part. I had it set up on a different VM and it was working perfectly, but that VM was gone before I could save my work. I am using Windows Server 2022 and get the same message that the CUDA extension is not installed. I need a fast way to do question answering over documents, and I need it quickly. If there is some other script I can use, please share. I have CUDA, the GPU, etc. set up and available. I was previously using the approach from https://stackoverflow.com/questions/76553771/langchain-prints-context-before-question-and-answer, so any variation of that which uses the GPU and the fastest model (GGML/GPTQ, doesn't matter) is all I need.
That prebuilt wheel cannot be installed on Windows Server with Python 3.11? I checked the releases; none match. :(
@abhishekrai43 if you had it working before, you probably just need to create a new virtual env and reinstall. I also found that my CUDA download (the one marella mentioned higher up) needed to match my nvcc installation, and that I needed to restart my PC after all that. Once I had everything running on CUDA 11.8 and had restarted, the "CUDA extension not installed" message went away.
@Ciaranwuk Thanks for this. Will try
I'm closing this now, as I managed to get it working once I got my environment set right.
I've been using this chatdocs project with a GGML model, which has worked really well, if a bit slow. I have read a lot online about GPTQ models delivering significantly better speeds, but when I trialed this I only got roughly a 2x speed-up.
When I run the chatdocs ui command it raises the message "CUDA extension not installed", but I have installed just about every CUDA-related package I can find online (several of which looked to be a CUDA extension) and the message is still present. Is this likely to be slowing the model down? If so, any idea exactly which package this message wants installed?
I'm also getting the message "skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet", but I have the triton package installed in my env. Any ideas on a likely cause, and whether this is likely to affect the speed?
Just to round off, I am very pleased with this project in general. It looks good, works nicely, and was relatively easy to install (I just had to find a few other packages online, such as cuDNN).