oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.

Support for MPT 4bit #1877

Closed: Nixellion closed this 11 months ago

Nixellion commented 1 year ago

Description

Support for loading 4bit quantized MPT models

Additional Context

Occam released it and added support for loading it to his GPTQ fork and his KoboldAI fork, which may be useful as a reference for the changes that need to be made.

https://huggingface.co/OccamRazor/mpt-7b-storywriter-4bit-128g

FrederikAbitz commented 1 year ago

I am also interested. The extremely long context opens up new possibilities. I think this would be a really attractive feature to have.

cornpo commented 1 year ago

This was working last night. It broke today, right around the "superbooga" commit, so I reset to 8c06eeaf8... I don't know if that's exactly where it broke, but it does work on that commit.

But it won't load the quantized https://huggingface.co/OccamRazor/mpt-7b-storywriter-4bit-128g:

Traceback (most recent call last):
  File "~/text-generation-webui/server.py", line 59, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "~/text-generation-webui/modules/models.py", line 159, in load_model
    model = load_quantized(model_name)
  File "~/text-generation-webui/modules/GPTQ_loader.py", line 149, in load_quantized
    exit()
  File "~/anaconda3/envs/text/lib/python3.10/_sitebuiltins.py", line 26, in __call__
    raise SystemExit(code)
SystemExit: None

Or on the latest 9754d6a (with high CPU usage on one core):

~/.cache/huggingface/modules/transformers_modules/OccamRazor_mpt-7b-storywriter-4bit-128g/attention.py:148: UserWarning: Using attn_impl: torch. If your model does not use alibi or prefix_lm we recommend using attn_impl: flash otherwise we recommend using attn_impl: triton.
You are using config.init_device='cpu', but you can also use config.init_device="meta" with Composer + FSDP for fast initialization.

Both commits load https://huggingface.co/mosaicml/mpt-7b-storywriter and santacoder.

ClayShoaf commented 1 year ago

Are you using the --trust-remote-code flag?
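
For context, MPT ships its own modelling code, so loading it requires trusting remote code; the webui's --trust-remote-code flag presumably forwards that setting to Transformers. A minimal sketch of the equivalent plain-Transformers load, using only the model name from this thread (this is illustrative, not the webui's own loader):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mosaicml/mpt-7b-storywriter"  # unquantized checkpoint mentioned above

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,  # required: MPT's model classes live in the model repo, not in transformers itself
    torch_dtype="auto",
)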

silvestron commented 1 year ago

Is it okay to use llama for this model? When I tried to set the model type to mpt I got this error:

ERROR:Unknown pre-quantized model type specified. Only 'llama', 'opt' and 'gptj' are supported

When using model type llama it does work, but the output is nonsensical:

Common sense questions and answers

Question: Who's the president of the United States? Factual answer: The President is a man named George Bush, but I'm not sure what he looks like. What do you think?"—and so on," says my friend David Egan (a former colleague from Stony Brook University) in his book The Last Days Of America, which was published by Random House Canada last year."I donned an orange-colored baseball cap with white stripes running down each side; it had been given to me as part-payment for services rendered at one time or another over several years ago – if anyone knows where that guy has gone now would be appreciated! It seems unlikely we'll ever see him again!" In fact there are two things about this story — neither will die out completely until after midnight tonight...but then who cares?! This article may well become obsolete before long anyway because even though they're both dead already - "It isn't going anywhere,' said Darth Vader when asked how much money does she want?'" If your name wasn't

I also get 1.12 tokens/s on an RTX 3060 12GB. The speed is much better using Occam's fork, but the quality is the same.

On top of that, it takes 9 minutes to load the model, both with Ooba's webui and Occam's fork of TavernAI.
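
For reference, the "Unknown pre-quantized model type" error above comes from a model-type gate in the GPTQ loading path. A rough, illustrative sketch of that kind of check; the names here are made up and the real logic in modules/GPTQ_loader.py may differ:

SUPPORTED_TYPES = {"llama", "opt", "gptj"}  # hypothetical constant, mirroring the error message

def check_model_type(model_type: str) -> None:
    # Reject architectures the pre-quantized loader has no support for.
    if model_type.lower() not in SUPPORTED_TYPES:
        raise SystemExit(
            "Unknown pre-quantized model type specified. "
            "Only 'llama', 'opt' and 'gptj' are supported"
        )

check_model_type("llama")      # passes, which is why forcing the type to llama gets past the loader
try:
    check_model_type("mpt")    # MPT is not among the supported types
except SystemExit as err:
    print(err)

If that matches the real check, forcing the type to llama only bypasses the gate rather than adding genuine MPT support, which would fit the incoherent output reported above.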

ClayShoaf commented 1 year ago

@silvestron that's pretty much what I would expect this model to do. It didn't make sense for everyone to get all excited about 65K context size with no knowledge about whether or not the model would actually be coherent. Given the track record of non-llama models, it was unlikely that it would be up to par in that department. As for the slow speeds, I would guess that's a byproduct of the huge context window, even if you're not using the full context, but I could be wrong.

silvestron commented 1 year ago

@jpturcotte Can you run git show inside the text-generation-webui folder to see what commit you're on? How do you get it to run, though? I have to manually specify the model type as llama.

NicolasMejiaPetit commented 1 year ago

@silvestron Is that with the cache set to true or false in the model's config file?

silvestron commented 1 year ago

@NickWithBotronics "use_cache": false in the config file; I haven't touched it. How much does the webui respect the config file, though? It doesn't seem to care about the model type; it always ignores it.

silvestron commented 1 year ago

I replaced git pull with git checkout 85238de in webui.py, but it looks like going back to an older commit breaks things. Maybe doing a clean install at that commit would be better.

silvestron commented 1 year ago

I'd also add that on a working, up-to-date installation, I tried using llama, gptj, and opt as the model type, and they all gave the same results.

silvestron commented 1 year ago

Are we talking about the 4bit model? That one doesn't work if you don't specify a model type (#1894). I get the same error if I don't give it a model type.

NicolasMejiaPetit commented 1 year ago

I was testing out the WizardLM model when it first came out; the cache was set to false, I set it to true, and got 5x faster responses.
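
A minimal sketch of flipping that setting programmatically instead of editing config.json by hand; the local folder path is illustrative, and editing the file directly does the same thing:

from transformers import AutoConfig

model_dir = "models/OccamRazor_mpt-7b-storywriter-4bit-128g"  # example local model folder

config = AutoConfig.from_pretrained(model_dir, trust_remote_code=True)  # MPT's config is custom code
print(config.use_cache)      # reportedly ships as False for this model
config.use_cache = True      # enable the KV cache so generation reuses past keys/values
config.save_pretrained(model_dir)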

silvestron commented 1 year ago

That actually made token generation faster; however, the initialization time, which takes 9 minutes on my hardware, didn't change. The config has "init_device": "cpu", and the console says you can change it to meta for better speed, but that didn't work for me. Changing it to cuda didn't work either, because it runs out of VRAM (12GB in my case). Maybe with more VRAM the initialization would be faster on the GPU.
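
init_device is part of MPT's custom config (see the console message quoted earlier in the thread), so it can also be overridden at load time. A sketch under that assumption, not a verified fix: per the reports here, "meta" only pays off with Composer + FSDP, and "cuda" needs enough VRAM for the full weights:

from transformers import AutoConfig, AutoModelForCausalLM

name = "mosaicml/mpt-7b-storywriter"  # illustrative checkpoint

config = AutoConfig.from_pretrained(name, trust_remote_code=True)
config.init_device = "cuda:0"  # shipped default is "cpu"; only viable with enough VRAM

model = AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    trust_remote_code=True,
    torch_dtype="auto",
)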

NicolasMejiaPetit commented 1 year ago

Great, I'm also on 12GB of VRAM, so unless I somehow get MPT-7B 4bit working, it's never running on my GPU. I read somewhere it can use up to 20GB when doing full inference, and it's got a context window of essentially a book. It's even rougher because, according to my terminal, this model doesn't support auto-devices.

jpturcotte commented 1 year ago

@silvestron Darnit, wrong thread! Sorry for getting your hopes up...

silvestron commented 1 year ago

@jpturcotte All good, I couldn't do much with this model without more VRAM anyway. I guess multi-GPU is going to be the only way to run models that can handle this many tokens, at least for now.

github-actions[bot] commented 11 months ago

This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.