oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

ValueError: MPTForCausalLM does not support `device_map='auto'` yet. #1943

Closed · silvacarl2 closed 1 year ago

silvacarl2 commented 1 year ago

Describe the bug

Not sure if this is fixable in your code, but here it is: loading mosaicml_mpt-7b-instruct with

python server.py --verbose --model-menu --trust-remote-code --load-in-8bit

fails with

ValueError: MPTForCausalLM does not support `device_map='auto'` yet.

The full output is in the Logs section below.

If this is not fixable in your code, just close or delete this. I will also research this issue.

Is there an existing issue for this?

Reproduction

python server.py --verbose --model-menu --trust-remote-code --load-in-8bit

Select mosaicml_mpt-7b-instruct from the model menu; loading fails with the same ValueError (full log under Logs below).


Logs

python server.py --verbose  --model-menu  --trust-remote-code  --load-in-8bit
INFO:Gradio HTTP request redirected to localhost :)
WARNING:trust_remote_code is enabled. This is dangerous.
bin /home/silvacarl/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so
INFO:Loading mosaicml_mpt-7b-instruct...
/home/silvacarl/.cache/huggingface/modules/transformers_modules/mosaicml_mpt-7b-instruct/attention.py:148: UserWarning: Using `attn_impl: torch`. If your model does not use `alibi` or `prefix_lm` we recommend using `attn_impl: flash` otherwise we recommend using `attn_impl: triton`.
  warnings.warn('Using `attn_impl: torch`. If your model does not use `alibi` or ' + '`prefix_lm` we recommend using `attn_impl: flash` otherwise ' + 'we recommend using `attn_impl: triton`.')

ValueError: MPTForCausalLM does not support `device_map='auto'` yet.

System Info

WSL Ubuntu 20.04
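
For context: transformers raises this ValueError when a device map has to be inferred but the model class does not define _no_split_modules, which the MPT remote-code implementation did not at the time of this issue; as noted in the comments below, either --auto-devices or --load-in-8bit makes the webui take that path. A minimal sketch of loading the model outside the webui without any device_map, assuming a single GPU with enough VRAM for the fp16 weights:

# Minimal sketch (assumption: one GPU with enough VRAM for the fp16 weights).
# Loading without a device_map avoids transformers' auto-dispatch path, which is
# what raises "MPTForCausalLM does not support `device_map='auto'` yet."
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mosaicml/mpt-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    name,
    trust_remote_code=True,      # MPT ships custom modeling code
    torch_dtype=torch.float16,
).to("cuda")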
dblacknc commented 1 year ago

With neither --auto-devices nor --load-in-8bit on the command line, it loads until it runs out of VRAM on my first 12 GB GPU; it looks like it needs around 14 GB. With either or both of those options, it throws the device_map='auto' error.

I don't see a way to split it across GPUs at this point, and would like one.

silvacarl2 commented 1 year ago

thx, checking that out

Supercabb commented 1 year ago

I fixed the issue this way: https://github.com/oobabooga/text-generation-webui/issues/1828#issuecomment-1538881613. It works, but needs more work before it can be merged.
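
For reference, fixes of this kind generally tell accelerate which block class must not be split across devices when it infers the device map, since the remote-code model does not declare _no_split_modules itself. A rough illustration of that idea (not necessarily the change in the linked comment; the class name "MPTBlock" and the memory limits are assumptions):

# Rough illustration, not the linked patch: build a device map for a model that
# lacks _no_split_modules by naming the transformer block class explicitly.
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

name = "mosaicml/mpt-7b-instruct"
config = AutoConfig.from_pretrained(name, trust_remote_code=True)
with init_empty_weights():                     # build the skeleton without allocating weights
    empty_model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

device_map = infer_auto_device_map(
    empty_model,
    no_split_module_classes=["MPTBlock"],      # assumption: MPT's transformer block class
    max_memory={0: "10GiB", 1: "10GiB", "cpu": "30GiB"},  # example per-device limits
)
print(device_map)  # this dict can then be passed as device_map= instead of "auto"

This mirrors what modules/models.py does in the traceback later in this thread, where infer_auto_device_map is called with no_split_module_classes=model._no_split_modules.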

silvacarl2 commented 1 year ago

super cool, will check it out when merged.

just fyi, we are benchmarking these:

SinanAkkoyun/oasst-sft-7-llama-30b
databricks/dolly-v2-12b
Aeala/GPT4-x-AlpacaDente2-30b
NousResearch/gpt4-x-vicuna-13b
LLMs/Stable-Vicuna-13B
nomic-ai/gpt4all-13b-snoozy
togethercomputer/GPT-NeoXT-Chat-Base-20B
mosaicml/mpt-7b-instruct
mosaicml/mpt-7b-chat
TheBloke/koala-13B-HF
EleutherAI/pythia-12b
mosaicml/mpt-1b-redpajama-200b-dolly
stabilityai/stablelm-tuned-alpha-7b
TheBloke/wizardLM-7B-HF
samwit/koala-7b
couchpotato888/alpaca13b
OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5
THUDM/chatglm-6b
stabilityai/stablelm-tuned-alpha-7b
TheBloke/wizard-vicuna-13B-HF
chaoyi-wu/PMC_LLAMA_7B
TheBloke/stable-vicuna-13B-HF

using your API to determine accuracy and resources needed to run as well as response times.
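
For the response-time part, a hypothetical sketch of timing a single generation request against the webui's blocking API extension (started with --api); the endpoint path, port, and payload/response fields are assumptions and may differ between webui versions:

# Hypothetical sketch: time one generation request against the blocking API
# extension (--api). Endpoint, port, and payload/response fields are assumptions.
import time
import requests

URL = "http://localhost:5000/api/v1/generate"

payload = {
    "prompt": "Summarize what a device_map does in one sentence.",
    "max_new_tokens": 64,
    "temperature": 0.7,
}

start = time.time()
resp = requests.post(URL, json=payload, timeout=300)
elapsed = time.time() - start
resp.raise_for_status()

text = resp.json()["results"][0]["text"]   # assumed response shape
print(f"{elapsed:.2f}s -> {text!r}")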

silvacarl2 commented 1 year ago

ok, so this is new:

python server.py --verbose --model-menu --trust-remote-code --load-in-8bit
INFO:Gradio HTTP request redirected to localhost :)
WARNING:trust_remote_code is enabled. This is dangerous.
bin /home/silvacarl/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so
INFO:Loading TheBloke_stable-vicuna-13B-HF...
Traceback (most recent call last):
  File "/mnt/d/text-generation-webui/server.py", line 885, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/mnt/d/text-generation-webui/modules/models.py", line 219, in load_model
    model = LoaderClass.from_pretrained(checkpoint, *params)
  File "/home/silvacarl/.local/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 471, in from_pretrained
    return model_class.from_pretrained(
  File "/home/silvacarl/.local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2740, in from_pretrained
    raise ValueError(
ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details.

any ideas?
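
This second error is different from the first: the 13B weights do not fit in the available VRAM, so the auto device map spills some modules to CPU, which 8-bit quantization refuses unless fp32 CPU offload is enabled. The error message itself names the workaround; a minimal sketch following the linked transformers quantization docs (the memory limits are placeholders, not measured values):

# Minimal sketch of the offload workaround named in the error message.
# Parameter names follow the transformers/bitsandbytes quantization docs;
# the max_memory limits are placeholders, not measured values.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,    # keep CPU-offloaded modules in fp32
)

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/stable-vicuna-13B-HF",
    device_map="auto",
    max_memory={0: "10GiB", "cpu": "30GiB"},  # placeholder per-device limits
    quantization_config=quant_config,
)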

silvacarl2 commented 1 year ago

How can I disable auto-devices, e.g. something like --auto-devices False?

dblacknc commented 1 year ago

Confirmed it now runs with --load-in-8bit

silvacarl2 commented 1 year ago

nice, checking it out!

we are benchmarking both instruct and chat for these:

SinanAkkoyun/oasst-sft-7-llama-30b
databricks/dolly-v2-12b
Aeala/GPT4-x-AlpacaDente2-30b
NousResearch/gpt4-x-vicuna-13b
LLMs/Stable-Vicuna-13B
nomic-ai/gpt4all-13b-snoozy
togethercomputer/GPT-NeoXT-Chat-Base-20B
mosaicml/mpt-7b-instruct
mosaicml/mpt-7b-chat
TheBloke/koala-13B-HF
EleutherAI/pythia-12b
mosaicml/mpt-1b-redpajama-200b-dolly
stabilityai/stablelm-tuned-alpha-7b
TheBloke/wizardLM-7B-HF
samwit/koala-7b
couchpotato888/alpaca13b
OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5
THUDM/chatglm-6b
stabilityai/stablelm-tuned-alpha-7b
TheBloke/wizard-vicuna-13B-HF
chaoyi-wu/PMC_LLAMA_7B
TheBloke/stable-vicuna-13B-HF
decapoda-research/llama-13b-hf
togethercomputer/RedPajama-INCITE-Instruct-7B-v0.1

we can post results back if anyone is interested.

github-actions[bot] commented 1 year ago

This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.