sophosympatheia closed this issue 5 months ago.
Here is a config.json diff between a model that doesn't have the issue and a similar model that does. The only difference is that the model with the issue includes a quantization_config block.
--- config.json (model without the issue)
+++ config.json (model with the issue)
 {
-    "_name_or_path": "midnight-miqu-70b-v1.5",
+    "_name_or_path": "/home/llm/mergequant/models/BASE/152334
     "architectures": [
         "LlamaForCausalLM"
     ],
     "attention_bias": false,
     "attention_dropout": 0.0,
     "bos_token_id": 1,
     "eos_token_id": 2,
     "hidden_act": "silu",
     "hidden_size": 8192,
     "initializer_range": 0.02,
     "intermediate_size": 28672,
     "max_position_embeddings": 32764,
     "model_type": "llama",
     "num_attention_heads": 64,
     "num_hidden_layers": 80,
     "num_key_value_heads": 8,
     "pad_token_id": 0,
     "pretraining_tp": 1,
     "rms_norm_eps": 1e-05,
     "rope_scaling": null,
     "rope_theta": 1000000,
     "tie_word_embeddings": false,
     "torch_dtype": "float16",
     "transformers_version": "4.36.2",
     "use_cache": true,
-    "vocab_size": 32000
+    "vocab_size": 32000,
+    "quantization_config": {
+        "quant_method": "exl2",
+        "version": "0.0.15",
+        "bits": 5.0,
+        "head_bits": 6,
+        "calibration": {
+            "rows": 100,
+            "length": 2048,
+            "dataset": "(default)"
+        }
+    }
 }
Did you enable autosplit or something?
Nope. I never use autosplit.
I'm also having this issue. It seems like autosplit is stuck on.
I reverted to commit bde7f00cae8306884c31d855092463ca04ce26ac, right after 4-bit cache support was added for exl2, because I knew that version was working for me, but the issue was still present.
I then noticed there was another error logged to the console when I selected the model.
Traceback (most recent call last):
File "/home/llm/.miniconda3/envs/textgen/lib/python3.10/site-packages/gradio/queueing.py", line 407, in call_prediction
output = await route_utils.call_process_api(
File "/home/llm/.miniconda3/envs/textgen/lib/python3.10/site-packages/gradio/route_utils.py", line 226, in call_process_api
output = await app.get_blocks().process_api(
File "/home/llm/.miniconda3/envs/textgen/lib/python3.10/site-packages/gradio/blocks.py", line 1550, in process_api
result = await self.call_function(
File "/home/llm/.miniconda3/envs/textgen/lib/python3.10/site-packages/gradio/blocks.py", line 1185, in call_function
prediction = await anyio.to_thread.run_sync(
File "/home/llm/.miniconda3/envs/textgen/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/home/llm/.miniconda3/envs/textgen/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/home/llm/.miniconda3/envs/textgen/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
result = context.run(func, *args)
File "/home/llm/.miniconda3/envs/textgen/lib/python3.10/site-packages/gradio/utils.py", line 661, in wrapper
response = f(*args, **kwargs)
File "/home/llm/text-generation-webui/modules/models_settings.py", line 199, in update_model_parameters
value = int(value)
ValueError: invalid literal for int() with base 10: '5.0'
That got me looking more carefully at the new quantization section that the newest version of exllamav2 adds to the config.json file.
"quantization_config": {
"quant_method": "exl2",
"version": "0.0.15",
"bits": 5.0,
"head_bits": 6,
"calibration": {
"rows": 100,
"length": 2048,
"dataset": "(default)"
}
}
Turns out that's the problem. If you remove that section from the config.json, the model loads just fine and Textgen respects the GPU split specified in the UI.
What's interesting is that after loading any model that doesn't produce this problem, Textgen will then successfully load the models with the quantization_config entry, until the next time you restart Textgen.
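If you want to script the workaround, here's a minimal sketch (the model path is a placeholder; adjust it to your setup):

import json
from pathlib import Path

# Placeholder path: point this at the exl2 model folder you're loading.
cfg_path = Path("models/your-exl2-model/config.json")
cfg = json.loads(cfg_path.read_text())
cfg.pop("quantization_config", None)  # no-op if the key is already absent
cfg_path.write_text(json.dumps(cfg, indent=2) + "\n")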
Same error here.
Traceback (most recent call last):
File "C:\Users\J\text-generation-webui\installer_files\env\Lib\site-packages\gradio\queueing.py", line 407, in call_prediction
output = await route_utils.call_process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\J\text-generation-webui\installer_files\env\Lib\site-packages\gradio\route_utils.py", line 226, in call_process_api
output = await app.get_blocks().process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\J\text-generation-webui\installer_files\env\Lib\site-packages\gradio\blocks.py", line 1550, in process_api
result = await self.call_function(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\J\text-generation-webui\installer_files\env\Lib\site-packages\gradio\blocks.py", line 1185, in call_function
prediction = await anyio.to_thread.run_sync(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\J\text-generation-webui\installer_files\env\Lib\site-packages\anyio\to_thread.py", line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\J\text-generation-webui\installer_files\env\Lib\site-packages\anyio\_backends\_asyncio.py", line 2144, in run_sync_in_worker_thread
return await future
^^^^^^^^^^^^
File "C:\Users\J\text-generation-webui\installer_files\env\Lib\site-packages\anyio\_backends\_asyncio.py", line 851, in run
result = context.run(func, *args)
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\J\text-generation-webui\installer_files\env\Lib\site-packages\gradio\utils.py", line 661, in wrapper
response = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "C:\Users\J\text-generation-webui\modules\models_settings.py", line 199, in update_model_parameters
value = int(value)
^^^^^^^^^^
ValueError: invalid literal for int() with base 10: '8.0'
Shockingly, changing "8.0" to an integer (8) fixes the problem. /s But considering many models have fractional bits, you probably shouldn't be using int().
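A parse that falls back to float would handle both whole and fractional bpw. A sketch of the idea, not the actual patch:

# Try int first, then fall back to float, since exl2 bpw can be fractional.
def parse_bits(value):
    try:
        return int(value)
    except ValueError:
        return float(value)

parse_bits('8')     # 8
parse_bits('8.0')   # 8.0
parse_bits('4.65')  # 4.65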
Are you guys using a version after this commit: https://github.com/turboderp/exllamav2/commit/5fb2c679cb7f81c9811e24ab1362f2436e1b5546?
I still haven't hit this problem. What would even read the quantization config? Transformers?
Oh, I see. It's in models_settings.py:
if 'quantization_config' in metadata:
    if 'bits' in metadata['quantization_config']:
        model_settings['wbits'] = metadata['quantization_config']['bits']
    if 'group_size' in metadata['quantization_config']:
        model_settings['groupsize'] = metadata['quantization_config']['group_size']
    if 'desc_act' in metadata['quantization_config']:
        model_settings['desc_act'] = metadata['quantization_config']['desc_act']
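So the failure path seems to be: bits is read out of config.json as a float, stored in wbits, and by the time update_model_parameters sees it again it has become the string '5.0' (the quotes in the ValueError show it's a string), which int() rejects. A quick illustration:

# bits is parsed from config.json as a float...
metadata = {"quantization_config": {"bits": 5.0}}
wbits = metadata["quantization_config"]["bits"]  # 5.0

# ...but update_model_parameters receives it back as a string:
value = str(wbits)  # '5.0'
try:
    value = int(value)
except ValueError as e:
    print(e)  # invalid literal for int() with base 10: '5.0'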
Is it autoselecting the Transformers loader for you? I have seen an issue where it chooses the wrong loader, but the one you picked still appears selected. Try flipping between exl2_hf and llama.cpp or something and see if it then respects the settings.
I am getting this issue too with the model I am trying to load. ExLlamav2_HF won't split across GPUs, but regular ExLlamav2 does.
The int() thing should be an obvious fix. You can't read fractional values into an int.
The problem is that GPTQ didn't use fractional bpw, and that's what it's for.
Well, EXL2 does, and that's also what it's for.
This issue appears to still be screwing up exl2. Any progress?
Just wanted to bump this. I'm experiencing the same behavior for exl2 models that are not quantized with a whole number of bits (i.e., 4.0 works fine, while 4.65 results in an OOM error on GPU0 only).
Describe the bug
Within the past week, I've noticed textgen webui sometimes ignores my GPU split string when loading a model with either ExLlamav2_HF or ExLlamav2. The issue doesn't affect all models, but it is consistent for the models it does affect. I think the models with issues have so far all been 70B or 103B Miqu merges.
I have made sure that both GPUs are visible to the OS and that neither of them is filtered through CUDA_VISIBLE_DEVICES or other directives that would stop textgen from using both cards.
What seems to happen is that textgen ignores the second GPU when loading the model. Say I specify a 20,24 GPU split. Textgen should load 20 GB onto Card 0 and then load the rest onto Card 1, which is the typical behavior. Instead, it maxes out Card 0 until it triggers an OOM error, without ever touching Card 1.
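For reference, the split string is just a comma-separated list of per-GPU VRAM budgets in GB. Roughly what a loader does with it (a sketch, assuming the exllamav2-style split):

# "20,24" -> [20.0, 24.0]: fill Card 0 up to 20 GB, then Card 1 up to 24 GB.
gpu_split = [float(x) for x in "20,24".split(",")]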
I am running the latest version of textgen (git pull shows everything up to date) along with the dependencies specified in requirements.txt.
I do set
PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync
prior to running server.py, but I have been doing that for about a month without this issue appearing until recently.

Is there an existing issue for this?
Reproduction
Attempt to load an ExLlamav2 model on a system with two or more GPUs. Textgen ignores the specified GPU split for some models.
Screenshot
No response
Logs
System Info