sophosympatheia closed this issue 5 months ago.
Here is a config.json diff between a model that doesn't have the issue and a similar model that does. The only difference is that the model with the issue includes a quantization_config block.
--- config.json (model without the issue)
+++ config.json (model with the issue)
 {
-    "_name_or_path": "midnight-miqu-70b-v1.5",
+    "_name_or_path": "/home/llm/mergequant/models/BASE/152334
     "architectures": [
         "LlamaForCausalLM"
     ],
     "attention_bias": false,
     "attention_dropout": 0.0,
     "bos_token_id": 1,
     "eos_token_id": 2,
     "hidden_act": "silu",
     "hidden_size": 8192,
     "initializer_range": 0.02,
     "intermediate_size": 28672,
     "max_position_embeddings": 32764,
     "model_type": "llama",
     "num_attention_heads": 64,
     "num_hidden_layers": 80,
     "num_key_value_heads": 8,
     "pad_token_id": 0,
     "pretraining_tp": 1,
     "rms_norm_eps": 1e-05,
     "rope_scaling": null,
     "rope_theta": 1000000,
     "tie_word_embeddings": false,
     "torch_dtype": "float16",
     "transformers_version": "4.36.2",
     "use_cache": true,
-    "vocab_size": 32000
+    "vocab_size": 32000,
+    "quantization_config": {
+        "quant_method": "exl2",
+        "version": "0.0.15",
+        "bits": 5.0,
+        "head_bits": 6,
+        "calibration": {
+            "rows": 100,
+            "length": 2048,
+            "dataset": "(default)"
+        }
+    }
 }
Did you enable autosplit or something?
Nope. I never use autosplit.
I'm also having this issue. It seems like autosplit is stuck on.
I reverted to commit bde7f00cae8306884c31d855092463ca04ce26ac, right after 4-bit cache support was added for exl2, because I knew that version was working for me, but the issue was still present.
I then noticed there was another error logged to the console when I selected the model.
Traceback (most recent call last):
File "/home/llm/.miniconda3/envs/textgen/lib/python3.10/site-packages/gradio/queueing.py", line 407, in call_prediction
output = await route_utils.call_process_api(
File "/home/llm/.miniconda3/envs/textgen/lib/python3.10/site-packages/gradio/route_utils.py", line 226, in call_process_api
output = await app.get_blocks().process_api(
File "/home/llm/.miniconda3/envs/textgen/lib/python3.10/site-packages/gradio/blocks.py", line 1550, in process_api
result = await self.call_function(
File "/home/llm/.miniconda3/envs/textgen/lib/python3.10/site-packages/gradio/blocks.py", line 1185, in call_function
prediction = await anyio.to_thread.run_sync(
File "/home/llm/.miniconda3/envs/textgen/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/home/llm/.miniconda3/envs/textgen/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/home/llm/.miniconda3/envs/textgen/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
result = context.run(func, *args)
File "/home/llm/.miniconda3/envs/textgen/lib/python3.10/site-packages/gradio/utils.py", line 661, in wrapper
response = f(*args, **kwargs)
File "/home/llm/text-generation-webui/modules/models_settings.py", line 199, in update_model_parameters
value = int(value)
ValueError: invalid literal for int() with base 10: '5.0'
That got me looking more carefully at the new quantization section that the newest version of exllamav2 adds to the config.json file.
"quantization_config": {
"quant_method": "exl2",
"version": "0.0.15",
"bits": 5.0,
"head_bits": 6,
"calibration": {
"rows": 100,
"length": 2048,
"dataset": "(default)"
}
}
Turns out that's the problem. If you remove that section from the config.json, the model loads just fine and Textgen respects the GPU split specified in the UI.
What's interesting is that after loading any model that doesn't produce this problem, Textgen will then successfully load the models with the quantization_config entry, until the next time you restart Textgen.
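If you want to script the workaround, here's a minimal sketch (the model path is a placeholder; adjust it to your setup):

import json
from pathlib import Path

# Placeholder path: point this at the exl2 model folder you're loading.
cfg_path = Path("models/your-exl2-model/config.json")
cfg = json.loads(cfg_path.read_text())
cfg.pop("quantization_config", None)  # no-op if the key is already absent
cfg_path.write_text(json.dumps(cfg, indent=2) + "\n")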
Same error here.
Traceback (most recent call last):
File "C:\Users\J\text-generation-webui\installer_files\env\Lib\site-packages\gradio\queueing.py", line 407, in call_prediction
output = await route_utils.call_process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\J\text-generation-webui\installer_files\env\Lib\site-packages\gradio\route_utils.py", line 226, in call_process_api
output = await app.get_blocks().process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\J\text-generation-webui\installer_files\env\Lib\site-packages\gradio\blocks.py", line 1550, in process_api
result = await self.call_function(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\J\text-generation-webui\installer_files\env\Lib\site-packages\gradio\blocks.py", line 1185, in call_function
prediction = await anyio.to_thread.run_sync(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\J\text-generation-webui\installer_files\env\Lib\site-packages\anyio\to_thread.py", line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\J\text-generation-webui\installer_files\env\Lib\site-packages\anyio\_backends\_asyncio.py", line 2144, in run_sync_in_worker_thread
return await future
^^^^^^^^^^^^
File "C:\Users\J\text-generation-webui\installer_files\env\Lib\site-packages\anyio\_backends\_asyncio.py", line 851, in run
result = context.run(func, *args)
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\J\text-generation-webui\installer_files\env\Lib\site-packages\gradio\utils.py", line 661, in wrapper
response = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "C:\Users\J\text-generation-webui\modules\models_settings.py", line 199, in update_model_parameters
value = int(value)
^^^^^^^^^^
ValueError: invalid literal for int() with base 10: '8.0'
Shockingly, changing "8.0" to an integer (8) fixes the problem. /s But considering many models have fractional bits, you probably shouldn't be using int().
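A parse that falls back to float would handle both whole and fractional bpw. A sketch of the idea, not the actual patch:

# Try int first, then fall back to float, since exl2 bpw can be fractional.
def parse_bits(value):
    try:
        return int(value)
    except ValueError:
        return float(value)

parse_bits('8')     # 8
parse_bits('8.0')   # 8.0
parse_bits('4.65')  # 4.65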
Are you guys using a version after this commit: https://github.com/turboderp/exllamav2/commit/5fb2c679cb7f81c9811e24ab1362f2436e1b5546?
I still haven't hit this problem. What would even read the quantization config? Transformers?
Oh, I see. It's in models_settings.py:
if 'quantization_config' in metadata:
    if 'bits' in metadata['quantization_config']:
        model_settings['wbits'] = metadata['quantization_config']['bits']
    if 'group_size' in metadata['quantization_config']:
        model_settings['groupsize'] = metadata['quantization_config']['group_size']
    if 'desc_act' in metadata['quantization_config']:
        model_settings['desc_act'] = metadata['quantization_config']['desc_act']
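So the failure path seems to be: bits is read out of config.json as a float, stored in wbits, and by the time update_model_parameters sees it again it has become the string '5.0' (the quotes in the ValueError show it's a string), which int() rejects. A quick illustration:

# bits is parsed from config.json as a float...
metadata = {"quantization_config": {"bits": 5.0}}
wbits = metadata["quantization_config"]["bits"]  # 5.0

# ...but update_model_parameters receives it back as a string:
value = str(wbits)  # '5.0'
try:
    value = int(value)
except ValueError as e:
    print(e)  # invalid literal for int() with base 10: '5.0'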
Is it autoselecting the Transformers loader for you? I have seen an issue where it chooses the wrong loader, but the one you picked still appears selected. Try flipping between exl2_hf and llama.cpp or something and see if it then respects the settings.
I am getting this issue too with the model I am trying to load. ExLlamav2_HF won't split across GPUs, but regular ExLlamav2 does.
The int() thing should be an obvious fix. You can't read fractional values into an int.
The problem is that GPTQ didn't use fractional bpw, and that's what it's for.
Well, EXL2 does, and that's also what it's for.
This issue appears to still be screwing up exl2. Any progress?
Just wanted to bump this. I'm experiencing the same behavior for exl2 models that are not quantized with a whole number of bits (i.e., 4.0 works fine, while 4.65 results in an OOM error on GPU0 only).
Describe the bug
Within the past week, I've noticed textgen webui sometimes ignores my GPU split string when loading a model with either ExLlamav2_HF or ExLlamav2. The issue doesn't affect all models, but it is consistent for the models it does affect. I think the models with issues have so far all been 70B or 103B Miqu merges.
I have made sure that both GPUs are visible to the OS and that neither of them is filtered through CUDA_VISIBLE_DEVICES or other directives that would stop textgen from using both cards.
What seems to happen is that textgen ignores the second GPU when loading the model. Say I specify a 20,24 GPU split. Textgen should load 20 GB onto Card 0 and then load the rest onto Card 1, which is the typical behavior. Instead, it maxes out Card 0 until it triggers an OOM error, without ever touching Card 1.
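For reference, the split string is just a comma-separated list of per-GPU VRAM budgets in GB. Roughly what a loader does with it (a sketch, assuming the exllamav2-style split):

# "20,24" -> [20.0, 24.0]: fill Card 0 up to 20 GB, then Card 1 up to 24 GB.
gpu_split = [float(x) for x in "20,24".split(",")]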
I am running the latest version of textgen (git pull shows everything up to date) along with the dependencies specified in requirements.txt.
I do set
PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync
prior to running server.py, but I have been doing that for about a month without this issue appearing until recently.

Is there an existing issue for this?
Reproduction
Attempt to load an ExLlamav2 model on a system with two or more GPUs. Textgen ignores the specified GPU split for some models.
Screenshot
No response
Logs
System Info