turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Converted 120B model "killed" message appears and exits in Layer0. #321

Closed: sat0r1r1 closed this issue 6 months ago

sat0r1r1 commented 8 months ago

Hi, it's my first time using this

I followed exllamav2/doc/convert.md and ran into a problem while trying to quantize the miquella-120b model:

python convert.py \
    -i /mnt/models/alpindale_miquella-120b/ \
    -o /mnt/temp/exl2/ \
    -cf /mnt/models/alpindale_miquella-120b-exl2/3.0bpw/ \
    -b 3.0
 -- Resuming job
 !! Note: Overriding options with settings from existing job
 -- Input: /mnt/models/alpindale_miquella-120b/
 -- Output: /mnt/temp/exl2/
 -- Using default calibration dataset
 -- Target bits per weight: 3.0 (decoder), 6 (head)
 -- Max shard size: 8192 MB
 -- Full model will be compiled to: /mnt/models/alpindale_miquella-120b-exl2/3.0bpw/
 -- Measuring quantization impact...
 -- Layer: model.layers.0 (Attention)
 .
 .
 .
  -- Duration: 35.14 seconds
 -- Layer: model.layers.0 (MLP)
 -- model.layers.0.mlp.gate_proj                       0.05:3b_64g/0.95:2b_64g s4                         2.12 bpw
 -- model.layers.0.mlp.gate_proj                       0.1:3b_64g/0.9:2b_64g s4                           2.16 bpw
 -- model.layers.0.mlp.gate_proj                       0.1:4b_128g/0.9:3b_128g s4                         3.14 bpw
 -- model.layers.0.mlp.gate_proj                       0.1:4b_32g/0.9:3b_32g s4                           3.23 bpw
 -- model.layers.0.mlp.down_proj                       0.05:5b_32g/0.95:3b_32g s4                         3.23 bpw
 -- model.layers.0.mlp.down_proj                       0.05:5b_32g/0.95:4b_32g s4                         4.18 bpw
Killed

English is not my native language, so maybe I misunderstood something. Is it possible to quantize a 120B model on dual 3090s and 128 GB of RAM?

jostack commented 8 months ago

Same configuration here, but with NVLink:

neuron@neuron:~$ nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV4     N/A
GPU1    NV4      X      N/A

I'm also interested....

turboderp commented 8 months ago

A process ending with simply "Killed" usually means you've run out of system memory. Quantization, especially of 70B+ models, can require a large amount of system RAM. You can try increasing swap space, maybe?
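If you want to confirm it was the kernel OOM killer and then add swap, something along these lines should work on most Linux setups (just a sketch; the 64G size and /swapfile path are examples, adjust them for your disk):

    # check whether the OOM killer terminated the process
    sudo dmesg | grep -i "out of memory"

    # create and enable a 64 GB swap file (example size and path)
    sudo fallocate -l 64G /swapfile
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile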

sat0r1r1 commented 8 months ago

> A process ending with simply "Killed" usually means you've run out of system memory. Quantization, especially of 70B+ models, can require a large amount of system RAM. You can try increasing swap space, maybe?

Thanks for the reply, I think that's the problem: I had previously set WSL to use at most 8 GB of system memory. I'll change the setting and run it again. Thank you very much!
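For anyone else hitting this under WSL2: the memory cap lives in %UserProfile%\.wslconfig on the Windows side. As a rough sketch (the values are just examples for a 128 GB machine):

    [wsl2]
    memory=110GB
    swap=64GB

Then run wsl --shutdown and reopen the distro for the new limits to take effect.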

mortilla commented 8 months ago

Yes, it is possible to convert a 120B model with a 3090. I have done this with DiscoLM. I only have 32 GB of RAM and I don't think I even had swap enabled. Also, I just used the default settings.