turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Convert.py quantization abruptly failing without errors #466

Closed · engadine1997 closed this 1 month ago

engadine1997 commented 1 month ago

I've been trying to quantize the following 8x7B model (among other, smaller models), but it keeps stopping abruptly partway through without any warning or error message. I can resume the job each time it stops, and that eventually gets it to the stage of writing shards to disk, but it always fails after writing one or two shards. Reducing the shard size from 8192 MB to something like 2048 or 1024 gets more total GB of the weights written to disk, but it still fails before finishing.

For context, my system has 24 GB of VRAM and 32 GB of system memory, and I'm on the latest 0.0.21 release. I'd assume I might be running into OOM issues, but strangely no errors or exceptions appear, and I don't seem to be maxing out either memory.

(E:\AI\text-generation-webui\installer_files\env) E:\AI\text-generation-webui\exllamav2>python convert.py -i "E:\AI\quantization\input" -o "E:\AI\quantization\working" -cf "E:\AI\quantization\output" -m "E:\AI\quantization\measurement.json" -hb 6 -b 3.75
 -- Resuming job
 !! Note: Overriding options with settings from existing job
 -- Input: E:\AI\quantization\input
 -- Output: E:\AI\quantization\working
 -- Using default calibration dataset
 -- Target bits per weight: 3.75 (decoder), 6 (head)
 -- Max shard size: 8192 MB
 -- Full model will be compiled to: E:\AI\quantization\output
 -- Compiling output file...
 -- Writing shard 1...

(E:\AI\text-generation-webui\installer_files\env) E:\AI\text-generation-webui\exllamav2>

[screenshot: "error"]

Any help would be appreciated, as this has been driving me nuts.

linkage001 commented 1 month ago

Something similar was happening to me, and it turned out my system was unstable. I turned hyper-threading off and lowered the RAM clock and the GPU power limit, and after that it was OK.

engadine1997 commented 1 month ago

I tried lowering the power limit to 80% and turning hyper-threading off, but that didn't help. My RAM is already running slow because I have XMP disabled (the system won't boot with it enabled when all four DIMM slots are populated). Maybe it is the RAM, because I've had a lot of trouble with it being unstable in the past.

turboderp commented 1 month ago

Abruptly exiting with no error message almost always means you're running out of system memory. It may not look like it in Task Manager, but a single large allocation can kill the process before it ever shows up on the graph.
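
If you want to confirm that, one option is to poll the converter's memory use on a short interval and record the peak; a transient spike will show up there even if Task Manager's graph misses it. A minimal watchdog sketch using psutil (this isn't part of exllamav2, and the 50 ms interval is an arbitrary choice):

```python
# Poll a process's resident set size every 50 ms and track the peak,
# to catch allocation spikes too brief to appear in Task Manager.
import sys
import time

import psutil

def watch(pid: int, interval: float = 0.05) -> None:
    proc = psutil.Process(pid)
    peak = 0
    try:
        while proc.is_running():
            rss = proc.memory_info().rss
            peak = max(peak, rss)
            time.sleep(interval)
    except psutil.NoSuchProcess:
        pass  # target process exited between polls
    print(f"Process {pid} gone; peak RSS seen: {peak / 2**30:.2f} GiB")

if __name__ == "__main__":
    watch(int(sys.argv[1]))  # pass convert.py's PID
```

Run it in a second terminal against convert.py's PID; if the peak jumps by roughly the shard size right before the process dies, that's the culprit.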

Since it happens right as it's compiling the output, I would expect safetensors to allocate an extra 8 GB in one go, at some point during that process. That library is notoriously bad at memory management and I'm probably going to have to come up with some alternative way of producing .safetensors files, but I haven't gotten to it yet.
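
For a sense of where a single 8 GB allocation could come from: a shard writer hands every tensor in the shard to safetensors in one call, so if the library buffers the serialized bytes before writing them out, peak RAM grows with the shard size. An illustrative sketch (not exllamav2's actual writer; the tensor name, shape, and filename are made up):

```python
# Illustrative only: the whole shard is passed to save_file() in one
# call, so any internal buffering by safetensors is proportional to
# the shard size; that is why a smaller -ss reduces peak memory.
import torch
from safetensors.torch import save_file

shard = {
    # ...one entry per tensor assigned to this shard...
    "model.layers.0.mlp.down_proj.weight": torch.zeros(
        14336, 4096, dtype=torch.float16
    ),
}
save_file(shard, "output-00001-of-00002.safetensors")
```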

For the moment, you could try reducing the shard size to 2 GB with -ss 2048. To change the shard size of your existing job without restarting it, edit job_new.json and change the shard_size key (see the sketch below). Other things to try: upgrading or downgrading the safetensors package, or adding a swap file in Windows if you haven't already.
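
You can just edit the file by hand, but for reference, a one-off helper would look something like this (assuming job_new.json is plain JSON with a top-level shard_size key in MB and lives in your -o working directory; the path below is just your example path):

```python
# Lower shard_size in an existing job without restarting it.
# Assumes job_new.json is plain JSON with a top-level "shard_size"
# key (in MB), as described above.
import json

path = r"E:\AI\quantization\working\job_new.json"
with open(path) as f:
    job = json.load(f)

job["shard_size"] = 2048  # same effect as -ss 2048
with open(path, "w") as f:
    json.dump(job, f, indent=4)
```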

engadine1997 commented 1 month ago

Thanks for the help @turboderp! You were right about it running out of system memory: after upgrading to 64 GB of RAM, it writes all shards successfully.