turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

convert.py ending with "Killed" at lm_head layer when converting zephyr-7b #179

Closed: Christopheraburns closed this issue 2 months ago

Christopheraburns commented 9 months ago

Discussed in https://github.com/turboderp/exllamav2/discussions/178

Originally posted by **Christopheraburns** November 24, 2023

Python 3.10.12 | Ubuntu 20.04 | CUDA 11.8 | NVIDIA RTX 4000 Ada (20 GB)

I am converting the new zephyr-7b (beta) with exllamav2 using the following command:

```
python exllamav2/convert.py \
    -i zephyr-7b-beta \
    -o quant \
    -c wikitext-test.parquet \
    -b 5.0
```

It seems to quantize fine, but at the lm_head layer the process ends with "Killed". If I then try to load the quantized model with the included test_inference.py, I get the error:

```
ValueError: ## Could not find lm_head.* in model
```

which makes sense given where the process was killed. How can I troubleshoot this further? Thanks!
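Since the process ends with a bare "Killed" rather than a Python traceback, one diagnostic (not from the original thread, just a common approach) is to confirm that the kernel's OOM killer terminated the converter and to watch how low available system RAM gets during the run. A minimal sketch assuming the third-party `psutil` package is installed; the polling interval and the wrapper itself are illustrative, not part of exllamav2:

```python
import subprocess
import sys
import time

import psutil  # third-party; pip install psutil


def run_with_memory_log(cmd, interval=5.0):
    """Run a command while periodically logging available system RAM.

    If the conversion is killed by the kernel OOM killer, the last few
    log lines show how close available memory got to zero.
    """
    proc = subprocess.Popen(cmd)
    try:
        while proc.poll() is None:
            vm = psutil.virtual_memory()
            print(f"available RAM: {vm.available / 1e9:.2f} GB "
                  f"({vm.percent:.0f}% used)", file=sys.stderr)
            time.sleep(interval)
    finally:
        if proc.poll() is None:
            proc.terminate()
    return proc.returncode


if __name__ == "__main__":
    # Hypothetical invocation mirroring the command from this issue.
    rc = run_with_memory_log([
        "python", "exllamav2/convert.py",
        "-i", "zephyr-7b-beta",
        "-o", "quant",
        "-c", "wikitext-test.parquet",
        "-b", "5.0",
    ])
    # A return code of -9 (SIGKILL) is consistent with an OOM kill.
    print(f"converter exited with code {rc}", file=sys.stderr)
```

After the fact, `dmesg` (or `journalctl -k`) showing an "Out of memory: Killed process" entry for the converter would confirm the same thing.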
turboderp commented 2 months ago

Sorry, I seem to have missed this. It's a system memory issue, the same as #504, and the culprit is memory-mapping in safetensors. I'm trying to see if I can rely less on that library when converting models. Please track it in the other issue.
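For anyone following along, here is a minimal sketch of the memory-mapped access pattern in safetensors that the comment refers to. This is not exllamav2's actual conversion code, and the file name is a placeholder:

```python
from safetensors import safe_open

PATH = "model.safetensors"  # hypothetical shard name

# safe_open memory-maps the file; individual tensors are materialized
# on access via get_tensor().
with safe_open(PATH, framework="pt", device="cpu") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)
        # ... process / quantize the tensor here ...
        del tensor  # drop the reference before reading the next tensor
```

How much of the mapped file and the materialized tensors ends up resident at the lm_head step is specific to the converter's internals, which is what the linked issue tracks.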