vllm-project / llm-compressor

Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM

I encountered the same issue: torch._C._LinAlgError: linalg.cholesky #823

Open · okwinds opened this issue 2 weeks ago

okwinds commented 2 weeks ago
I encountered the same issue as reported in:

https://github.com/vllm-project/llm-compressor/issues/109
https://github.com/vllm-project/llm-compressor/issues/142

torch._C._LinAlgError: linalg.cholesky: The factorization could not be completed because the input is not positive-definite (the leading minor of order 17915 is not positive-definite).
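For context, here is a toy sketch (not llm-compressor internals) of why this error can appear and why dampening helps: GPTQ Cholesky-factorizes a Hessian-like matrix built from calibration activations, and an input channel the calibration data never exercises leaves that matrix singular. The matrix sizes and dampening value below are purely illustrative.

```python
import torch

# Hessian-like matrix from "calibration activations"; channel 3 is never
# activated, so H has a zero row/column and is not positive-definite.
X = torch.randn(64, 8)
X[:, 3] = 0.0
H = X.T @ X

try:
    torch.linalg.cholesky(H)
except torch._C._LinAlgError as err:
    print("Cholesky failed:", err)

# Adding a small multiple of the identity, scaled by the mean diagonal
# (roughly what dampening_frac controls), restores positive-definiteness.
damp = 0.01 * torch.diag(H).mean()
torch.linalg.cholesky(H + damp * torch.eye(8))
print("Cholesky succeeded after dampening")
```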

What I tried:

  1. Using enough calibration data (≥ 5k samples; model: Qwen2.5-7B-Instruct).
  2. Shuffling the data so the samples are in a different order.
  3. Adjusting dampening_frac, which allowed the quantization to complete, but running vLLM inference then crashed WSL (Ubuntu 22.04) with no exception information captured.

In the end, my solution was to switch back to the older version 0.1.0, which resolved the issue, although the quantization process is quite slow.

Datasets: belle_resampled_78_k_cn-train, ultrachat_200k, open-platypus, AI-MO_NuminaMath-CoT
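For reference, a minimal sketch of preparing a calibration split from one of the datasets listed above, matching the "enough samples" and "shuffle" steps; the Hub id, split name, and sample count are assumptions, not the exact setup used here.

```python
from datasets import load_dataset

# Illustrative calibration-set preparation (dataset id and split are assumptions).
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(5000))  # shuffle, then keep ~5k samples
print(ds)
```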

Unfortunately, without being able to reproduce the setup, I can't provide more help beyond these few suggestions to try:

  1. Use enough calibration data (≥ 512 samples; if possible, please try 2k, 3k, or 4k as well).
  2. Once you have enough calibration data, try shuffling it so the samples are in a different order.
  3. If steps 1 and 2 don't help, try gradually increasing dampening_frac. Be aware that this should be the last option, as increasing dampening_frac makes the GPTQ algorithm more similar to round-to-nearest quantization, which negatively impacts accuracy. (See the recipe sketch after this quote.)

Originally posted by @okwinds in https://github.com/vllm-project/llm-compressor/issues/142#issuecomment-2395811942
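For concreteness, a minimal one-shot sketch along the lines of the suggestions quoted above, assuming the GPTQModifier/oneshot interface shown in the llm-compressor examples; the scheme, sample count, dataset name, output path, and dampening_frac value are illustrative, not recommended settings.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Raise num_calibration_samples (and shuffle the data) first; only nudge
# dampening_frac upward as a last resort if Cholesky still fails.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    dampening_frac=0.01,
)

oneshot(
    model=model,
    dataset="ultrachat_200k",  # assumes the built-in ultrachat_200k registration
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained("./Qwen2.5-7B-Instruct-W4A16", save_compressed=True)
tokenizer.save_pretrained("./Qwen2.5-7B-Instruct-W4A16")
```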

markurtz commented 3 days ago

Hi @okwinds, can you provide the exact sample dataset so we can attempt to reproduce with the Qwen model? The dampening fraction is the correct pathway to trace down for issues like these. Did you test whether the quantized model was runnable as-is through Hugging Face before vLLM, and whether it produced sensible answers? It sounds like quantization completed correctly, so this may have been a different, unrelated crash that happened in vLLM rather than a Cholesky-decomposition problem.
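One quick way to answer that question is to load the quantized checkpoint with Transformers and generate a short completion before serving it with vLLM; a hedged sketch, assuming an output directory like the one in the recipe sketch above (the path and prompt are hypothetical).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "./Qwen2.5-7B-Instruct-W4A16"  # hypothetical quantized output path
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(MODEL_DIR, torch_dtype="auto", device_map="auto")

inputs = tokenizer("Briefly explain what GPTQ quantization does.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```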