predibase / lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
https://loraexchange.ai
Apache License 2.0

Improve error handling in SGMV kernels #322

Open tgaddair opened 4 months ago

tgaddair commented 4 months ago

Any failure in the SGMV kernels comes back as `Request failed during generation: Server error: No suitable kernel. dtype=Half`.

From Discord:

> I have tried the fine-tuned adapter for llama2-7b. I trained the model on the Predibase page, downloaded the adapter, and uploaded it to https://huggingface.co/marekk/Lemma-Llama-2-7b-Adapter/tree/main. Now I am trying to load this adapter on llama2-7b, but quantized. My args are: `["--model-id", "meta-llama/Llama-2-7b-hf", "--quantize", "bitsandbytes-fp4", "--max-batch-prefill-tokens", "1024"]`. The model without the adapter works fine, but when I try to use the adapter I get `Request failed during generation: Server error: No suitable kernel. dtype=Half`. Is there any way to use an adapter on a quantized version of the model?
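In other words, the failing call is just a standard generate request that routes through the adapter. A minimal repro sketch, assuming the server launched with the args above is exposed on localhost:8080 and using the adapter ID from the report:

```python
import requests

# Repro sketch: base model launched with "--quantize bitsandbytes-fp4",
# then a generate request that routes through the LoRA adapter.
resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "What is the capital of France?",
        "parameters": {
            "max_new_tokens": 64,
            "adapter_id": "marekk/Lemma-Llama-2-7b-Adapter",
        },
    },
)
print(resp.status_code, resp.json())
```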

Sounds like an error in the SGMV kernel that's being swallowed.

tgaddair commented 4 months ago

Suspect the issue may be hardware- or environment-related. Haven't been able to repro on an A100 yet.

Regardless, we do need more helpful error messages.
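For example, the kernel lookup could be wrapped so that a miss reports what was actually requested. A rough sketch only, with `find_sgmv_kernel` as a hypothetical stand-in for the real dispatch logic (these names are illustrative, not the actual bindings):

```python
import torch

# Illustrative sketch: surface dtype/device details when no SGMV kernel
# matches, instead of the bare "No suitable kernel" message.
def dispatch_sgmv(x, weights, dtype, device):
    kernel = find_sgmv_kernel(dtype, device)  # hypothetical lookup helper
    if kernel is None:
        cc = torch.cuda.get_device_capability(device)
        raise RuntimeError(
            f"No suitable SGMV kernel for dtype={dtype} on {device} "
            f"(compute capability {cc[0]}.{cc[1]}); check that the adapter "
            f"weights and the (possibly quantized) base model use a supported dtype."
        )
    return kernel(x, weights)
```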

SamComber commented 3 months ago

I am seeing this too when testing a QLoRA adapter tuned from a quantized model!

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# LoRA adapter config: low-rank updates on the attention and MLP
# projections plus the LM head.
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)

# 4-bit NF4 quantization config for the base model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```