Open tgaddair opened 4 months ago
I suspect the issue is hardware- or environment-related. I haven't been able to reproduce it on an A100 yet.
Regardless, we do need more helpful error messages.
I am seeing this too when testing a QLoRA adapter tuned from a quantized model!
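import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# LoRA adapter config: all attention and MLP projections plus lm_head
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)

# 4-bit NF4 quantization with double quantization; compute in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)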
Any failure in SGMV comes back as
Request failed during generation: Server error: No suitable kernel. dtype=Half
From Discord:
Sounds like an error in the SGMV kernel that's being swallowed.
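For what it's worth, here is a rough sketch of what "not swallowing" the error could look like. This is not the server's actual code: `KernelDispatchError`, `run_kernel`, and `failing_kernel` are hypothetical names invented for illustration, and the failure message is made up. The idea is just to re-raise with exception chaining so the root cause stays attached instead of being replaced by a generic message.

```python
class KernelDispatchError(RuntimeError):
    """Hypothetical error type that keeps the original failure attached."""


def run_kernel(kernel, *args, name="sgmv"):
    """Call `kernel`; on failure, re-raise with context instead of hiding the cause."""
    try:
        return kernel(*args)
    except Exception as exc:
        # `raise ... from exc` stores the original exception in __cause__,
        # so its type, message, and traceback remain visible to the caller.
        raise KernelDispatchError(
            f"{name} kernel failed: {type(exc).__name__}: {exc}"
        ) from exc


def failing_kernel(x):
    # Stand-in for a real kernel call; the message is invented for this sketch.
    raise ValueError("unsupported shape")
```

With something like this, the response could include the underlying exception type and message (here `ValueError: unsupported shape`) rather than only a generic "No suitable kernel" string, and the original exception would still be reachable through `__cause__` for logging.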