Open HaniItani opened 1 year ago
Yeah, seems like a general problem with `max_new_tokens=2048`. #73 had the same problem. Try to reduce it.
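For intuition on why a big `max_new_tokens` makes this look like a hang: if the model never emits an EOS token, the decode loop has no early exit and runs all the way to the cap. A toy sketch (not the real `generate()` loop, just the stopping logic):

```python
# Toy sketch: generation stops early only on EOS; otherwise it runs to the cap.
def toy_generate(next_token_fn, eos_id, max_new_tokens):
    tokens = []
    for _ in range(max_new_tokens):
        tok = next_token_fn(tokens)
        tokens.append(tok)
        if tok == eos_id:  # the only early exit
            break
    return tokens

# A "model" that never produces EOS (id 2) always runs to the cap:
out = toy_generate(lambda toks: 7, eos_id=2, max_new_tokens=2048)
print(len(out))  # 2048 steps, every time
```

So lowering `max_new_tokens` doesn't fix the underlying issue, it just bounds how long each "hang" can last.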
Thank you for your suggestion @HideLord. I decreased `max_new_tokens` down to 10 since the task is boolean, and the behavior is less severe now but still there. Do you know if this is a problem with alpaca-lora or with alpaca in general?
@HaniItani I'm facing the same problem. I believe it's related to LoRA: if you remove
model = PeftModel.from_pretrained(model, LORA_WEIGHTS, torch_dtype=torch.float16)
prediction runs way faster. Maybe there are some steps we have to take, like merging the LoRA weights into the base model after fine-tuning, but don't take my word for it.
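For reference, "merging" here means folding the LoRA deltas into the base weights so inference no longer goes through the PEFT wrapper. A hedged sketch using peft's `merge_and_unload` (the model names and paths are placeholders, and this isn't a tested recipe):

```python
def merge_lora(base_model_name, lora_weights_dir, out_dir):
    """Fold LoRA adapter weights into the base model and save a plain
    Hugging Face checkpoint. Placeholder paths; sketch, not a tested recipe."""
    import torch
    from transformers import LlamaForCausalLM
    from peft import PeftModel

    base = LlamaForCausalLM.from_pretrained(base_model_name, torch_dtype=torch.float16)
    model = PeftModel.from_pretrained(base, lora_weights_dir, torch_dtype=torch.float16)
    merged = model.merge_and_unload()  # bakes the adapters into the base weights
    merged.save_pretrained(out_dir)
```

After this, the saved checkpoint loads with plain `from_pretrained` and no PEFT wrapper in the forward pass.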
Thank you for your suggestion @felri. Initially I was testing the exported Hugging Face checkpoint and it was hanging; then I went back to the LoRA model and saw the same behavior.
@felri, I returned max_token_length=512
but loaded the exported Hugging Face model without quantizing it to 8-bit. Inference still hangs sometimes and memory consumption goes well up to 33 GB, but it's less severe than with 8-bit. However, if I load the LoRA model without 8-bit quantization, the problem becomes much more severe.
Can someone with more knowledge confirm that merging the LoRA model could fix this issue? I think it's the missing piece here, @HaniItani.
@felri, it does not; I already tried it. I merged the LLaMA weights with the alpaca-lora weights and exported them to a Hugging Face model using the provided script. Same issue.
@HaniItani did you run it using llama.cpp or alpaca.cpp? I don't think it's going to matter that much, but maybe it will run faster? I'm running out of ideas; finetune.py and generate.py are pretty straightforward, so maybe it's something else.
Did you guys try to stream the output? Maybe it's just never stopping for some reason.
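By "stream the output" I mean printing tokens as they are produced instead of waiting for generate() to return, so you can see whether the model is stuck or just rambling. A minimal sketch with a stand-in token source (swap in your real decode loop and tokenizer):

```python
import sys

def stream_tokens(token_iter, decode_fn, eos_id):
    """Print each token as it arrives; return all generated ids."""
    ids = []
    for tok in token_iter:
        ids.append(tok)
        sys.stdout.write(decode_fn(tok))  # incremental output
        sys.stdout.flush()
        if tok == eos_id:
            break
    return ids

# Stand-in "model": emits ids 5, 6, then EOS (2)
ids = stream_tokens(iter([5, 6, 2]), decode_fn=lambda t: f"<{t}>", eos_id=2)
print()
print(ids)  # [5, 6, 2]
```

If the streamed text keeps flowing during a "hang", the model isn't stuck, it's just generating until the cap.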
Interesting... During the hangs, I noticed GPU usage was at 90% and never went down. Maybe it never stops generating.
@HideLord, I'm not sure what you mean by streaming the output; can you please elaborate? The model seems to be stuck in the forward pass.
@gururise , yes, and the VRAM usage increases significantly too.
I tried inference using the regular alpaca repo and it seems to work fine, slower than this implementation though.
I got a similar issue, and just by printing out output = tokenizer.decode(s)
I found out that it never stops generating until it hits max_token_length. Typically it generates the response correctly but then repeats the instructions and generates nonsense.
I assume it is a problem with decoding not stopping, due to missing EOS tokens.
I do not understand (yet) how the training handles padding and EOS.
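If I read finetune.py correctly, the tokenization step appends EOS only when the example is shorter than the cutoff length; truncated examples never get one, so the model can learn sequences that don't end in EOS. A simplified sketch of that pattern over raw token-id lists (ids and cutoff are illustrative values, not the repo's):

```python
EOS_ID, CUTOFF_LEN = 2, 256  # illustrative values, not the repo's config

def ensure_eos(ids, eos_id=EOS_ID, cutoff_len=CUTOFF_LEN):
    """Append EOS only when the example was not truncated at cutoff_len.
    Mirrors the pattern I see in finetune.py; simplified sketch."""
    if len(ids) < cutoff_len and (not ids or ids[-1] != eos_id):
        ids = ids + [eos_id]
    return ids

print(ensure_eos([10, 11, 12]))           # [10, 11, 12, 2]
print(len(ensure_eos([0] * CUTOFF_LEN)))  # 256 -- truncated, so no EOS appended
```

If many training examples hit the cutoff, the model sees few EOS tokens and may not learn to stop, which would match the symptoms above.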
Hello,
Thank you for sharing your work.
I'm interested in evaluating alpaca-lora on QA tasks. I started with the BoolQ dataset. I followed the generate.py script and constructed a prompt that works for BoolQ. I'm currently doing inference with batch_size=1. I noticed that inference sometimes hangs at random for more than 5 minutes on some samples without outputting any error. During these hangs the GPU memory usage climbs to 50 GB on an A100, whereas simple normal inference uses 10-11 GB. I encountered the behavior on both A100 and V100 GPUs. I made sure to use the latest version of the code. I attached my code to reproduce this behavior:
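(The attachment itself isn't reproduced here. For context only, a BoolQ prompt in the Alpaca instruction format might look like the following sketch; this is my illustration, not the attached code, and the exact wording is made up:)

```python
def boolq_prompt(passage, question):
    """Alpaca-style instruction prompt for a BoolQ example.
    Illustrative wording, not the prompt actually used in the attachment."""
    return (
        "Below is an instruction that describes a task, paired with an input "
        "that provides further context. Write a response that appropriately "
        "completes the request.\n\n"
        "### Instruction:\nAnswer the question with true or false based on the passage.\n\n"
        f"### Input:\nPassage: {passage}\nQuestion: {question}\n\n"
        "### Response:\n"
    )

p = boolq_prompt("Water boils at 100 C at sea level.",
                 "Does water boil at 100 C at sea level?")
print(p.endswith("### Response:\n"))  # True
```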