Closed mriganktiwari closed 4 months ago
Also, when I use the HF model with the replaced BitLinear layers, generation isn't working.
`.generate` with the Llama2 model completes generation in ~68 seconds:

```python
import time

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name, token='xxxx')
model = AutoModelForCausalLM.from_pretrained(model_name, token='xxxx')

text = "Tell me about Boxing day significance."
tokenizer.pad_token = tokenizer.eos_token
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)

start = time.time()
outputs = model.generate(inputs.input_ids, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)
print(f"time for generation: {time.time() - start}")

# swap nn.Linear layers for BitLinear, then time generation again
replace_linears_in_hf(model)

start = time.time()
outputs = model.generate(inputs.input_ids, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)
print(f"time for generation: {time.time() - start}")
```
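For context, a minimal sketch of what a layer-replacement helper like `replace_linears_in_hf` might do (this is not the repo's actual code; the `BitLinear` stand-in below applies BitNet b1.58-style absmean ternary quantization on the fly, an assumption for illustration):

```python
import torch
import torch.nn as nn


class BitLinear(nn.Linear):
    """Hypothetical stand-in: quantizes weights to {-1, 0, +1} at forward time.

    Note the weights are still stored and multiplied as floats, so neither
    memory footprint nor speed improves with this naive version.
    """

    def forward(self, x):
        scale = self.weight.abs().mean().clamp(min=1e-5)
        w_q = (self.weight / scale).round().clamp(-1, 1) * scale
        return nn.functional.linear(x, w_q, self.bias)


def replace_linears(module: nn.Module) -> None:
    """Recursively swap every nn.Linear submodule for a BitLinear, reusing weights."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear) and not isinstance(child, BitLinear):
            new = BitLinear(child.in_features, child.out_features,
                            bias=child.bias is not None)
            new.weight = child.weight
            if child.bias is not None:
                new.bias = child.bias
            setattr(module, name, new)
        else:
            replace_linears(child)
```

A usage sketch: `replace_linears(model)` mutates the module tree in place, after which every former `nn.Linear` runs the quantized forward pass.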
I had a quick look at this repo. In the current state of the code, the binarized weights are still floats, which would explain your observation. It also still performs weight multiplication instead of additions/subtractions, so it does not take advantage of the multiplication-free operator in BitNet 1.58. That said, performance (and potential bugs) aside, the results should be identical to BitNet 1.58. Nice to see such attempts!
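The add/subtract point above can be illustrated with a small sketch (illustrative only, not the repo's code): once weights are restricted to {-1, 0, +1}, a matrix product reduces to sums and differences of the selected inputs, with no multiplications.

```python
import torch

x = torch.randn(2, 4)
# ternary weight matrix, values in {-1, 0, +1}
w = torch.tensor([[1., 0., -1., 1.],
                  [0., -1., 1., 0.]])

ref = x @ w.t()  # ordinary float matmul

# multiplication-free equivalent: for each output row, add the inputs
# where the weight is +1 and subtract those where it is -1
add_sub = torch.stack(
    [x[:, row == 1].sum(dim=1) - x[:, row == -1].sum(dim=1) for row in w],
    dim=1,
)

assert torch.allclose(ref, add_sub, atol=1e-6)
```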
Stale issue message
Describe the bug
When I replace the Linear layers of a HF model (say Llama2-7b-chat) with BitLinear layers, the model size is the same for both. Shouldn't the size be reduced after replacing with BitLinear?
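One way to see why the size stays the same: a tensor's memory footprint is its element count times its element size, so quantizing values to {-1, 0, +1} shrinks nothing while the tensor is still stored as float32. A sketch (the `size_bytes` helper is made up for illustration):

```python
import torch
import torch.nn as nn

lin = nn.Linear(1024, 1024, bias=False)


def size_bytes(t: torch.Tensor) -> int:
    return t.numel() * t.element_size()


ternary_float = torch.sign(lin.weight.detach())  # values in {-1, 0, 1}, still float32
ternary_int8 = ternary_float.to(torch.int8)      # same values, 1 byte each

assert size_bytes(ternary_float) == size_bytes(lin.weight)       # no reduction
assert size_bytes(ternary_int8) == size_bytes(lin.weight) // 4   # 4x smaller
```

Only re-storing (or packing) the ternary values in a narrower dtype actually reduces the footprint; keeping them in the original float tensor does not.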