JIBSIL opened this issue 5 months ago
@JIBSIL What's the generation speed without Unsloth in Kaggle? Also why 150/? Shouldn't it be len(output)/?
Ah yes, thank you for that; I originally was only generating 150 tokens. However, running the same code now yields this output:
```
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[6], line 3
      1 import time
----> 3 FastLanguageModel.for_inference(model) # Enable native 2x faster inference
      5 time_initial = time.time()
      7 inputs = tokenizer(
      8 [
      9     alpaca_prompt.format(
   (...)
     13     )
     14 ], return_tensors = "pt").to("cuda")

File /opt/conda/lib/python3.10/site-packages/unsloth/models/llama.py:1604, in FastLlamaModel.for_inference(model)
   1601     pass
   1603 # Also check if lm_head / embeddings are trained
-> 1604 lm_head = getattr(model, "model", model).lm_head.weight
   1605 device_type = lm_head.device.type
   1606 dtype = model.config.torch_dtype

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1688, in Module.__getattr__(self, name)
   1686 if name in modules:
   1687     return modules[name]
-> 1688 raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")

AttributeError: 'GemmaModel' object has no attribute 'lm_head'
```
Not sure what the issue could be. When I remove this line:
```python
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
```
it crashes the kernel.
@JIBSIL Whoops my bad - I think I fixed it now!
Great - seems to be working! But now the model is spamming the `<bos>` token when asked to make a longer generation. Here's a pastebin of the log, as GitHub won't let me post `<bos>` into a code block: https://pastes.io/pawpw2voog. However, Mistral works perfectly with the same prompt.
I also have the same problem with almost all of the models... I've put aside my interest in Unsloth until the community finds the issue and it's fixed.
@HirCoir Apologies, I was extremely busy this week so didn't have time to look at this! I'll see what I can do! @JIBSIL Also sorry I did not respond until now! This is after a finetune, right? Did you manage to add EOS during finetuning?
Also @HirCoir, you said this happens with non-Gemma models as well? Did you add EOS to the end of each prompt after finetuning? Are you calling a base model or the instruct versions (with or without finetuning)?
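For reference, a minimal sketch of what appending EOS during finetuning can look like - the dataset field names ("instruction", "input", "output") and `alpaca_prompt` are assumptions based on the usual Alpaca-style examples, not necessarily what was used here:
```python
# Sketch only: append the tokenizer's EOS token to every training example so the
# model learns where to stop. Field names below are assumed, not confirmed.
EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    texts = []
    for instruction, inp, output in zip(
        examples["instruction"], examples["input"], examples["output"]
    ):
        # Without the EOS token the model never sees an end-of-sequence marker
        # and can ramble or emit special tokens at inference time.
        texts.append(alpaca_prompt.format(instruction, inp, output) + EOS_TOKEN)
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)
```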
@danielhanchen - The issue seems to have fixed itself - it looks like it was a case of stale variables persisting in my notebook and meddling with something.
This may be a bit out of scope for the issue, but perhaps you would be able to help. I am testing batch generation using a set of variable-length texts, which can give a speedup of over 10x under the right circumstances. However, it seems to just be generating gibberish. I've tested a few strategies outlined here and in this PR, but it appears that the only real resources that exist for this are for GPT-2.
I am initializing like this:
```python
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-7b-it-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    device_map = "auto",
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
tokenizer.padding_side = "left"
```
and generating using this code:
```python
inputs = tokenizer(batch, return_tensors="pt", truncation=True, padding='longest').to("cuda")
outputs = model.generate(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'], max_new_tokens=32, do_sample=False)
output = tokenizer.batch_decode(outputs)
```
which yields this output:
```
<pad><pad>...
...prompt...
### Response:
Sure dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises
```
@JIBSIL Fixed batched inference yesterday (after your comment!!) See https://github.com/unslothai/unsloth/issues/267#issuecomment-2034047189 for more info. You'll need to update Unsloth without any dependency updates via
```
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git
```
for local installations (no need for Colab / Kaggle).
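(If it helps, a quick way to confirm which Unsloth build is actually active after the forced reinstall - a small sketch using only the standard library, nothing Unsloth-specific:)
```python
# Sketch: verify the installed Unsloth version after the reinstall.
from importlib.metadata import version

print(version("unsloth"))
```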
Looks like it works! However, when trying to repeatedly generate batches, a CUDA out of memory error eventually occurs:
```
... notebook runs for over an hour without issues ...
processing entry 46/300
97.97s - 18827tok - 192.16tok/s
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
processing entry 47/300
94.62s - 18313tok - 193.55tok/s
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
processing entry 48/300
97.0s - 18287tok - 188.53tok/s
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
processing entry 49/300
OutOfMemoryError: CUDA out of memory. Tried to allocate 214.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 157.06 MiB is free. Process 2338 has 14.59 GiB memory in use. Of the allocated memory 14.25 GiB is allocated by PyTorch, and 220.69 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
The offending code:
```python
import gc
import time
import torch

def generate_batch_sample(batch):
    time_initial = time.time()
    inputs = tokenizer(batch, return_tensors="pt", padding=True).to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=192, do_sample=True)
    output = tokenizer.batch_decode(outputs)
    time_final = time.time()

    chars = ""
    for generation in output:
        chars += f" {generation}"
    # Rough token estimate: ~4/3 tokens per whitespace-separated word
    num_tokens = round(len(chars.split(" ")) * (4/3))
    print(f'{round(time_final - time_initial, 2)}s - {num_tokens}tok - {round(num_tokens / (time_final - time_initial), 2)}tok/s')
    return output

max_generations = 300 # ~107s per generation, 9 hours
generations = []

with open('/kaggle/working/response_output_intermittent.txt', 'w') as outfile:
    for i in range(max_generations):
        print(f'processing entry {i}/{max_generations}')
        i_generation = generate_batch_sample(batches[i])
        torch.cuda.empty_cache()
        gc.collect()
        generations += i_generation
        for generation in i_generation:
            outfile.write(f'\n<start_response_sequence>{generation}<end_response_sequence>')
```
It's quite strange, because the code worked a few versions ago (albeit on the 2b version; perhaps the use of the 7b version is causing the error). Could there be a GPU memory leak?
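In case it's useful, here is a sketch of two mitigations, assuming the OOM comes from allocator fragmentation and lingering references rather than an actual leak: the allocator flag is the one the error message itself suggests, and the `del` before `empty_cache()` just makes sure the previous batch's tensors can actually be freed.
```python
# Sketch: mitigation ideas, assuming fragmentation / lingering references (not a confirmed leak).
import os
# Must be set before `import torch` anywhere in the process for the flag to take effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import gc
import torch

def generate_and_release(batch):
    inputs = tokenizer(batch, return_tensors="pt", padding=True).to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=192, do_sample=True)
    decoded = tokenizer.batch_decode(outputs)
    # Drop references to the GPU tensors before emptying the cache.
    del inputs, outputs
    gc.collect()
    torch.cuda.empty_cache()
    return decoded
```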
@JIBSIL Hmmm, could be I'm not freeing something - let me check
@JIBSIL Much apologies on the delay - it's possible there are small memory fragmentations over time, which will cause OOMs, but yes, it's possibly because you're doing 7B and not 2B now, so it gets more pronounced - I don't see a memory leak in my tests for now, i.e. the below uses around 20GB of VRAM on an L4:
```python
for _ in range(10):
    text = ["Hello my name is"]*128
    input_ids = tokenizer(text, return_tensors = "pt").to("cuda")
    x = model.generate(**input_ids, max_new_tokens = 16, use_cache = True, do_sample = True)
```
Then I used `nvidia-smi` to check for VRAM usage, and it doesn't look like there are any leaks (for now).
It could be intermittent, though unsure.
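(A possible alternative to eyeballing `nvidia-smi`: PyTorch's own allocator counters, which make a slow upward creep easier to spot over many iterations - a sketch of the same loop with the counters added, not exactly what was run above:)
```python
# Sketch: track allocator stats per iteration to spot a slow upward creep.
import torch

for i in range(10):
    text = ["Hello my name is"] * 128
    input_ids = tokenizer(text, return_tensors="pt").to("cuda")
    x = model.generate(**input_ids, max_new_tokens=16, use_cache=True, do_sample=True)
    alloc = torch.cuda.memory_allocated() / 1024**3   # currently allocated
    peak = torch.cuda.max_memory_allocated() / 1024**3 # peak since process start
    print(f"iter {i}: allocated={alloc:.2f} GiB, peak={peak:.2f} GiB")
```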
It was actually a problem on my end-- some of the texts in my test set were drastically longer than others. My bad.
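(For anyone hitting the same thing: one way to keep padded batch memory predictable is to group texts of similar tokenized length before batching. A sketch under the assumption of a flat list `texts` and an illustrative `batch_size` - both names are mine, not from the code above:)
```python
# Sketch: sort texts by tokenized length so each padded batch stays roughly uniform.
batch_size = 8  # assumed value; tune to the GPU
lengths = [len(tokenizer(t)["input_ids"]) for t in texts]
order = sorted(range(len(texts)), key=lambda i: lengths[i])
batches = [
    [texts[j] for j in order[i : i + batch_size]]
    for i in range(0, len(order), batch_size)
]
```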
Using the Kaggle 2x T4 GPU environment, 7B models show a considerable reduction in speed compared to running without Unsloth.
Output:
Any ideas of what I'm doing wrong?