unslothai / unsloth

Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Slow Kaggle Performance (2x T4) #260

Open JIBSIL opened 5 months ago

JIBSIL commented 5 months ago

Using the Kaggle 2x T4 GPU environment, 7B models generate considerably slower with Unsloth than without it.

from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# use mistral 7B for this section

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    device_map = "auto"
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

import time

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

time_initial = time.time()

inputs = tokenizer(
[
    alpaca_prompt.format(
        "Generate a way to transform text in accordance with the input.", # instruction
        "Make 100 creative ways to transform text, and number them. For example: \"Convert this into a sea shanty\" or \"Rewrite this text as a presidential speech\". Do not generate fewer than 100 texts. Only mention transformations related to the text.", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=2048, use_cache=True, do_sample=True)
output = tokenizer.batch_decode(outputs)

print(output)

time_final = time.time()
print(f'{round(time_final - time_initial, 2)}s - {round(150 / (time_final - time_initial), 2)}tok/s')

Output:

...generated text...
58.29s - 2.57tok/s

Any ideas of what I'm doing wrong?

danielhanchen commented 5 months ago

@JIBSIL What's the generation speed without Unsloth in Kaggle? Also why 150/? Shouldn't it be len(output)/?
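
E.g. something roughly like this (untested sketch, reusing the inputs / outputs / timing variables from your snippet):

num_new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]  # count only generated tokens, not the prompt
elapsed = time_final - time_initial
print(f'{round(elapsed, 2)}s - {num_new_tokens}tok - {round(num_new_tokens / elapsed, 2)}tok/s')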

JIBSIL commented 5 months ago

> @JIBSIL What's the generation speed without Unsloth in Kaggle? Also why 150/? Shouldn't it be len(output)/?

Ah yes, thank you for that; I originally was only generating 150 tokens. However, running the same code now yields this output:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[6], line 3
      1 import time
----> 3 FastLanguageModel.for_inference(model) # Enable native 2x faster inference
      5 time_initial = time.time()
      7 inputs = tokenizer(
      8 [
      9     alpaca_prompt.format(
   (...)
     13     )
     14 ], return_tensors = "pt").to("cuda")

File /opt/conda/lib/python3.10/site-packages/unsloth/models/llama.py:1604, in FastLlamaModel.for_inference(model)
   1601 pass
   1603 # Also check if lm_head / embeddings are trained
-> 1604 lm_head = getattr(model, "model", model).lm_head.weight
   1605 device_type = lm_head.device.type
   1606 dtype = model.config.torch_dtype

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1688, in Module.__getattr__(self, name)
   1686     if name in modules:
   1687         return modules[name]
-> 1688 raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")

AttributeError: 'GemmaModel' object has no attribute 'lm_head'

Not sure what the issue could be. When I remove the line FastLanguageModel.for_inference(model), the kernel crashes.

danielhanchen commented 5 months ago

@JIBSIL Whoops my bad - I think I fixed it now!

JIBSIL commented 5 months ago

Great - Seems to be working! But now, the model is spamming the <bos> token when asked to make a longer generation-

Here's a pastebin of the log as Github won't let me post <bos> into a code block https://pastes.io/pawpw2voog

However Mistral works perfectly with the same prompt.

HirCoir commented 5 months ago

> Great - Seems to be working! But now, the model is spamming the <bos> token when asked to make a longer generation-
>
> Here's a pastebin of the log as Github won't let me post <bos> into a code block https://pastes.io/pawpw2voog
>
> However Mistral works perfectly with the same prompt.

I also have the same problem with almost all the models... I've put Unsloth aside until the community finds the issue and it gets fixed.

danielhanchen commented 5 months ago

@HirCoir Apologies - I was extremely busy this week, so I didn't have time to look at this! I'll see what I can do! @JIBSIL Also sorry I didn't respond until now! This is after a finetune, right? Did you manage to add EOS during finetuning?

Also @HirCoir, you said this happens with non-Gemma models as well? Did you add EOS to the end of each prompt during finetuning? Are you calling a base model or an instruct version (with or without finetuning)?
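
For reference, the usual pattern is to append the EOS token to every training example so the model learns when to stop - roughly like this (just a sketch; the instruction / input / output column names are placeholders for whatever your dataset uses):

EOS_TOKEN = tokenizer.eos_token  # e.g. "<eos>" for Gemma, "</s>" for Mistral

def formatting_prompts_func(examples):
    # Append EOS so the finetuned model learns to terminate instead of looping on tokens
    texts = [
        alpaca_prompt.format(instruction, inp, output) + EOS_TOKEN
        for instruction, inp, output in zip(
            examples["instruction"], examples["input"], examples["output"]
        )
    ]
    return {"text": texts}

# dataset = dataset.map(formatting_prompts_func, batched = True)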

JIBSIL commented 4 months ago

@danielhanchen - The issue seems to have fixed itself - it looks like persisted variables in my notebook were interfering with something.

This may be a bit out of scope for this issue, but perhaps you can help. I'm testing batch generation on a set of variable-length texts, which can give a speedup of over 10x under the right circumstances. However, it just generates gibberish. I've tested a few strategies outlined here and in this PR, but it appears the only real resources for this are for GPT-2.

I am initializing like this:

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-7b-it-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    device_map = "auto"
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

tokenizer.padding_side = "left"

and generating using this code:

inputs = tokenizer(batch, return_tensors="pt", truncation=True, padding='longest').to("cuda")

outputs = model.generate(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'], max_new_tokens=32, do_sample=False)

output = tokenizer.batch_decode(outputs)

which yields this output:

<pad><pad>...
...prompt...
### Response:

Sure dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises dises

danielhanchen commented 4 months ago

@JIBSIL Fixed batched inference yesterday (after your comment!!) See https://github.com/unslothai/unsloth/issues/267#issuecomment-2034047189 for more info. You'll need to update Unsloth without any dependency updates via pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git for local installations (no need for Colab / Kaggle)

JIBSIL commented 4 months ago

> @JIBSIL Fixed batched inference yesterday (after your comment!!) See #267 (comment) for more info. You'll need to update Unsloth without any dependency updates via pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git for local installations (no need for Colab / Kaggle)

Looks like it works! However, when trying to repeatedly generate batches, a CUDA out of memory error eventually occurs:


... notebook runs for over an hour without issues ...

processing entry 46/300
97.97s - 18827tok - 192.16tok/s
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
processing entry 47/300
94.62s - 18313tok - 193.55tok/s
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
processing entry 48/300
97.0s - 18287tok - 188.53tok/s
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
processing entry 49/300

OutOfMemoryError: CUDA out of memory. Tried to allocate 214.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 157.06 MiB is free. Process 2338 has 14.59 GiB memory in use. Of the allocated memory 14.25 GiB is allocated by PyTorch, and 220.69 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The offending code:

def generate_batch_sample(batch):
    time_initial = time.time()

    inputs = tokenizer(batch, return_tensors="pt", padding=True).to("cuda")

    outputs = model.generate(**inputs, max_new_tokens=192, do_sample=True)

    output = tokenizer.batch_decode(outputs)

    time_final = time.time()

    chars = ""

    for generation in output:
        chars += f" {generation}"

    num_tokens = round(len(chars.split(" ")) * (4/3))

    print(f'{round(time_final - time_initial, 2)}s - {num_tokens}tok - {round(num_tokens / (time_final - time_initial), 2)}tok/s')
    return output

max_generations = 300 # ~107s per generation, 9 hours
generations = []

with open('/kaggle/working/response_output_intermittent.txt', 'w') as outfile:
    for i in range(max_generations):
        print(f'processing entry {i}/{max_generations}')
        i_generation = generate_batch_sample(batches[i])
        torch.cuda.empty_cache()
        gc.collect()
        generations += i_generation

        for generation in i_generation:
            outfile.write(f'\n<start_response_sequence>{generation}<end_response_sequence>')

It's quite strange, because the code worked a few versions ago (albeit on the 2b version; perhaps the use of the 7b version is causing the error). Could there be a GPU memory leak?
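
In the meantime I'll try freeing the per-batch tensors explicitly and the allocator setting suggested in the error message - just a guess on my side, not sure it's the right fix:

import os
# Suggested by the OOM message; must be set before CUDA is initialised
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

def generate_batch_sample(batch):
    inputs = tokenizer(batch, return_tensors="pt", padding=True).to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=192, do_sample=True)
    decoded = tokenizer.batch_decode(outputs)
    # Drop the GPU tensors before the next batch so empty_cache() can actually release them
    del inputs, outputs
    torch.cuda.empty_cache()
    return decoded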

danielhanchen commented 4 months ago

@JIBSIL Hmm, could be I'm not freeing something - let me check.

danielhanchen commented 4 months ago

@JIBSIL Apologies for the delay - it's possible small memory fragmentations accumulate over time, which can cause OOMs, but it's also possibly because you're running 7B rather than 2B now, so the effect is more pronounced. I don't see a memory leak in my tests for now; e.g. the loop below uses about 20GB of VRAM on an L4:

for _ in range(10):
    text = ["Hello my name is"]*128
    input_ids = tokenizer(text, return_tensors = "pt").to("cuda")
    x = model.generate(**input_ids, max_new_tokens = 16, use_cache = True, do_sample = True)

Then I used nvidia-smi to check for VRAM usage, and it doesn't look like there are any leaks (for now).

It could be intermittent, though I'm unsure.
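
If it's easier than watching nvidia-smi, you could also print the allocator stats inside the loop to see whether anything actually grows between batches (rough sketch):

for step in range(10):
    text = ["Hello my name is"]*128
    input_ids = tokenizer(text, return_tensors = "pt").to("cuda")
    x = model.generate(**input_ids, max_new_tokens = 16, use_cache = True, do_sample = True)
    # If allocated memory keeps climbing across steps, something is being retained
    allocated_gb = torch.cuda.memory_allocated() / 1024**3
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"step {step}: allocated {allocated_gb:.2f} GiB, peak {peak_gb:.2f} GiB")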

JIBSIL commented 4 months ago

> @JIBSIL Apologies for the delay - it's possible small memory fragmentations accumulate over time, which can cause OOMs, but it's also possibly because you're running 7B rather than 2B now, so the effect is more pronounced. I don't see a memory leak in my tests for now; e.g. the loop below uses about 20GB of VRAM on an L4:
>
> for _ in range(10):
>     text = ["Hello my name is"]*128
>     input_ids = tokenizer(text, return_tensors = "pt").to("cuda")
>     x = model.generate(**input_ids, max_new_tokens = 16, use_cache = True, do_sample = True)
>
> Then I used nvidia-smi to check for VRAM usage, and it doesn't look like there are any leaks (for now).
>
> It could be intermittent, though I'm unsure.

It was actually a problem on my end-- some of the texts in my test set were drastically longer than others. My bad.
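
For anyone else hitting this: sorting prompts by tokenized length before batching keeps each batch's padding (and therefore peak memory) roughly similar. A rough sketch, where prompts and batch_size are just placeholders:

batch_size = 8  # placeholder; pick whatever fits in VRAM

# Group similar-length prompts together so no single batch pads out to the longest outlier
sorted_prompts = sorted(prompts, key = lambda p: len(tokenizer(p)["input_ids"]))
batches = [
    sorted_prompts[i : i + batch_size]
    for i in range(0, len(sorted_prompts), batch_size)
]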