unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Alpaca + Mistral 7b full example notebook fails without error after a few hours of running #178

Open Aliendydo opened 8 months ago

Aliendydo commented 8 months ago

Hi there,

I'm testing out the Alpaca + Mistral 7b full example notebook, applying it to finetune with QLoRA on a custom dataset that I formatted using the example formatting function. For context, I've also successfully finetuned on this exact dataset in Axolotl, but I want to see if this is more efficient in terms of VRAM usage (so far, it definitely seems to be the case).
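
For reference, my formatting follows the example notebook's pattern - roughly something like this sketch (the exact prompt template and column names are from memory and assume `tokenizer` and `dataset` are already loaded as in the notebook, so treat it as approximate):

```python
# Sketch of the Alpaca-style formatting from the example notebook, applied to my dataset.
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token  # appended so the model learns when to stop generating

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Append EOS_TOKEN, otherwise generations can run on until the token limit.
        texts.append(alpaca_prompt.format(instruction, input, output) + EOS_TOKEN)
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched = True)
```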

The only things I've changed in the notebook are the dataset and num_train_epochs, which I set to 3 to make the run comparable to my earlier training run in Axolotl. This is my trainer code:

```python
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 3,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
```

Everything starts off fine. However, after training for about 0.43 epochs (roughly 3 hours), the notebook seems to fail without an error. I've tried using a bigger GPU (a V100) to rule out OOM errors (even though peak VRAM usage doesn't seem that high), but that doesn't help. I've now run it 3 different times and the same thing happens every time. Any suggestions on what's going wrong?

danielhanchen commented 8 months ago

@Aliendydo Oh my, that's not good - did you do this on a free Colab notebook? Do you know what your max_seq_length is?

Aliendydo commented 8 months ago

Thanks for the quick reply! Paid Colab notebook, and I didn't change the max_seq_length, so it's still at the default 2048. I tried it again earlier today (no changes to the notebook), thinking that it perhaps had something to do with Colab disconnecting its runtime for reasons on Google's side yesterday, and it's been going for 10 hours now (3 epochs takes about 15 hours for my dataset), so fingers crossed!

I'm not sure if this is about Google's runtimes being unreliable (I've finetuned LLMs on their runtimes for 10+ hours before), or if it's something that crops up randomly in Unsloth? If you've seen the former happen before, I'll just close this and chalk it up to that.

danielhanchen commented 8 months ago

@Aliendydo Interesting - ye it's entirely possible Colab is doing something :( Did it eventually complete? :)

Aliendydo commented 8 months ago

It did complete! I'll chalk it up to Colab probably having some issues on the day I tested. I'll try to run again with per_device_train_batch_size = 1 too, as that was the first run I did when it failed (the two failed runs after that were with per_device_train_batch_size = 2, like the default), just to make sure it's not a problem with that. Will update on whether that completes too, and if so I'll close. Thank you!

Aliendydo commented 8 months ago

@danielhanchen Alright, so an update: it failed again with per_device_train_batch_size=1, everything else equal. The process of failing is kind of odd: it will give an error, then apparently disconnect from the runtime, then continue running for another couple of steps (for about 20 minutes) until it finally stops. I'm pretty sure the batch size is at least very strongly correlated with this bug. Normally I'd say fine, let's just run it at per_device_train_batch_size=2, but I'm looking to run this on a local GPU with only 8 GB of VRAM, which only works with per_device_train_batch_size=1. Any suggestions on tweaks to address this issue?
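
For what it's worth, the local config I have in mind just trades batch size for gradient accumulation to keep the effective batch size at 8 - a rough, untested sketch:

```python
import torch
from transformers import TrainingArguments

# Sketch of the local 8 GB config (untested): keep the effective batch size at 8
# by trading per-device batch size for gradient accumulation.
args = TrainingArguments(
    per_device_train_batch_size = 1,   # the only size that fits in 8 GB of VRAM
    gradient_accumulation_steps = 8,   # 1 * 8 = same effective batch as 2 * 4
    warmup_steps = 5,
    num_train_epochs = 3,
    learning_rate = 2e-4,
    fp16 = not torch.cuda.is_bf16_supported(),
    bf16 = torch.cuda.is_bf16_supported(),
    logging_steps = 1,
    optim = "adamw_8bit",
    weight_decay = 0.01,
    lr_scheduler_type = "linear",
    seed = 3407,
    output_dir = "outputs",
)
```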

danielhanchen commented 8 months ago

@Aliendydo Ok that is very very odd - especially the disconnection then reconnection - this is free Colab right - the paid one works fine?

Is the High RAM T4 fine? Or normal T4 the culprit?

The disconnection might be something Colab does to mitigate long training runs, i.e. they auto-terminate it. If the Paid Colab works - it's no different from a Free Colab on the GPU side, except that you can run it for much longer and there's more RAM.

Aliendydo commented 8 months ago

This was all on paid Colab, and the same thing happened with a V100 GPU, which is why I don't think it's a peak-VRAM OOM error. The strange thing is that it worked fine for 13 hours with per_device_train_batch_size=2, so I also don't think it's Colab mitigating long runs, or that would have kicked in there too. I feel like it must somehow be related to per_device_train_batch_size=1, perhaps in combination with my dataset specifically? Though if it's the dataset, then I'm not sure why per_device_train_batch_size=2 works fine. Thanks for all the continued tech support btw! Much appreciated.

danielhanchen commented 8 months ago

@Aliendydo Wait, so you're saying the run which succeeded was also on a Paid Colab, and it succeeded, whilst today / yesterday it failed?

Very very fascinating and weird

Ye I don't think it's related to bsz, as you mentioned bsz=2 could work fine for 13 hours.

Extremely weird - it could be some sort of weird intermittent bug maybe in Triton kernels or Xformers.

I'll try all sequence lengths as a test ie from 0 all the way to say 4096*2 and see if some random sequence length seems to be causing seg faults
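
Roughly something like this sweep (a sketch, not the exact script - it assumes the 4-bit model and tokenizer are already loaded as in the notebook):

```python
import torch

# Rough sketch of the sweep: run a forward pass at a range of sequence lengths
# and watch for any particular length that crashes or seg faults.
model.eval()
with torch.no_grad():
    for seq_len in range(1, 4096 * 2 + 1, 64):   # step of 64 to keep the sweep manageable
        input_ids = torch.randint(0, tokenizer.vocab_size, (1, seq_len), device = "cuda")
        _ = model(input_ids = input_ids)
        print(f"seq_len {seq_len}: OK")
```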

Aliendydo commented 8 months ago

Yes exactly, all that we've discussed was on a paid Colab. It seems that bsz=1 is virtually guaranteed to fail - I tried another overnight run just last night and it failed after a couple of hours again. Thanks for the suggestion! I'll give different sequence lengths a shot.

danielhanchen commented 8 months ago

Interesting, very weird indeed - I'll do my internal investigations as well! So sorry this is happening - it's very weird indeed - and thanks for helping to debug! Appreciate it a lot!

Aliendydo commented 8 months ago

No problem at all, really appreciate the support! Some more debugging pointers: sequence lengths 1024 and 1950 also failed after a few hours (on bsz=1). I'm going to try to run it on a local GPU tomorrow to see if the problem is somehow Colab (since there's no error given, the notebook just fails and then disconnects from its runtime).

Aliendydo commented 8 months ago

So, an update on the local GPU run: it completed without issues on bsz=1! Only 1 epoch, as I didn't have more time, but it seems like it might be the combination of bsz=1 and Google Colab specifically that's causing this behavior.

danielhanchen commented 8 months ago

@Aliendydo Interesting! I also forgot to say I tested all sequence lengths from 0 all the way until 4096 on Colab for 6 hours - it seems to be OK with no seg faults on bsz=1.

If I had to guess, it might be a weird Colab issue - i.e. maybe by bad chance the GPU was broken somewhere and Google hasn't noticed yet. But it's all a hunch.

Aliendydo commented 8 months ago

Sorry for the late reply, but thanks for your tests too! I think it might have something to do with my dataset in combination with bsz=1 and Colab. I can very consistently recreate the failure using that dataset and these settings, so I don't think it's a single broken GPU. I'll do some more tests on how my dataset differs from the example one after running the formatting function in the notebook, because after inspecting the finished result of the local run, the training loss is very high and the QLoRA model's outputs keep generating until the max generation token length is reached, which is not expected behavior. When I ran this dataset in Axolotl, I got a training loss that was 20x smaller and much better results (also not generating until running out of tokens). So it might have something to do with that, but that's just my hunch.

EDIT: The dataset seems to be formatted correctly - at least the output format is exactly the same as what comes out of the example dataset in the notebook. Still not sure what causes the endless generations in that case, because the EOS token seems to be placed as it should.
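
The check I did was roughly along these lines (a sketch - decoding a formatted sample and confirming the EOS token is actually there):

```python
# Rough version of the check: confirm the formatted text ends with the EOS token
# and that the tokenizer encodes it as the final token id.
sample = dataset[0]["text"]
print(sample[-80:])                          # should end with tokenizer.eos_token (e.g. "</s>")
print(sample.endswith(tokenizer.eos_token))  # expect True

ids = tokenizer(sample, add_special_tokens = False)["input_ids"]
print(ids[-1] == tokenizer.eos_token_id)     # expect True if the literal "</s>" maps back to the EOS id
```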

danielhanchen commented 8 months ago

@Aliendydo No problems at all! Hmmmmmmmm very very weird indeed - tbh im stumped :(

danielhanchen commented 8 months ago

I do patch the tokenizer, so maybe that's an issue - maybe try `FastLanguageModel(..., fix_tokenizer = False)`.
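
Roughly like this - a sketch only, with the model name and load arguments as in the example notebook; the relevant change is `fix_tokenizer = False`, and my assumption is that the flag is passed to `from_pretrained`:

```python
from unsloth import FastLanguageModel

# Sketch only: load as in the notebook, but skip Unsloth's tokenizer patching.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
    fix_tokenizer = False,
)
```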

Aliendydo commented 8 months ago

Thank you, I'll give that a shot!

Aliendydo commented 7 months ago

Alright, sorry for the late reply - I had some other projects that took precedence. I tried the tokenizer fix and unfortunately that didn't work. What ended up working for me in the end, with regards to the endless generations, is to use Unsloth + Llama Factory! Super nice that it's implemented there so seamlessly. Weirdly enough, I sometimes still get failed runs there when using Unsloth with my dataset (even at different bsz) that do not occur when I use Llama Factory without Unsloth. It might indeed have something to do with some specific tokens in my dataset, which is strange because it's a synthetic one generated by ChatGPT. So it remains a bit of a mystery for now. It's not a blocker on my end, but I'll leave this thread open in case anyone runs across something similar.