Aliendydo opened this issue 8 months ago
@Aliendydo Oh my, that's not good - did you do this on a free Colab notebook? Do you know what your max_seq_length is?
Thanks for the quick reply! Paid Colab notebook, and I didn't change the max_seq_length, so it's still at the default 2048. I tried it again earlier today (no changes to the notebook), thinking it perhaps had something to do with Colab disconnecting its runtime for reasons on Google's side yesterday, and it's been going for 10 hours now (3 epochs takes about 15 hours for my dataset), so fingers crossed!
I'm not sure if this is about Google's runtimes being unreliable (I've finetuned LLMs on their runtimes for 10+ hours before), or whether it's something that crops up randomly in Unsloth. If you've seen the former happen before, I'll just close this and chalk it up to that.
@Aliendydo Interesting - ye it's entirely possible Colab is doing something :( Did it eventually complete? :)
It did complete! I'll chalk it up to Colab probably having some issues on the day I tested. I'll try to run again with per_device_train_batch_size = 1 too, since that's what the first failed run used (the two failed runs after that were with per_device_train_batch_size = 2, like the default), just to make sure it's not a problem with that. I'll update on whether that completes too, and if so I'll close. Thank you!
@danielhanchen Alright, an update: it failed again with per_device_train_batch_size=1, everything else equal. The way it fails is kind of odd: it gives an error, then apparently disconnects from the runtime, then continues running for another couple of steps (for about 20 minutes) until it finally stops. I'm pretty sure the batch size is at least very strongly correlated with this bug. Normally I'd say fine, let's run it at per_device_train_batch_size=2, but I'm looking to run this on a local GPU with only 8 GB of VRAM, which only works with per_device_train_batch_size=1. Any suggestions on tweaks to address this issue?
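One common way to fit bsz=1 on 8 GB of VRAM while keeping the same effective batch size is to raise gradient_accumulation_steps; a minimal sketch of the TrainingArguments change (the exact values here are illustrative assumptions, not settings from this thread):

```python
# Sketch only: keep the effective batch size of 8 (= 2 x 4 in the default config)
# while dropping the per-device batch size to 1 for an 8 GB GPU.
from transformers import TrainingArguments

args = TrainingArguments(
    per_device_train_batch_size = 1,  # fits more easily in 8 GB of VRAM
    gradient_accumulation_steps = 8,  # 1 * 8 = same effective batch as 2 * 4
    warmup_steps = 5,
    num_train_epochs = 3,
    learning_rate = 2e-4,
    logging_steps = 1,
    optim = "adamw_8bit",
    weight_decay = 0.01,
    lr_scheduler_type = "linear",
    seed = 3407,
    output_dir = "outputs",
)
```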
@Aliendydo Ok that is very very odd - especially the disconnection then reconnection - this is free Colab right - the paid one works fine?
Is the High RAM T4 fine? Or normal T4 the culprit?
The disconnection might be something Colab does to mitigate long training runs, i.e. they auto-terminate them. If the paid Colab works: it's no different from a free Colab on the GPU side, except that you can run it for much longer and there's more RAM.
This was all on paid Colab, and the same thing happened with a V100 GPU, which is why I don't think it's a peak-VRAM OOM error. The strange thing is that it worked fine for 13 hours with per_device_train_batch_size=2, so I also don't think it's Colab mitigating long runs, or that would have kicked in there too. I feel like it must somehow be related to per_device_train_batch_size=1, perhaps in combination with my dataset specifically? Though if it's the dataset, then I'm not sure why per_device_train_batch_size=2 works fine. Thanks for all the continued tech support btw! Much appreciated.
@Aliendydo Wait, so you're saying the run which succeeded was also on a paid Colab, whilst today / yesterday it failed?
Very very fascinating and weird
Ye I don't think it's related to bsz, as you mentioned bsz=2 could work fine for 13 hours.
Extremely weird - it could be some sort of weird intermittent bug maybe in Triton kernels or Xformers.
I'll try all sequence lengths as a test ie from 0 all the way to say 4096*2 and see if some random sequence length seems to be causing seg faults
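As a rough sketch of what such a sweep could look like (the model name, step size, and loop bounds below are assumptions, not the exact script used for the test):

```python
# Hypothetical smoke test: forward passes at many input lengths, watching for
# crashes / seg faults. Model name and loop bounds are illustrative assumptions.
import torch
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit",
    max_seq_length = 4096 * 2,
    load_in_4bit = True,
)

for seq_len in range(16, 4096 * 2 + 1, 16):
    ids = torch.randint(0, tokenizer.vocab_size, (1, seq_len), device = "cuda")
    with torch.no_grad():
        model(input_ids = ids)
    print(f"OK at sequence length {seq_len}")
```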
Yes exactly, all that we've discussed was on a paid colab. It seems that bsz=1 is virtually guaranteed to fail. Tried another overnight run and failed after a couple of hours again just last night. Thanks for the suggestion! I'll give different sequence lengths a shot
Interesting, very weird indeed - I'll do my internal investigations as well! So sorry this is happening - it's very weird indeed - and thanks for helping to debug! Appreciate it a lot!
No problems at all, really appreciate the support! For some more debugging pointers: sequence lengths 1024 and 1950 also failed after a few hours (on bsz=1). I'm going to try to run it on a local GPU tomorrow to see if the problem is somehow Colab (since there's no error given, the notebook just fails and then disconnects from its runtime).
An update on the local GPU run: it completed without issues on bsz=1! Only 1 epoch as I didn't have more time, but it seems like it might be a combination of bsz=1 and Google Colab specifically that's causing this behavior.
@Aliendydo Interesting! I also forgot to say I tested all sequence lengths from 0 all the way until 4096 on Colab for 6 hours - it seems to be OK with no seg faults on bsz=1.
If I had to guess, it might be a weird Colab issue - ie maybe by bad chance the GPU was broken somewhere, and Google hasn't noticed yet. But all a hunch
Sorry for the late reply, but thanks for your tests too! I think it might have something to do with my dataset in combination with bsz=1 and Colab. I can very consistently recreate the failure using that dataset and these settings, so I don't think it's a single-GPU issue. But I'll do some more tests on how my dataset differs from the example one after running the formatting function in the notebook. After inspecting the finished result of the local run, the training loss is very high and the QLoRA's outputs keep going until the max generation token length is reached, which is not expected behavior. When I ran this dataset on Axolotl I got a training loss that was 20x smaller and results that were much better (also not generating until running out of tokens). So it might have something to do with that, but that's just my hunch.

EDIT: The dataset seems to be formatted correctly; at least the output format is exactly the same as what comes out of the example dataset in the notebook. Still not sure what causes the endless generations in that case, because the EOS token seems to be placed as it should.
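For reference, the EOS placement being checked here follows the formatting function from the example notebook, roughly reproduced below (the prompt template and column names are those of the Alpaca example, so a custom dataset may differ; `tokenizer` and `dataset` come from earlier cells in the notebook):

```python
# Roughly the formatting pattern from the example Alpaca notebook: the key detail
# is appending EOS_TOKEN to every example so the model learns to stop generating.
# Column names ("instruction", "input", "output") follow the Alpaca example.
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token  # without this, generations run to the token limit

def formatting_prompts_func(examples):
    texts = []
    for instruction, inp, out in zip(examples["instruction"], examples["input"], examples["output"]):
        texts.append(alpaca_prompt.format(instruction, inp, out) + EOS_TOKEN)
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched = True)
```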
@Aliendydo No problems at all! Hmmmmmmmm very very weird indeed - tbh im stumped :(
I do patch the tokenizer, so maybe that's an issue - maybe try FastLanguageModel(..., fix_tokenizer = False)
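For anyone trying the same thing, the flag goes into the model load call; a minimal sketch, with the model name and other arguments as placeholders:

```python
# Sketch of passing fix_tokenizer = False when loading; the model name and the
# other arguments are placeholders, not the exact ones used in this thread.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit = True,
    fix_tokenizer = False,  # skip Unsloth's tokenizer patching
)
```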
Thank you, I'll give that a shot!
Alright, sorry for the late reply - some other projects took precedence. I tried the tokenizer fix and unfortunately it didn't work. What ended up working for me in the end with regards to the endless generations is Unsloth + LLaMA Factory! Super nice that it's implemented there so seamlessly. Weirdly enough, I sometimes still get failed runs there with Unsloth and my dataset (even at different bsz) that don't occur when I use LLaMA Factory without Unsloth. It might indeed have something to do with some specific tokens in my dataset, which is strange because it's a synthetic one generated by ChatGPT. So it remains a bit of a mystery for now. It's not a blocker on my end, but I'll leave this thread open in case anyone runs across something similar.
Hi there,
I'm testing out the Alpaca + Mistral 7b full example notebook, applying it to finetune with QLoRA on a custom dataset that I formatted using the example formatting function. For context, I've also successfully finetuned on this exact dataset in Axolotl, but I want to see if this is more efficient in terms of VRAM usage (so far, that definitely seems to be the case).
The only thing I've changed in the notebook is the dataset, and I set num_train_epochs to 3 to make it comparable to my earlier training run in Axolotl. This is my trainer code:

```python
import torch
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 3,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
```

Everything starts off fine. However, after training for about 0.43 epochs (about 3 hours), the notebook seems to fail without an error. I've tried using a bigger GPU (a V100) to rule out OOM errors (even though peak VRAM usage doesn't seem that high), but that doesn't help. I've tried running it 3 different times now, and the same thing happens every time. Any suggestions on what's going wrong?