Open ansz42 opened 7 months ago
@ansz42 Sorry for the delay! Interesting, so using our new method actually makes it OOM? Weird
No worries at all! I appreciate your help.
It works well on Colab, but somehow WSL2 + Jupyter notebook causes an OOM. I suspect this might not be a usual OOM though, because RAM, shared VRAM and VRAM usage get stuck way below the available amount. It doesn't even try going over 24 GB before the OOM error. Let me know if you need me to run anything for troubleshooting.
@ansz42 Apologies, Llama-3 got the better of me! Hmmm, WSL, yeah, the shared RAM could be an issue - I'm unsure if WSL randomly restricts VRAM usage or something
I got the same error:
```
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
I am following this Colab notebook: https://colab.research.google.com/drive/1_yNCks4BTD5zOnjozppphh5GzMFaMKq_?usp=sharing#scrollTo=2ejIt2xSNKKp
Current memory stats:
```
GPU = NVIDIA RTX 6000 Ada Generation. Max memory = 47.988 GB.
6.135 GB of memory reserved.
```
*Parameters*
```
Unsloth - 2x faster free finetuning | Num GPUs = 1
Num examples = 1,131 | Num Epochs = 1
Batch size per device = 2 | Gradient Accumulation steps = 2
Total batch size = 4 | Total steps = 283
Number of trainable parameters = 167,772,160
```
I am using WSL + Jupyter notebook in VS Code.
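For context, this is roughly what the setup in those notebooks looks like (a sketch from memory; the exact model name, sequence length, and LoRA settings here are assumptions, not copied from the linked notebook):
```python
from unsloth import FastLanguageModel

# Sketch of the notebook-style setup; values below are placeholders.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",  # assumption: any 4-bit base model
    max_seq_length = 2048,
    dtype = None,          # auto-detect
    load_in_4bit = True,   # the flag discussed below
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    use_gradient_checkpointing = "unsloth",  # the option implicated in the WSL OOMs
)
```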
@noviljohnson Have you tried `load_in_4bit = False`?
Thanks, I'll try that.
But I resolved it by changing the parameter values to `per_device_train_batch_size = 2, gradient_accumulation_steps = 8`.
Thank you!
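For anyone else hitting this, here is a minimal sketch of where those two parameters go, assuming the usual TRL `SFTTrainer` + `TrainingArguments` setup from the notebooks (the exact `SFTTrainer` signature depends on your trl version; model, tokenizer and dataset are assumed to be defined as above):
```python
from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 8,   # the values reported to work above
    output_dir = "outputs",
)

trainer = SFTTrainer(
    model = model,               # assumes model/tokenizer/dataset already exist
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    args = training_args,
)
trainer.train()
```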
Running into the same issue.
Training with Unsloth (LLaMA-Factory) through WSL successfully spills over to system RAM with the CUDA Sysmem Fallback Policy enabled, allowing me to train a 16k context 4-bit QLoRA on a 10 GB RTX 3080.
After enabling `"use_gradient_checkpointing": "unsloth"`, it will always OOM. I noticed that it would even OOM in weird scenarios where a 2048 context size worked but 2047 resulted in OOM, even when there was enough VRAM available.
Disabling `use_gradient_checkpointing` works, but I would love to use it.
Edit 1
Tried `load_in_4bit = False`, no difference (except for a lot more memory usage, of course).
Edit 2
Fun fact: it is actually possible to train 16k context with `load_in_4bit = False` using Unsloth, as long as `"use_gradient_checkpointing": "unsloth"` is disabled. Extremely slow compared to 4-bit, but it works!
Edit 3
The last message in the stack trace points to this line (`/unsloth/models/_utils.py`, line 388):
```
saved_hidden_states = hidden_states.to("cpu", non_blocking = True)
```
Changing this to
```
saved_hidden_states = hidden_states.to("cpu", non_blocking = False)
```
stops the OOM, and it's now training. What are the consequences?
@danielhanchen I hope you don't mind the tag here, but this must be related to the issues we're experiencing.
Hey! I ran into this in WSL2 as well. I posted in the other thread (https://github.com/unslothai/unsloth/issues/600#issuecomment-2181298507), but I think this is due to pinned memory in WSL + CUDA... when you don't use the "unsloth" gradient checkpointing, I don't think it pins anything in memory (or anything massive). With 95 GB of RAM, for some reason WSL only allowed 210 MB of pinned memory. To turn it off you can just set `use_gradient_checkpointing = True` (to use, I guess, the HF one).
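If it helps with debugging, here is a rough probe (my own sketch, not an official diagnostic) to see how large a pinned (page-locked) host allocation WSL will actually grant:
```python
import torch

# Try increasingly large pinned host buffers; under WSL2 the limit can be
# surprisingly small compared to native Linux.
for mb in (64, 128, 256, 512, 1024, 2048, 4096):
    try:
        buf = torch.empty(mb * 1024 * 1024, dtype = torch.uint8, pin_memory = True)
        print(f"pinned allocation of {mb} MB: OK")
        del buf
    except RuntimeError as err:
        print(f"pinned allocation of {mb} MB: failed -> {err}")
        break
```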
@m0nsky OOO interesting, so `non_blocking = False` works?? Hmm, maybe I should make a new method called `"unsloth-wsl"` for WSL people, to use blocking calls. You will get some slowdowns sadly, since the transfer of activations to system RAM will now block the GPU.
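To illustrate the trade-off with a toy example (not Unsloth's actual offload code; shapes and names are made up):
```python
import torch

x = torch.randn(4096, 4096, device = "cuda")

# Destination buffer in pinned (page-locked) host memory - pinning is what
# makes a truly asynchronous device-to-host copy possible.
cpu_buf = torch.empty(x.shape, dtype = x.dtype, device = "cpu", pin_memory = True)

# Non-blocking: the copy is queued on the CUDA stream and can overlap with
# later GPU work, but cpu_buf must not be read before a synchronize.
cpu_buf.copy_(x, non_blocking = True)
torch.cuda.synchronize()

# Blocking: the host waits for the transfer to finish, so nothing overlaps
# with the copy, but the buffer is immediately safe to use.
cpu_buf.copy_(x, non_blocking = False)
```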
@vladrad Oh yes, using `use_gradient_checkpointing = True` uses the normal HF gradient checkpointing.
> @m0nsky OOO interesting, so `non_blocking = False` works?? Hmm, maybe I should make a new method called `"unsloth-wsl"` for WSL people, to use blocking calls. You will get some slowdowns sadly, since the transfer of activations to system RAM will now block the GPU.
I tried it last night and it seemed to be a lot slower indeed, not sure if it's going to be worth it. :(
I guess using `unsloth-wsl` would work for solving this issue. I am interested in seeing if we can make it work with `unsloth`. I'm curious what it's doing at that step; if you can elaborate, maybe I can poke around. I'm guessing it has to be the pinnable memory, since I have enough VRAM (A6000 Ada, 12 GB out of 48 used). I'm assuming the other option is HF or not using one at all. Is there an option to toss it into RAM vs pinnable RAM, and is there a difference? Some people using Unsloth have it working, but I think it's a combination of the model size and the training data size using the same small memory space.
Wondering if there has been an update, or if the author/other participants have found a way around it? 🙏
Edit: for me, when `use_gradient_checkpointing` is `unsloth`, I always get the same error as OP. Otherwise, `True`, `False`, `unsloth-wsl` all result in just going OOM. (I am using WSL2 with an A6000 48 GB and am trying out meta-llama-3.1-8b-instruct-4bit with 16_384 context; it gives me OOM with the other options and the OP error when using `unsloth`.)
Hi! I followed the conda installation and I am using a Jupyter notebook in WSL2. System: 32 GB RAM, RTX 3090 24 GB, Ryzen 5 5600X.
Error message: