unslothai / unsloth

Finetune Llama 3, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

CUDA_VISIBLE_DEVICES not functioning #660

Open xtchen96 opened 2 weeks ago

xtchen96 commented 2 weeks ago

I get this error message when trying to do supervised fine-tuning with 4x A100 GPUs. So the free version cannot be used on multiple GPUs?

RuntimeError: Error: More than 1 GPUs have a lot of VRAM usage. Please obtain a commercial license.

danielhanchen commented 2 weeks ago

Oh, currently Unsloth does not support multi-GPU, sorry - our enterprise plans have it for now. We're currently concentrating on adding Ollama support, Llama-3 bug fixes, all-model support, and more in the OSS version.

aflah02 commented 2 weeks ago

@danielhanchen Is there a way to run Unsloth on only 1 GPU when I have a 2-GPU node? I get the same error, and I want to use only 1 GPU since the model easily fits on it. I tried

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

But it did not work

erwe324 commented 2 weeks ago

Export it via the shell before running the Python script.

danielhanchen commented 2 weeks ago

Yep, you have to set the env variable before running Unsloth.
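
For anyone who prefers setting it inside Python instead of via the shell, here is a minimal sketch. The key point is that the mask must be set before torch or Unsloth initialize CUDA, i.e. before their imports; if that ordering cannot be guaranteed, the shell export is the safer option.

# Minimal sketch: set the visibility mask before any torch/unsloth import,
# since the mask is read when the CUDA runtime is initialized.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # expose only the first GPU

import torch                               # imported only after the mask is set
from unsloth import FastLanguageModel     # noqa: F401

print(torch.cuda.device_count())           # should report 1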

miary commented 1 week ago

Setting the env variable before running Unsloth still does not resolve the problem.

Used: export CUDA_VISIBLE_DEVICES=0 but it still comes up with the error: RuntimeError: Error: More than 1 GPUs have a lot of VRAM usage. Please obtain a commercial license.

Also used: export CUDA_VISIBLE_DEVICES=1 but same problem.

erwe324 commented 1 week ago

@danielhanchen I am confused, kindly help. The error is asking me to get a commercial license.

@miary what GPUs are you using and are they already running another job?

miary commented 1 week ago

@danielhanchen I have 2 GPUs, both RTX 3090. This runtime error about more than 1 GPU is a brand-new issue that came with Unsloth 2024.6.

I have a project that is using Unsloth 2024.5 and it works just fine.

It is completely fine if Unsloth wants to charge for environments with more than one GPU. However, the option should be given to use only one GPU, which is what setting the CUDA_VISIBLE_DEVICES env variable is supposed to do, but it's apparently broken. Looks like a really bad bug because it breaks the entire project.

danielhanchen commented 1 week ago

Hmm I shall investigate this hmmm.

How do you all call Unsloth? Via the terminal as a Python script? Via Jupyter?

Chirobocea commented 1 week ago

I am using a Python script and had the same issue while trying to run on GPU 1 (if I set the code to have visibility only of GPU 0, it works fine).

I am using these as the first lines in my main code:

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"   # match nvidia-smi device ordering
os.environ["CUDA_VISIBLE_DEVICES"] = "1"         # expose only GPU 1
os.environ["GRADIO_SHARE"] = "1"
os.environ["WORLD_SIZE"] = "1"

danielhanchen commented 1 week ago

@Chirobocea So do you use python train.py or like torchrun?

Chirobocea commented 1 week ago

Usually I use python train.py. However, I just tried to launch it with torchrun and it has the same issue. I also checked with the debugger that torch indeed sees only one GPU, which is re-indexed as id 0 for the running code, while during model loading it takes VRAM only from GPU 1 as expected (per nvidia-smi).
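
A quick way to check the same thing without a debugger (a small sketch, assuming CUDA_VISIBLE_DEVICES=1 was exported before launching the script):

# With CUDA_VISIBLE_DEVICES=1, torch should see exactly one GPU, re-indexed as
# cuda:0, while nvidia-smi shows the allocation landing on physical GPU 1.
import torch

print(torch.cuda.device_count())       # -> 1
print(torch.cuda.get_device_name(0))   # the single visible device (physical GPU 1)
x = torch.zeros(1, device="cuda:0")    # memory should appear on GPU 1 in nvidia-smi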

danielhanchen commented 1 week ago

Ok thanks for the info! Running in runpod to see what I can do! :)

danielhanchen commented 1 week ago

@miary @Chirobocea @aflah02 Just fixed it! Hopefully it now can work! Apologies on the issues! Please update Unsloth via

pip uninstall unsloth -y
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git

molander commented 1 week ago

RuntimeError: Error: More than 1 GPUs have a lot of VRAM usage. Please obtain a commercial license.

Thanks for all your work, btw! Killer project!

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 1,029 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 30
 "-____-"     Number of trainable parameters = 41,943,040
Traceback (most recent call last):
  File "/home/matto/projects/baby-code/workspace/unsloth-orpo.py", line 128, in <module>
    orpo_trainer.train()
  File "/home/matto/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
  File "<string>", line 226, in _fast_inner_training_loop
RuntimeError: Error: More than 1 GPUs have a lot of VRAM usage. Please obtain a commercial license.

molander commented 1 week ago

Can confirm it does not occur with unsloth-2024.5 but does with unsloth-2024.6. If necessary, one can downgrade via:

pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git@f9689b1

danielhanchen commented 1 week ago

@molander Do you know if my latest fix fixes stuff?

molander commented 1 week ago

@danielhanchen no, as soon as I uninstall and pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git, it comes back :(

miary commented 1 week ago

@miary @Chirobocea @aflah02 Just fixed it! Hopefully it now can work! Apologies on the issues! Please update Unsloth via

pip uninstall unsloth -y
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git

This patch did not solve the problem. Same error: RuntimeError: Error: More than 1 GPUs have a lot of VRAM usage. Please obtain a commercial license.

danielhanchen commented 1 week ago

Hmmm, weird - I tried it in Runpod with 4x GPUs and it worked. I shall retry fixing this! Sorry everyone about the issue!

danielhanchen commented 1 week ago

@miary @molander I updated the package again! Apologies on the issues!

I found the below to work (change 1 to any device id)

export CUDA_VISIBLE_DEVICES=1 && python train_file.py

Likewise torchrun also works with that approach.

Hope this works! Thank you for your patience!

Chirobocea commented 1 week ago

@miary @Chirobocea @aflah02 Just fixed it! Hopefully it now can work! Apologies on the issues! Please update Unsloth via

pip uninstall unsloth -y
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git

I tried this one hour ago and checked. It seems that the main problem is when I have something running on the other GPU as well. For example, if I have another job with another env running on GPU 0, I can't run Unsloth on GPU 1. The error is the same as before.

molander commented 1 week ago

Confirmed not working as intended. With nothing going on on GPU 1, it will not run, even though Num GPUs shows as 1 in the Unsloth banner below.

But on GPU 0, after I closed everything but Xorg, it worked.

So it would appear, at first glance, that it must use GPU 0, in which case you have a legitimate workaround - ticket closed, back to the real work ;)

Thank you for open-sourcing. I know that it takes big balls, and I assure you, it's worth it all the way around ;)

max_steps is given, it will override any value given in num_train_epochs
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 1,029 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 30
 "-____-"     Number of trainable parameters = 41,943,040

miary commented 1 week ago

@miary @molander I updated the package again! Apologies on the issues!

I found the below to work (change 1 to any device id)

export CUDA_VISIBLE_DEVICES=1 && python train_file.py

Likewise torchrun also works with that approach.

Hope this works! Thank you for your patience!

@danielhanchen Just wanted to confirm that your patch, combined with export CUDA_VISIBLE_DEVICES=1, works!!! Thanks for all the good work, greatly appreciated!

danielhanchen commented 1 week ago

@miary Great it worked!

@molander Thanks, glad there's a workaround - I'll see what I can do. So export CUDA_VISIBLE_DEVICES=1 && python train_file.py still does not work? Do you use torchrun, python, or accelerate?

molander commented 1 week ago

@danielhanchen Good to go here! I made a new conda env and conda installed pytorch, transformers, etc and it's working like a mule at the grand canyon! Thanks!

aflah02 commented 1 week ago

Thanks @danielhanchen!!

Arvinzaheri commented 1 week ago

Hi @danielhanchen,

I'm experiencing a similar issue. I want to load Llama3 on two separate A6000 GPUs. When I set CUDA_VISIBLE_DEVICES to "0,1", it works fine on CUDA0. However, when I try to load the model on CUDA1 and generate responses, it fails. I want to have the full model on each GPU, not distribute it across them. Any suggestions on how to resolve this?
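
One way to get a full, independent copy of the model on each GPU, following the same workaround discussed above, is to launch one process per GPU, each with its own CUDA_VISIBLE_DEVICES mask. A rough sketch (train_file.py is the placeholder script name used earlier in this thread):

# Hypothetical launcher: one independent process per GPU, each seeing only
# its own device, so each process loads the full model by itself.
import os
import subprocess

procs = []
for gpu_id in ("0", "1"):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu_id)  # mask to a single GPU
    procs.append(subprocess.Popen(["python", "train_file.py"], env=env))

for p in procs:
    p.wait()  # wait for both independent runs to finish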

user799595 commented 1 week ago

This still isn't working for me. @danielhanchen can you please remove the exception that is raised when more than one GPU has over 4 GB of memory in use?

@molander Did you do anything custom on your end? Looking at the main branch, the code is still there.

vvatter commented 1 week ago

I'm on a node with multiple GPUs, but I only have one in CUDA_VISIBLE_DEVICES.

The issue I'm having is with these lines in the patch_sft_trainer_tokenizer() function of tokenizer_utils.py: https://github.com/unslothai/unsloth/blob/933d9fe2cb2459f949ee2250e90a5b610d277eab/unsloth/tokenizer_utils.py#L961-L970

The check for multiple GPUs here is really a count of how many GPUs on the node are using more than 4 GB of memory. This is going to fail for anyone on a busy shared node.
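
For reference, the kind of check being described looks roughly like the sketch below (illustrative only, not the actual Unsloth code; the use of pynvml and the exact 4 GB threshold are assumptions). It counts every physical GPU with heavy VRAM usage, regardless of CUDA_VISIBLE_DEVICES, which is why a busy shared node trips it:

# Illustrative sketch: count physical GPUs whose used memory exceeds ~4 GB.
import pynvml

def gpus_with_heavy_vram_usage(threshold_bytes: int = 4 * 1024**3) -> int:
    pynvml.nvmlInit()
    try:
        count = 0
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            if pynvml.nvmlDeviceGetMemoryInfo(handle).used > threshold_bytes:
                count += 1
        return count
    finally:
        pynvml.nvmlShutdown()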

I removed that check, and a similar check in llama.py: https://github.com/unslothai/unsloth/blob/933d9fe2cb2459f949ee2250e90a5b610d277eab/unsloth/models/llama.py#L1198-L1207

Then I was able to run unsloth on my node.

danielhanchen commented 2 days ago

Many apologies for the delay! My brother and I just relocated to SF, so I only just got back to GitHub issues!

As per the discussion here, I will instead convert it to a warning telling people that Unsloth is not yet functional for multi-GPU setups, and will still allow the finetuning process to go through (especially for shared servers).
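
A minimal sketch of what that change could look like (hypothetical names; not the actual patch):

# Hypothetical sketch: downgrade the hard multi-GPU error to a warning so
# training can continue, e.g. on shared servers where other jobs occupy the
# remaining GPUs.
import warnings

def check_multi_gpu_usage(num_busy_gpus: int) -> None:
    if num_busy_gpus > 1:
        warnings.warn(
            "Multiple GPUs show significant VRAM usage. Unsloth does not yet "
            "support multi-GPU finetuning; proceeding on a single GPU.",
            stacklevel=2,
        )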