Seems like this is the fix:
pip install transformers==4.28.1
https://github.com/LianjiaTech/BELLE/issues/226
Maybe that needs to be in the requirements.txt?
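If it does go in, the pin would just be a one-line entry in requirements.txt (assuming the usual one-package-per-line format):
transformers==4.28.1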
I'm getting an out-of-memory error on the 4-bit 30B model with dual 3090s; any advice is appreciated.
Did you use gradient checkpointing?
I did not think I would need to, with dual-3090s. I will try that.
That didn't fix it.
my command line: torchrun --nnodes=1 --nproc-per-node=2 finetune.py --grad_chckpt --groupsize 128 --cutoff_len 2048 --llama_q4_model ./llama-30b-4bit-128g.safetensors --llama_q4_config_dir ./llama-30b-4bit/ --lora_out_dir ./alpaca1337-30b-4bit/ ./leet10k-alpaca-merged.json
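For context, here is a generic sketch of enabling gradient checkpointing on a stock Hugging Face causal LM; the --grad_chckpt flag in the command above is assumed to do something equivalent internally, and the gpt2 checkpoint is only a placeholder:
from transformers import AutoModelForCausalLM

# Recompute activations during the backward pass to trade compute for memory.
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.gradient_checkpointing_enable()
model.config.use_cache = False  # the KV cache is incompatible with checkpointing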
$ nvidia-smi nvlink --status
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-eee9504e-62e0-2531-1c63-a7521e83ec0e)
Link 0: 14.062 GB/s
Link 1: 14.062 GB/s
Link 2: 14.062 GB/s
Link 3: 14.062 GB/s
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-91787ba8-6958-941d-1cc7-369dfe15ba06)
Link 0: 14.062 GB/s
Link 1: 14.062 GB/s
Link 2: 14.062 GB/s
Link 3: 14.062 GB/s
$ python
Python 3.10.11 (main, Apr 20 2023, 19:02:41) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.device_count()
2
>>> torch.cuda.get_device_name(1)
'NVIDIA GeForce RTX 3090'
>>> torch.cuda.get_device_name(0)
'NVIDIA GeForce RTX 3090'
>>> quit()
Error message:
File "/home/eric/git/alpaca_lora_4bit/finetune.py", line 179, in <module>
trainer.train()
File "/home/eric/miniconda3/envs/al4b/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
return inner_training_loop(
File "/home/eric/miniconda3/envs/al4b/lib/python3.10/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/eric/miniconda3/envs/al4b/lib/python3.10/site-packages/transformers/trainer.py", line 2709, in training_step
self.scaler.scale(loss).backward()
File "/home/eric/miniconda3/envs/al4b/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/home/eric/miniconda3/envs/al4b/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/eric/miniconda3/envs/al4b/lib/python3.10/site-packages/torch/autograd/function.py", line 274, in apply
return user_fn(self, *args)
File "/home/eric/miniconda3/envs/al4b/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 157, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/home/eric/miniconda3/envs/al4b/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 372.00 MiB (GPU 0; 24.00 GiB total capacity; 21.49 GiB already allocated; 0 bytes free; 23.10 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
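Following the allocator hint at the end of that traceback, one thing worth trying before cutting the context length is capping split sizes; the 128 MiB value here is illustrative, not a recommendation from this thread:
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 torchrun --nnodes=1 --nproc-per-node=2 finetune.py --grad_chckpt --groupsize 128 --cutoff_len 2048 --llama_q4_model ./llama-30b-4bit-128g.safetensors --llama_q4_config_dir ./llama-30b-4bit/ --lora_out_dir ./alpaca1337-30b-4bit/ ./leet10k-alpaca-merged.json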
Changing --cutoff_len to 1024 got it to work, but my single 4090 could do that. Shouldn't dual 3090s be able to handle 2048?
Use load_llama_model_4bit_low_ram_and_offload in finetune.py and set the memory limit to {0: '9Gib', 1: '9Gib', 'cpu': '24Gib'}. And try running it without torchrun.
Thanks, I'll try that.
I modified finetune.py as follows:
# Load Basic Model
model, tokenizer = load_llama_model_4bit_low_ram_and_offload(
    ft_config.llama_q4_config_dir,
    ft_config.llama_q4_model,
    # device_map=ft_config.device_map,
    groupsize=ft_config.groupsize,
    is_v1_model=ft_config.v1,
    max_memory={0: "9Gib", 1: "9Gib", "cpu": "24Gib"},
)

# Config Lora
lora_config = LoraConfig(
    r=ft_config.lora_r,
    lora_alpha=ft_config.lora_alpha,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=ft_config.lora_dropout,
    bias="none",
    task_type="CAUSAL_LM",
)
Then I run it like this:
python finetune.py --grad_chckpt --resume_checkpoint WizardLM1337-13b-4bit/checkpoint-8950/ --groupsize 128 --xformers --cutoff_len 2048 --llama_q4_model ./llama-13b-4bit-128g.safetensors --llama_q4_config_dir ./llama-13b-4bit/ --lora_out_dir ./WizardLM1337-13b-4bit/ ./leet10k-WizardLM-merged.json
And it's working, but I see that only one GPU is utilized.
Yes, because the model is split across the 2 GPUs and the computation cycle is sequential, it can only utilize 1 GPU at a time.
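To illustrate why (a toy sketch, not this repo's actual code): with naive model parallelism each stage waits on the previous stage's output, so the devices take turns.
import torch
import torch.nn as nn

# Two halves of a model pinned to different devices.
stage0 = nn.Linear(1024, 1024).to("cuda:0")
stage1 = nn.Linear(1024, 1024).to("cuda:1")

x = torch.randn(8, 1024, device="cuda:0")
h = stage0(x)        # GPU 1 idles while GPU 0 computes
h = h.to("cuda:1")   # activations hop between devices
y = stage1(h)        # GPU 0 idles while GPU 1 computes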
@johnsmith0031 would this mean I cannot use your library to train with 2 GPUs at 2x speed, or 4 GPUs at 4x speed? I have 4x 3090s, but it would be a bummer to get no speed increase when training in 4-bit.
Is this library incompatible with pipeline parallelism, then? Sorry for the newbish questions.
What would you suggest, then, to make full use of my GPUs for faster training: 8-bit with HF PEFT, or another library like DeepSpeed?
Since the finetuning uses the Trainer from transformers, I think it supports it natively. I'm just not sure how to run it; maybe using the accelerate launcher?
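A hedged sketch of such a launch, reusing the arguments from the torchrun command earlier in this thread; whether finetune.py runs unmodified under the launcher is exactly the open question here:
accelerate launch --multi_gpu --num_processes 2 finetune.py --grad_chckpt --groupsize 128 --cutoff_len 2048 --llama_q4_model ./llama-30b-4bit-128g.safetensors --llama_q4_config_dir ./llama-30b-4bit/ --lora_out_dir ./alpaca1337-30b-4bit/ ./leet10k-alpaca-merged.json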
Is torchrun a new command? I successfully ran on 2x 3090 as of about 1.5 weeks ago using a different command, and have updated since.
@johnsmith0031 would this mean I would effectively get 2x speed if I do not spread the model across multiple GPUs? I would like a way to have both GPUs computing at the same time, but from what I can tell, due to the sequential computation, that's not happening. Correct me if I'm wrong, as I'm just beginning to grasp the concept. Thank you.
@tensiondriven were you able to get faster training times with 2x 3090s vs. the time it would take on one? I saw your comment in another issue mentioning the power draw was only for one GPU, so I'm confused how the training is more efficient if it's only using one at a time.
@johnsmith0031 if you have any explanation to add to what I just said, I'd greatly appreciate it. Thank you.
Not familiar with torchrun and deepspeed, but the accelerate launcher loads the model on all GPUs (each GPU with its own copy), which allows a larger effective batch size. I think it just needs a small change in the code. In theory, with a larger batch size, fewer steps are needed, so the whole training would be faster.
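To make the step-count arithmetic concrete (all numbers here are illustrative assumptions, not values from this thread):
# Data parallelism multiplies the effective batch by the number of GPUs,
# which shrinks the number of optimizer steps per epoch proportionally.
per_device_batch = 4
grad_accum = 8
dataset_size = 10_000

for world_size in (1, 2):
    effective_batch = per_device_batch * grad_accum * world_size
    steps_per_epoch = -(-dataset_size // effective_batch)  # ceiling division
    print(f"{world_size} GPU(s): effective batch {effective_batch} -> {steps_per_epoch} steps/epoch")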
That makes sense. I will try it out and post the results.
Has there been any progress made on this issue? If we are running with more than 1 GPU, they should all be efficiently utilized. As it stands now, only one GPU actually computes at a time... which kind of defeats the point.
When attempting to run with accelerate or DeepSpeed, it just runs two separate instances of the trainer and subsequently runs out of memory.
In theory I think it is compatible with DeepSpeed, but I'm not sure how to implement it because I'm not so familiar with working on multiple GPUs...
Cool, I got multi-GPU inference working with FastChat (Vicuna's toolchain). I'll close this.
I tried to run with 2 GPUs with the following command:
torchrun --nproc_per_node=2 --master_port=1234 finetune.py --groupsize 128 --cutoff_len 2048 --llama_q4_model ./llama-30b-4bit-128g.safetensors --llama_q4_config_dir ./llama-30b-4bit/ --lora_out_dir ./alpaca1337-30b-4bit/ ./leet10k-alpaca-merged.json
I got this error:
Did I make a mistake in the arguments?
The full trace: