Seems like this is the fix:
pip install transformers==4.28.1
https://github.com/LianjiaTech/BELLE/issues/226
Maybe that needs to be in the requirements.txt?
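If it does go in, the pin would just be a one-line entry in requirements.txt (assuming the usual one-package-per-line format):
transformers==4.28.1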
I'm getting an out-of-memory error on the 4-bit 30B model with dual 3090s; any advice is appreciated.
Did you use gradient checkpointing?
I did not think I would need to, with dual-3090s. I will try that.
That didn't fix it.
my command line: torchrun --nnodes=1 --nproc-per-node=2 finetune.py --grad_chckpt --groupsize 128 --cutoff_len 2048 --llama_q4_model ./llama-30b-4bit-128g.safetensors --llama_q4_config_dir ./llama-30b-4bit/ --lora_out_dir ./alpaca1337-30b-4bit/ ./leet10k-alpaca-merged.json
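For context, here is a generic sketch of enabling gradient checkpointing on a stock Hugging Face causal LM; the --grad_chckpt flag in the command above is assumed to do something equivalent internally, and the gpt2 checkpoint is only a placeholder:
from transformers import AutoModelForCausalLM

# Recompute activations during the backward pass to trade compute for memory.
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.gradient_checkpointing_enable()
model.config.use_cache = False  # the KV cache is incompatible with checkpointing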
$ nvidia-smi nvlink --status
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-eee9504e-62e0-2531-1c63-a7521e83ec0e)
Link 0: 14.062 GB/s
Link 1: 14.062 GB/s
Link 2: 14.062 GB/s
Link 3: 14.062 GB/s
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-91787ba8-6958-941d-1cc7-369dfe15ba06)
Link 0: 14.062 GB/s
Link 1: 14.062 GB/s
Link 2: 14.062 GB/s
Link 3: 14.062 GB/s
$ python
Python 3.10.11 (main, Apr 20 2023, 19:02:41) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.device_count()
2
>>> torch.cuda.get_device_name(1)
'NVIDIA GeForce RTX 3090'
>>> torch.cuda.get_device_name(0)
'NVIDIA GeForce RTX 3090'
>>> quit()
Error message:
File "/home/eric/git/alpaca_lora_4bit/finetune.py", line 179, in <module>
trainer.train()
File "/home/eric/miniconda3/envs/al4b/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
return inner_training_loop(
File "/home/eric/miniconda3/envs/al4b/lib/python3.10/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/eric/miniconda3/envs/al4b/lib/python3.10/site-packages/transformers/trainer.py", line 2709, in training_step
self.scaler.scale(loss).backward()
File "/home/eric/miniconda3/envs/al4b/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/home/eric/miniconda3/envs/al4b/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/eric/miniconda3/envs/al4b/lib/python3.10/site-packages/torch/autograd/function.py", line 274, in apply
return user_fn(self, *args)
File "/home/eric/miniconda3/envs/al4b/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 157, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/home/eric/miniconda3/envs/al4b/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 372.00 MiB (GPU 0; 24.00 GiB total capacity; 21.49 GiB already allocated; 0 bytes free; 23.10 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
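Following the allocator hint at the end of that traceback, one thing worth trying before cutting the context length is capping split sizes; the 128 MiB value here is illustrative, not a recommendation from this thread:
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 torchrun --nnodes=1 --nproc-per-node=2 finetune.py --grad_chckpt --groupsize 128 --cutoff_len 2048 --llama_q4_model ./llama-30b-4bit-128g.safetensors --llama_q4_config_dir ./llama-30b-4bit/ --lora_out_dir ./alpaca1337-30b-4bit/ ./leet10k-alpaca-merged.json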
Changing --cutoff_len to 1024 got it to work, but my single 4090 could do that. Shouldn't dual 3090s be able to handle 2048?
Use load_llama_model_4bit_low_ram_and_offload in finetune.py and set the memory limit to {0: '9Gib', 1: '9Gib', 'cpu': '24Gib'}. And try running it without torchrun.
Thanks, I'll try that.
I modified finetune.py as follows:
# Load Basic Model
model, tokenizer = load_llama_model_4bit_low_ram_and_offload(
    ft_config.llama_q4_config_dir,
    ft_config.llama_q4_model,
    # device_map=ft_config.device_map,
    groupsize=ft_config.groupsize,
    is_v1_model=ft_config.v1,
    max_memory={0: "9Gib", 1: "9Gib", "cpu": "24Gib"},
)

# Config Lora
lora_config = LoraConfig(
    r=ft_config.lora_r,
    lora_alpha=ft_config.lora_alpha,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=ft_config.lora_dropout,
    bias="none",
    task_type="CAUSAL_LM",
)
Then I run it like this:
python finetune.py --grad_chckpt --resume_checkpoint WizardLM1337-13b-4bit/checkpoint-8950/ --groupsize 128 --xformers --cutoff_len 2048 --llama_q4_model ./llama-13b-4bit-128g.safetensors --llama_q4_config_dir ./llama-13b-4bit/ --lora_out_dir ./WizardLM1337-13b-4bit/ ./leet10k-WizardLM-merged.json
And it's working, but I see that only one GPU is utilized.
Yes, because the model is split across the 2 GPUs and the computation cycle is sequential, it can only utilize 1 GPU at a time.
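To illustrate why (a toy sketch, not this repo's actual code): with naive model parallelism each stage waits on the previous stage's output, so the devices take turns.
import torch
import torch.nn as nn

# Two halves of a model pinned to different devices.
stage0 = nn.Linear(1024, 1024).to("cuda:0")
stage1 = nn.Linear(1024, 1024).to("cuda:1")

x = torch.randn(8, 1024, device="cuda:0")
h = stage0(x)        # GPU 1 idles while GPU 0 computes
h = h.to("cuda:1")   # activations hop between devices
y = stage1(h)        # GPU 0 idles while GPU 1 computes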
@johnsmith0031 would this mean I cannot use your library to train with 2 GPUs at 2x speed, or 4 GPUs at 4x speed? I have 4x 3090s, but it would be a bummer to get no speed increase when training in 4-bit.
Is this library incompatible with pipeline parallelism, then? Sorry for the newbish questions.
What would you suggest, then, to make full use of my GPUs for faster training: 8-bit with HF PEFT, or another library like DeepSpeed?
Since the finetuning uses the Trainer from transformers, I think it supports it natively. I'm just not sure how to run it; maybe using the accelerate launcher?
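A hedged sketch of such a launch, reusing the arguments from the torchrun command earlier in this thread; whether finetune.py runs unmodified under the launcher is exactly the open question here:
accelerate launch --multi_gpu --num_processes 2 finetune.py --grad_chckpt --groupsize 128 --cutoff_len 2048 --llama_q4_model ./llama-30b-4bit-128g.safetensors --llama_q4_config_dir ./llama-30b-4bit/ --lora_out_dir ./alpaca1337-30b-4bit/ ./leet10k-alpaca-merged.json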
Is torchrun a new command? I successfully ran on 2x 3090 as of about 1.5 weeks ago using a different command, and have updated since.
@johnsmith0031 would this mean I would effectively get 2x speed if I do not spread the model across multiple GPUs? I would like a way to have both GPUs computing at the same time, but from what I can tell, due to the sequential computation, that's not happening. Correct me if I'm wrong, as I'm just beginning to grasp the concept. Thank you.
@tensiondriven were you able to get faster training times with 2x 3090s vs. the time it would take on one? I saw your comment in another issue mentioning the power draw was only for one GPU, so I'm confused how the training is more efficient if it's only using one at a time.
@johnsmith0031 if you have any explanation to add to what I just said, I'd greatly appreciate it. Thank you.
Not familiar with torchrun and deepspeed, but the accelerate launcher loads the model on all GPUs (each GPU with its own copy), which allows a larger effective batch size. I think it just needs a small change in the code. In theory, with a larger batch size, fewer steps are needed, so the whole training would be faster.
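To make the step-count arithmetic concrete (all numbers here are illustrative assumptions, not values from this thread):
# Data parallelism multiplies the effective batch by the number of GPUs,
# which shrinks the number of optimizer steps per epoch proportionally.
per_device_batch = 4
grad_accum = 8
dataset_size = 10_000

for world_size in (1, 2):
    effective_batch = per_device_batch * grad_accum * world_size
    steps_per_epoch = -(-dataset_size // effective_batch)  # ceiling division
    print(f"{world_size} GPU(s): effective batch {effective_batch} -> {steps_per_epoch} steps/epoch")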
That makes sense. I will try it out and post the results.
Has there been any progress made on this issue? If we are running with more than 1 GPU, they should all be efficiently utilized. As it stands now, only one GPU actually computes at a time... which kind of defeats the point.
When attempting to run with accelerate or DeepSpeed, it just runs two separate instances of the trainer and subsequently runs out of memory.
In theory I think it is compatible with DeepSpeed, but I'm not sure how to implement it because I'm not so familiar with working on multiple GPUs...
Cool, I got multi-GPU inference working with FastChat (Vicuna's toolchain). I'll close this.
I tried to run with 2 GPUs with the following command:
torchrun --nproc_per_node=2 --master_port=1234 finetune.py --groupsize 128 --cutoff_len 2048 --llama_q4_model ./llama-30b-4bit-128g.safetensors --llama_q4_config_dir ./llama-30b-4bit/ --lora_out_dir ./alpaca1337-30b-4bit/ ./leet10k-alpaca-merged.json
I got this error:
Did I make a mistake in the arguments?
The full trace: