Open Isdriai opened 2 weeks ago
I'm pretty sure that if you use SFTTrainer, there is no need to use accelerate explicitly, as it's handled under the hood. Could you please remove it completely and try again? I'm not sure if that's what you did in your last attempt; if it is, could you please show the final code you ran? The more you can show, the better.
The last working code I have (for 1 GPU only) is the following (only the model-related part, for clarity):
# Imports for the snippet below (helpers such as load_data, prepare_train_datav2
# and show_cuda_memory are defined elsewhere in my script).
from peft import LoraConfig, get_peft_model
from sklearn.model_selection import train_test_split
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

def get_model(model_id):
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype="float16",
        bnb_4bit_use_double_quant=True,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    model.config.use_cache = False
    model.config.pretraining_tp = 1
    return model

def get_tokenizer(model_id):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token
    return tokenizer

def main(model_id, data_file, dir_output):
    print("get tokenizer")
    tokenizer = get_tokenizer(model_id)

    print("data")
    raw_data = load_data(data_file)
    training_data, test_data = train_test_split(raw_data, test_size=0.2, random_state=12)
    data = prepare_train_datav2(training_data)

    print("model")
    model = get_model(model_id)

    print("lora")
    peft_config = LoraConfig(
        r=8, lora_alpha=16, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM"
    )
    model = get_peft_model(model, peft_config)

    print("training preparation")
    model.train()
    model.gradient_checkpointing_enable()
    model.enable_input_require_grads()

    training_arguments = SFTConfig(
        output_dir=dir_output,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=64,
        optim="paged_adamw_32bit",
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        save_strategy="epoch",
        logging_steps=10,
        num_train_epochs=3,
        max_steps=250,
        bf16=True,
        push_to_hub=False,
    )

    trainer = SFTTrainer(
        model=model,
        train_dataset=data,
        peft_config=peft_config,
        dataset_text_field="text",
        args=training_arguments,
        tokenizer=tokenizer,
        packing=False,
        max_seq_length=1024,
    )

    print("train")
    show_cuda_memory()
    trainer.train()
I ran the code with 1 or 4 GPUs without any code change (I run it on a remote server managed by Slurm, so I can easily request 1 or 4 GPUs for different jobs):
1 GPU output:
train
GPU 0:
Total Memory: 34072559616
Memory Reserved: 5809111040
Memory Allocated: 5716698624
0%| | 1/250 [04:52<20:12:53, 292.26s/it]
4 GPUs output:
train
GPU 0:
Total Memory: 34072559616
Memory Reserved: 5809111040
Memory Allocated: 5716698624
GPU 1:
Total Memory: 34072559616
Memory Reserved: 0
Memory Allocated: 0
GPU 2:
Total Memory: 34072559616
Memory Reserved: 0
Memory Allocated: 0
GPU 3:
Total Memory: 34072559616
Memory Reserved: 0
Memory Allocated: 0
0%| | 0/250 [00:00<?, ?it/s]/project/6045847/user/project/env/lib/python3.11/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
0%| | 1/250 [04:18<17:52:03, 258.33s/it]
We can see that with 1 GPU training is estimated at about 20h, and with 4 GPUs at about 18h, so there is not much difference. I would expect the 4-GPU run to take close to a quarter of the single-GPU time. We can also see that only one GPU is actually used when I request 4.
So apparently SFTTrainer doesn't use all the GPUs by itself when more than one is available. I also tried the change that people recommend adding when they use more than one GPU:
from accelerate import PartialState

def get_model(model_id):
    .......
    device_string = PartialState().process_index
    model = AutoModelForCausalLM.from_pretrained(
        .........., device_map={"": device_string}
    )
    .......
I get the same result: training is still estimated at roughly 18-20h, just like with a single GPU.
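A minimal diagnostic sketch (not part of the code above): printing the process layout that Accelerate reports shows whether the run is actually data-parallel; a plain python script.py launch starts a single process, no matter how many GPUs are visible to it.

# Minimal diagnostic sketch, assuming the same accelerate installation as above.
# One process means no data parallelism, regardless of the visible GPU count.
from accelerate import PartialState

state = PartialState()
print(f"process {state.process_index + 1}/{state.num_processes} on device {state.device}")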
After several attempts with different options, I noticed that my code is indeed using multiple GPUs, but I'm observing some strange behavior. Specifically, when I run the code with 1 GPU, training takes about 19.5 hours. With 4 GPUs, the time drops only slightly, to 17.5 hours. However, with 2 GPUs the runtime is significantly better, around 9 hours, which is actually faster than with 3 GPUs (13h) or 4 GPUs. These are the outputs I get:
4 GPU output (~17h30)
train
GPU 0:
Total Memory: 34072559616
Memory Reserved: 5809111040
Memory Allocated: 5716698624
GPU 1:
Total Memory: 34072559616
Memory Reserved: 0
Memory Allocated: 0
GPU 2:
Total Memory: 34072559616
Memory Reserved: 0
Memory Allocated: 0
GPU 3:
Total Memory: 34072559616
Memory Reserved: 0
Memory Allocated: 0
0%| | 0/250 [00:00<?, ?it/s]/project/6045847/user/project/env/lib/python3.11/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
1%| | 3/250 [12:38<17:20:04, 252.65s/it]
3 GPU output (~13h)
train
GPU 0:
Total Memory: 34072559616
Memory Reserved: 5809111040
Memory Allocated: 5716698624
GPU 1:
Total Memory: 34072559616
Memory Reserved: 0
Memory Allocated: 0
GPU 2:
Total Memory: 34072559616
Memory Reserved: 0
Memory Allocated: 0
0%| | 0/250 [00:00<?, ?it/s]/project/6045847/user/project/env/lib/python3.11/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
2%|▏ | 4/250 [12:32<12:51:30, 188.17s/it]
2 GPUs output (~9h)
train
GPU 0:
Total Memory: 34072559616
Memory Reserved: 5809111040
Memory Allocated: 5716698624
GPU 1:
Total Memory: 34072559616
Memory Reserved: 0
Memory Allocated: 0
0%| | 0/250 [00:00<?, ?it/s]/project/6045847/user/project/env/lib/python3.11/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
2%|▏ | 6/250 [12:49<8:41:07, 128.14s/it]
1 GPU output (~19h30)
train
GPU 0:
Total Memory: 34072559616
Memory Reserved: 5809111040
Memory Allocated: 5716698624
1%| | 2/250 [09:18<19:03:52, 276.74s/it]
This is the code used to show VRAM usage:
import torch

def show_cuda_memory():
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}:")
        print(f"    Total Memory: {torch.cuda.get_device_properties(i).total_memory}")
        print(f"    Memory Reserved: {torch.cuda.memory_reserved(i)}")
        print(f"    Memory Allocated: {torch.cuda.memory_allocated(i)}")
I'm trying to understand why my code performs best with 2 GPUs instead of 4. Additionally, based on my console outputs, it seems that only the first GPU is being used during trainer.train(). I'm also wondering how I can check GPU utilization from my Python code, since I'm in a Slurm environment where I cannot run an external command like nvidia-smi in a separate console.
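For reference, a minimal sketch of how GPU utilization could be queried from inside Python without calling nvidia-smi, assuming the pynvml (nvidia-ml-py) package is available on the cluster; it exposes the same NVML counters that nvidia-smi reports:

# Sketch only: query NVML directly instead of shelling out to nvidia-smi.
# Requires the pynvml (nvidia-ml-py) package to be installed.
import pynvml

def show_gpu_utilization():
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percent busy
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes
            print(f"GPU {i}: utilization={util.gpu}% memory={mem.used}/{mem.total}")
    finally:
        pynvml.nvmlShutdown()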
Glad that you got it running, but I'm not sure why you see the bad scaling behavior. One minor issue I spotted in your code, although it's unlikely to be the cause: when you pass peft_config to SFTTrainer, there is no need to call model = get_peft_model(model, peft_config), as SFTTrainer does that under the hood, so please remove that line.
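A minimal sketch of that change, reusing the variable names from the code above (model, data, training_arguments, tokenizer):

# Sketch of the suggested simplification: drop get_peft_model and let
# SFTTrainer apply the LoRA adapters itself via peft_config.
peft_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM"
)

trainer = SFTTrainer(
    model=model,                 # the plain quantized base model, no get_peft_model call
    train_dataset=data,
    peft_config=peft_config,     # SFTTrainer wraps the model with PEFT internally
    dataset_text_field="text",
    args=training_arguments,
    tokenizer=tokenizer,
    packing=False,
    max_seq_length=1024,
)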
System Info
Information
Tasks
no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
Hi,
I am trying to parallelize training over 4 GPUs (V100, 32GB VRAM each). I have working code for 1 GPU using LoRA, peft, SFTConfig and SFTTrainer. I tried to add some lines from accelerate (the library), as shown in some tutorials, to achieve this, but without success.
This is the error I get (I get it 4 times due to the parallelization, but for clarity I include only one occurrence):
This is my code (I don't include all of it, just the part about the model itself, for clarity):
The only code I added between the 1-GPU and 4-GPU versions is:
And I run the code via a bash script:
The 1 GPU version of this script was:
python script.py --model_path $1 --output $2
I also tried this at the end of the main function (deleting model = accelerator.prepare_model(model)):
But this time I have this error:
I tried some fixes as discussed in this thread: https://discuss.huggingface.co/t/multiple-gpu-in-sfttrainer/91899
Unfortunately I still have some errors:
This is my code now:
I removed
and I modified my bash script:
Expected behavior
I would like to use accelerate to make my 1-GPU code work with multiple GPUs.
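For completeness, a hedged sketch of what the model-loading part of the multi-GPU variant discussed in this thread could look like: the quantized model is pinned to the local process's GPU instead of device_map="auto", and the script is started with one process per GPU (the helper functions and the rest of main are assumed to be the ones shown earlier):

# Sketch only, combining the suggestions from this thread; launch with one
# process per GPU, e.g.:
#   accelerate launch --num_processes 4 script.py --model_path ... --output ...
from accelerate import PartialState
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def get_model(model_id):
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype="float16",
        bnb_4bit_use_double_quant=True,
    )
    # Pin the whole model to this process's GPU (instead of device_map="auto"),
    # so every data-parallel process holds its own replica.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map={"": PartialState().process_index},
    )
    model.config.use_cache = False
    return model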