Have you tried the suggestion in the error message?
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
This usually solves this issue.
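For reference, a minimal sketch of that fix (the checkpoint path is just an example; either option in the error message works):

from transformers import AutoTokenizer

# Example checkpoint path; substitute your own model.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

# Option 1: reuse the EOS token as the padding token.
tokenizer.pad_token = tokenizer.eos_token

# Option 2: add a dedicated [PAD] token. If you do this, remember to call
# model.resize_token_embeddings(len(tokenizer)) afterwards.
# tokenizer.add_special_tokens({"pad_token": "[PAD]"})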
After I added tokenizer.pad_token = tokenizer.eos_token, that error no longer occurs, but I get OOM. Maybe that's because my GPU card only has 40GB of memory. So I configured accelerate to use FSDP; the following is my config:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_backward_prefetch_policy: BACKWARD_PRE
fsdp_cpu_ram_efficient_loading: true
fsdp_forward_prefetch: false
fsdp_offload_params: true
fsdp_sharding_strategy: 1
fsdp_state_dict_type: SHARDED_STATE_DICT
fsdp_sync_module_states: true
fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
When I ran with this config, it threw ValueError: Cannot flatten integer dtype tensors. I found this issue, which says FSDP can't be used with load_in_4bit, so I deleted the load_in_4bit option. But then it throws another error: "unable to mmap 4865559944 bytes from file .safetensors: Cannot allocate memory".
File "/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/lib/python3.9/site-packages/transformers/modeling_utils.py", line 503, in load_state_dict
state_dict = load_state_dict(shard_file)
File "/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/lib/python3.9/site-packages/transformers/modeling_utils.py", line 503, in load_state_dict
with safe_open(checkpoint_file, framework="pt") as f:
with safe_open(checkpoint_file, framework="pt") as f:RuntimeError
: unable to mmap 4865559944 bytes from file </nas/lili/models_hf/Mixtral-8x7B-Instruct-v0.1/model-00028-of-00039.safetensors>: Cannot allocate memory (12)RuntimeError
: unable to mmap 4865559944 bytes from file </nas/lili/models_hf/Mixtral-8x7B-Instruct-v0.1/model-00028-of-00039.safetensors>: Cannot allocate memory (12)
state_dict = load_state_dict(shard_file)
I also found this issue, so I set dataloader_num_workers=0:
training_args = TrainingArguments(
    output_dir=script_args.output_dir,
    per_device_train_batch_size=script_args.batch_size,
    gradient_accumulation_steps=script_args.gradient_accumulation_steps,
    learning_rate=script_args.learning_rate,
    logging_steps=script_args.logging_steps,
    num_train_epochs=script_args.num_train_epochs,
    max_steps=script_args.max_steps,
    report_to=script_args.report_to,
    save_steps=script_args.save_steps,
    save_total_limit=script_args.save_total_limit,
    push_to_hub=script_args.push_to_hub,
    hub_model_id=script_args.hub_model_id,
    gradient_checkpointing=script_args.gradient_checkpointing,
    dataloader_num_workers=0,
    # TODO: uncomment that on the next release
    # gradient_checkpointing_kwargs=script_args.gradient_checkpointing_kwargs,
)
But it still throws the same exception.
I checked again: the 8 processes used up all 1TB of host memory. Why do they need so much memory? I have already configured SHARDED_STATE_DICT and fsdp_sharding_strategy=1.
It occurred while loading the model, not while loading data.
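Not necessarily the root cause, but worth checking: if each of the 8 ranks materializes a full copy of Mixtral-8x7B (roughly 47B parameters, about 180GB in fp32 or 90GB in bf16) in host RAM before FSDP shards it, 1TB can plausibly be exhausted. A minimal sketch of a less RAM-hungry load, assuming the script ends up calling from_pretrained on every rank:

import torch
from transformers import AutoModelForCausalLM

# Sketch only. low_cpu_mem_usage=True avoids building a fully initialized model
# before loading the checkpoint shards, and bf16 roughly halves the host-RAM
# footprint compared to the default fp32 load.
model = AutoModelForCausalLM.from_pretrained(
    "/nas/lili/models_hf/Mixtral-8x7B-Instruct-v0.1",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)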
Just to confirm: you have a machine with 8x40GB (A100?) GPUs? Have you tried with a smaller model first? E.g. Mistral-7b?
We tested the following on a 1x80GB A100:
accelerate launch --config_file examples/accelerate_configs/multi_gpu.yaml --num_processes=1 \
examples/scripts/sft.py \
--model_name mistralai/Mixtral-8x7B-v0.1 \
--dataset_name trl-lib/ultrachat_200k_chatml \
--batch_size 2 \
--gradient_accumulation_steps 1 \
--learning_rate 2e-4 \
--save_steps 200_000 \
--use_peft \
--peft_lora_r 16 --peft_lora_alpha 32 \
--target_modules q_proj k_proj v_proj o_proj \
--load_in_4bit
cc @younesbelkada
Yes, I have a machine with 8x A100 40GB. I can run a small model (Mistral 7B) on one GPU:
CUDA_VISIBLE_DEVICES=1 accelerate launch --config_file examples/accelerate_configs/multi_gpu.yaml --num_processes=1 \
examples/scripts/sft.py \
--model_name /nas/lili/models_hf/mistral-7b/ \
--dataset_name trl-lib/ultrachat_200k_chatml \
--batch_size 1 \
--gradient_accumulation_steps 1 \
--learning_rate 2e-4 \
--save_steps 200_000 \
--use_peft \
--peft_lora_r 16 --peft_lora_alpha 32 \
--target_modules q_proj k_proj v_proj o_proj \
--load_in_4bit
But when I switch to Mixtral-8x7B, it runs OOM:
CUDA_VISIBLE_DEVICES=1 accelerate launch --config_file examples/accelerate_configs/multi_gpu.yaml --num_processes=1 \
examples/scripts/sft.py \
--model_name /nas/lili/models_hf/Mixtral-8x7B-Instruct-v0.1 \
--dataset_name trl-lib/ultrachat_200k_chatml \
--batch_size 1 \
--gradient_accumulation_steps 1 \
--learning_rate 2e-4 \
--save_steps 200_000 \
--use_peft \
--peft_lora_r 16 --peft_lora_alpha 32 \
--target_modules q_proj k_proj v_proj o_proj \
--load_in_4bit
error message:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 224.00 MiB is free. Including non-PyTorch memory, this process has 39.16 GiB memory in use. Of the allocated memory 38.28 GiB is allocated by PyTorch, and 242.54 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
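As the message suggests, allocator fragmentation can sometimes be reduced by setting max_split_size_mb, though this only helps with fragmentation and cannot recover memory the model genuinely needs. A sketch (the 128 MB value is an arbitrary example):

import os

# Must be set before any CUDA allocation happens (e.g. at the very top of the
# training script), or export the variable in the shell before `accelerate launch`.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"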
Hi @fancyerii!
You are using a 40GB GPU and we used an 80GB GPU. Can you try to enable gradient checkpointing (which is set to False by default) by passing --gradient_checkpointing on the command line?
@younesbelkada When I add --gradient_checkpointing, another error is thrown:
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
/home/ubuntu/miniconda3/envs/torchshare/lib/python3.9/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
Traceback (most recent call last):
File "/nas/lili/codes/pt/ft/trl/examples/scripts/sft.py", line 159, in <module>
trainer.train()
File "/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/lib/python3.9/site-packages/trl/trainer/sft_trainer.py", line 315, in train
output = super().train(*args, **kwargs)
File "/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/lib/python3.9/site-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File "/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/lib/python3.9/site-packages/transformers/trainer.py", line 1854, in _inner_training
_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/lib/python3.9/site-packages/transformers/trainer.py", line 2744, in training_step
self.accelerator.backward(loss)
File "/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/lib/python3.9/site-packages/accelerate/accelerator.py", line 1905, in backward
loss.backward(**kwargs)
File "/home/ubuntu/miniconda3/envs/torchshare/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "/home/ubuntu/miniconda3/envs/torchshare/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/ubuntu/miniconda3/envs/torchshare/lib/python3.9/site-packages/torch/autograd/function.py", line 288, in apply
return user_fn(self, *args)
File "/home/ubuntu/miniconda3/envs/torchshare/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 288, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/home/ubuntu/miniconda3/envs/torchshare/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/ubuntu/miniconda3/envs/torchshare/lib/python3.9/site-packages/torch/autograd/function.py", line 288, in apply
return user_fn(self, *args)
File "/home/ubuntu/miniconda3/envs/torchshare/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 288, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/home/ubuntu/miniconda3/envs/torchshare/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 255 with name base_model.model.model.layers.31.self_attn.o_proj.lora_B.default.weight has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.
0%| | 0/623595 [00:02<?, ?it/s]
[2024-01-12 15:55:33,964] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 176801) of binary: /home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/bin/python
Traceback (most recent call last):
File "/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
multi_gpu_launcher(args)
File "/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/lib/python3.9/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
distrib_run.run(args)
File "/home/ubuntu/miniconda3/envs/torchshare/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/ubuntu/miniconda3/envs/torchshare/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ubuntu/miniconda3/envs/torchshare/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
Can you pass {"use_reentrant": False} into gradient_checkpointing_kwargs in TrainingArguments?
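A sketch of what that looks like in the script's TrainingArguments (only the checkpointing-related fields are shown; the remaining script_args fields stay as in the snippet earlier in the thread, and gradient_checkpointing_kwargs requires a transformers release that supports it, as the script's own TODO notes):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=script_args.output_dir,
    gradient_checkpointing=True,
    # Non-reentrant checkpointing avoids the "marked as ready twice" error that
    # the reentrant variant triggers under DDP.
    gradient_checkpointing_kwargs={"use_reentrant": False},
    # ... remaining arguments as above ...
)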
@younesbelkada After adding this argument, it runs. But it's very, very slow: the estimate is about 400+ hours to complete. I think I still need to use multiple GPUs to speed it up.
@younesbelkada How long did training take on a single 80GB GPU for this example?
@fancyerii this really depends on your dataset; you can also train for fewer steps. Can you confirm your training is running on all your GPUs?
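One quick way to confirm (a sketch; watching nvidia-smi during training works just as well):

import torch
from accelerate import Accelerator

accelerator = Accelerator()
# One line per rank: with 8 GPUs you should see 8 processes, each on its own device.
print(f"rank {accelerator.process_index}/{accelerator.num_processes} "
      f"on {accelerator.device}, visible GPUs: {torch.cuda.device_count()}")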
@younesbelkada If I use one GPU, it needs 400+ hours. When I use all 8 GPUs, it needs 50+ hours. I am using the Mixtral example:
CUDA_VISIBLE_DEVICES=1 accelerate launch --config_file examples/accelerate_configs/multi_gpu.yaml --num_processes=1 \
examples/scripts/sft.py \
--model_name /nas/lili/models_hf/Mixtral-8x7B-Instruct-v0.1 \
--dataset_name trl-lib/ultrachat_200k_chatml \
--batch_size 1 \
--gradient_accumulation_steps 1 \
--learning_rate 2e-4 \
--save_steps 200_000 \
--use_peft \
--peft_lora_r 16 --peft_lora_alpha 32 \
--target_modules q_proj k_proj v_proj o_proj \
--load_in_4bit
@fancyerii - this is expected, no? If you use 8 GPUs, then the training time gets split correctly across all GPUs, no?
@younesbelkada I just want to know the training time on one A100 80GB GPU so I can compare against it.
@younesbelkada Thanks for this solution. I am using the accelerate multi-GPU config and it is working well for Mixtral. My GPUs are 8x A100 40GB. However, it goes OOM if the sequence length is larger than 1024, which is small; I need at least 2048. I have enabled gradient checkpointing, decreased the batch size to 1, and am using paged AdamW 8-bit. Still, it goes OOM. Is there anything else I can do? I am not sure whether the multi-GPU config allows CPU offload the way DeepSpeed does. I would really appreciate your help. Thanks.
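Not something tested in this thread, but CPU offload is usually done by switching from the plain multi-GPU (DDP) config to DeepSpeed ZeRO-3 with parameter/optimizer offload. A minimal sketch of such a config passed through TrainingArguments (values are illustrative; "auto" lets the Trainer fill them in):

from transformers import TrainingArguments

# Illustrative DeepSpeed ZeRO-3 config with CPU offload; tune to your setup.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="./out",
    per_device_train_batch_size=1,
    gradient_checkpointing=True,
    bf16=True,
    deepspeed=ds_config,  # accepts a dict or a path to a JSON config file
)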
Hi @saeedkhaki92, please see my comment on the other issue! Let me know how it goes.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
It's really weird that you don't get OOM from the start (when loading the models after prepare, even before training begins). I get OOM with DeepSpeed ZeRO-3 for this setup, which is strange: how come multi-GPU (which is just DDP) doesn't OOM, while DeepSpeed ZeRO-3 (with model sharding, of course) does?