Have you tried the suggestion in the error message?
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
This usually solves this issue.
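For reference, a minimal sketch of that fix (the checkpoint path is just an example; either option in the error message works):

from transformers import AutoTokenizer

# Example checkpoint path; substitute your own model.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

# Option 1: reuse the EOS token as the padding token.
tokenizer.pad_token = tokenizer.eos_token

# Option 2: add a dedicated [PAD] token. If you do this, remember to call
# model.resize_token_embeddings(len(tokenizer)) afterwards.
# tokenizer.add_special_tokens({"pad_token": "[PAD]"})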
After I added tokenizer.pad_token = tokenizer.eos_token, that error no longer occurs, but I get OOM. Maybe that's because my GPU card only has 40GB of memory. So I configured accelerate to use FSDP; the following is my config:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_backward_prefetch_policy: BACKWARD_PRE
fsdp_cpu_ram_efficient_loading: true
fsdp_forward_prefetch: false
fsdp_offload_params: true
fsdp_sharding_strategy: 1
fsdp_state_dict_type: SHARDED_STATE_DICT
fsdp_sync_module_states: true
fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
When I ran with this config, it threw ValueError: Cannot flatten integer dtype tensors. I found this issue, which says FSDP can't be used with load_in_4bit, so I deleted the load_in_4bit option. But then it throws another error: "unable to mmap 4865559944 bytes from file .safetensors: Cannot allocate memory".
File "/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/lib/python3.9/site-packages/transformers/modeling_utils.py", line 503, in load_state_dict
state_dict = load_state_dict(shard_file)
File "/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/lib/python3.9/site-packages/transformers/modeling_utils.py", line 503, in load_state_dict
with safe_open(checkpoint_file, framework="pt") as f:
with safe_open(checkpoint_file, framework="pt") as f:RuntimeError
: unable to mmap 4865559944 bytes from file </nas/lili/models_hf/Mixtral-8x7B-Instruct-v0.1/model-00028-of-00039.safetensors>: Cannot allocate memory (12)RuntimeError
: unable to mmap 4865559944 bytes from file </nas/lili/models_hf/Mixtral-8x7B-Instruct-v0.1/model-00028-of-00039.safetensors>: Cannot allocate memory (12)
state_dict = load_state_dict(shard_file)
I also found this issue, so I set dataloader_num_workers=0:
training_args = TrainingArguments(
    output_dir=script_args.output_dir,
    per_device_train_batch_size=script_args.batch_size,
    gradient_accumulation_steps=script_args.gradient_accumulation_steps,
    learning_rate=script_args.learning_rate,
    logging_steps=script_args.logging_steps,
    num_train_epochs=script_args.num_train_epochs,
    max_steps=script_args.max_steps,
    report_to=script_args.report_to,
    save_steps=script_args.save_steps,
    save_total_limit=script_args.save_total_limit,
    push_to_hub=script_args.push_to_hub,
    hub_model_id=script_args.hub_model_id,
    gradient_checkpointing=script_args.gradient_checkpointing,
    dataloader_num_workers=0,
    # TODO: uncomment that on the next release
    # gradient_checkpointing_kwargs=script_args.gradient_checkpointing_kwargs,
)
But it still throws the same exception.
I checked again: the 8 processes used up all 1TB of host memory. Why do they need so much memory? I have already configured SHARDED_STATE_DICT and fsdp_sharding_strategy=1.
It occurred while loading the model, not while loading data.
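Not necessarily the root cause, but worth checking: if each of the 8 ranks materializes a full copy of Mixtral-8x7B (roughly 47B parameters, about 180GB in fp32 or 90GB in bf16) in host RAM before FSDP shards it, 1TB can plausibly be exhausted. A minimal sketch of a less RAM-hungry load, assuming the script ends up calling from_pretrained on every rank:

import torch
from transformers import AutoModelForCausalLM

# Sketch only. low_cpu_mem_usage=True avoids building a fully initialized model
# before loading the checkpoint shards, and bf16 roughly halves the host-RAM
# footprint compared to the default fp32 load.
model = AutoModelForCausalLM.from_pretrained(
    "/nas/lili/models_hf/Mixtral-8x7B-Instruct-v0.1",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)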
Just to confirm: you have a machine with 8x40GB (A100?) GPUs? Have you tried with a smaller model first? E.g. Mistral-7b?
We tested the following on a 1x80GB A100:
accelerate launch --config_file examples/accelerate_configs/multi_gpu.yaml --num_processes=1 \
examples/scripts/sft.py \
--model_name mistralai/Mixtral-8x7B-v0.1 \
--dataset_name trl-lib/ultrachat_200k_chatml \
--batch_size 2 \
--gradient_accumulation_steps 1 \
--learning_rate 2e-4 \
--save_steps 200_000 \
--use_peft \
--peft_lora_r 16 --peft_lora_alpha 32 \
--target_modules q_proj k_proj v_proj o_proj \
--load_in_4bit
cc @younesbelkada
Yes, I have a machine with 8x A100 40GB. I can run a small model (Mistral 7B) on one GPU:
CUDA_VISIBLE_DEVICES=1 accelerate launch --config_file examples/accelerate_configs/multi_gpu.yaml --num_processes=1 \
examples/scripts/sft.py \
--model_name /nas/lili/models_hf/mistral-7b/ \
--dataset_name trl-lib/ultrachat_200k_chatml \
--batch_size 1 \
--gradient_accumulation_steps 1 \
--learning_rate 2e-4 \
--save_steps 200_000 \
--use_peft \
--peft_lora_r 16 --peft_lora_alpha 32 \
--target_modules q_proj k_proj v_proj o_proj \
--load_in_4bit
But when I switch to Mixtral-8x7B, it runs OOM:
CUDA_VISIBLE_DEVICES=1 accelerate launch --config_file examples/accelerate_configs/multi_gpu.yaml --num_processes=1 \
examples/scripts/sft.py \
--model_name /nas/lili/models_hf/Mixtral-8x7B-Instruct-v0.1 \
--dataset_name trl-lib/ultrachat_200k_chatml \
--batch_size 1 \
--gradient_accumulation_steps 1 \
--learning_rate 2e-4 \
--save_steps 200_000 \
--use_peft \
--peft_lora_r 16 --peft_lora_alpha 32 \
--target_modules q_proj k_proj v_proj o_proj \
--load_in_4bit
error message:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 224.00 MiB is free. Including non-PyTorch memory, this process has 39.16 GiB memory in use. Of the allocated memory 38.28 GiB is allocated by PyTorch, and 242.54 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
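As the message suggests, allocator fragmentation can sometimes be reduced by setting max_split_size_mb, though this only helps with fragmentation and cannot recover memory the model genuinely needs. A sketch (the 128 MB value is an arbitrary example):

import os

# Must be set before any CUDA allocation happens (e.g. at the very top of the
# training script), or export the variable in the shell before `accelerate launch`.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"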
Hi @fancyerii!
You are using a 40GB GPU and we used an 80GB GPU. Can you try to enable gradient checkpointing (which is set to False by default) by passing --gradient_checkpointing on the command line?
@younesbelkada When I add --gradient_checkpointing, another error is thrown:
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
/home/ubuntu/miniconda3/envs/torchshare/lib/python3.9/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
Traceback (most recent call last):
File "/nas/lili/codes/pt/ft/trl/examples/scripts/sft.py", line 159, in <module>
trainer.train()
File "/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/lib/python3.9/site-packages/trl/trainer/sft_trainer.py", line 315, in train
output = super().train(*args, **kwargs)
File "/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/lib/python3.9/site-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File "/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/lib/python3.9/site-packages/transformers/trainer.py", line 1854, in _inner_training
_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/lib/python3.9/site-packages/transformers/trainer.py", line 2744, in training_step
self.accelerator.backward(loss)
File "/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/lib/python3.9/site-packages/accelerate/accelerator.py", line 1905, in backward
loss.backward(**kwargs)
File "/home/ubuntu/miniconda3/envs/torchshare/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "/home/ubuntu/miniconda3/envs/torchshare/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/ubuntu/miniconda3/envs/torchshare/lib/python3.9/site-packages/torch/autograd/function.py", line 288, in apply
return user_fn(self, *args)
File "/home/ubuntu/miniconda3/envs/torchshare/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 288, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/home/ubuntu/miniconda3/envs/torchshare/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/ubuntu/miniconda3/envs/torchshare/lib/python3.9/site-packages/torch/autograd/function.py", line 288, in apply
return user_fn(self, *args)
File "/home/ubuntu/miniconda3/envs/torchshare/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 288, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/home/ubuntu/miniconda3/envs/torchshare/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 255 with name base_model.model.model.layers.31.self_attn.o_proj.lora_B.default.weight has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.
0%| | 0/623595 [00:02<?, ?it/s]
[2024-01-12 15:55:33,964] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 176801) of binary: /home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/bin/python
Traceback (most recent call last):
File "/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
multi_gpu_launcher(args)
File "/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/lib/python3.9/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
distrib_run.run(args)
File "/home/ubuntu/miniconda3/envs/torchshare/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/ubuntu/miniconda3/envs/torchshare/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ubuntu/miniconda3/envs/torchshare/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
Can you pass {"use_reentrant": False} into gradient_checkpointing_kwargs in TrainingArguments?
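A sketch of what that looks like in the script's TrainingArguments (only the checkpointing-related fields are shown; the remaining script_args fields stay as in the snippet earlier in the thread, and gradient_checkpointing_kwargs requires a transformers release that supports it, as the script's own TODO notes):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=script_args.output_dir,
    gradient_checkpointing=True,
    # Non-reentrant checkpointing avoids the "marked as ready twice" error that
    # the reentrant variant triggers under DDP.
    gradient_checkpointing_kwargs={"use_reentrant": False},
    # ... remaining arguments as above ...
)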
@younesbelkada After adding this argument, it runs. But it's very, very slow: the estimate is about 400+ hours to complete. I think I still need to use multiple GPUs to speed it up.
@younesbelkada How long did training take on a single 80GB GPU for this example?
@fancyerii this really depends on your dataset; you can also train for fewer steps. Can you confirm your training is running on all your GPUs?
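One quick way to confirm (a sketch; watching nvidia-smi during training works just as well):

import torch
from accelerate import Accelerator

accelerator = Accelerator()
# One line per rank: with 8 GPUs you should see 8 processes, each on its own device.
print(f"rank {accelerator.process_index}/{accelerator.num_processes} "
      f"on {accelerator.device}, visible GPUs: {torch.cuda.device_count()}")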
@younesbelkada If I use one GPU, it needs 400+ hours. When I use all 8 GPUs, it needs 50+ hours. I am using the Mixtral example:
CUDA_VISIBLE_DEVICES=1 accelerate launch --config_file examples/accelerate_configs/multi_gpu.yaml --num_processes=1 \
examples/scripts/sft.py \
--model_name /nas/lili/models_hf/Mixtral-8x7B-Instruct-v0.1 \
--dataset_name trl-lib/ultrachat_200k_chatml \
--batch_size 1 \
--gradient_accumulation_steps 1 \
--learning_rate 2e-4 \
--save_steps 200_000 \
--use_peft \
--peft_lora_r 16 --peft_lora_alpha 32 \
--target_modules q_proj k_proj v_proj o_proj \
--load_in_4bit
@fancyerii - this is expected, no? If you use 8 GPUs, then the training time gets split correctly across all GPUs, no?
@younesbelkada I just want to know the training time on one A100 80GB GPU so I can compare against it.
@younesbelkada Thanks for this solution. I am using the accelerate multi-GPU config and it is working well for Mixtral. My GPUs are 8x A100 40GB. However, it goes OOM if the sequence length is larger than 1024, which is small; I need at least 2048. I have enabled gradient checkpointing, decreased the batch size to 1, and am using paged AdamW 8-bit. Still, it goes OOM. Is there anything else I can do? I am not sure whether the multi-GPU config allows CPU offload the way DeepSpeed does. I would really appreciate your help. Thanks.
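Not something tested in this thread, but CPU offload is usually done by switching from the plain multi-GPU (DDP) config to DeepSpeed ZeRO-3 with parameter/optimizer offload. A minimal sketch of such a config passed through TrainingArguments (values are illustrative; "auto" lets the Trainer fill them in):

from transformers import TrainingArguments

# Illustrative DeepSpeed ZeRO-3 config with CPU offload; tune to your setup.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="./out",
    per_device_train_batch_size=1,
    gradient_checkpointing=True,
    bf16=True,
    deepspeed=ds_config,  # accepts a dict or a path to a JSON config file
)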
Hi @saeedkhaki92, please see my comment on the other issue! Let me know how it goes.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
It's really weird that you don't get OOM from the start (when loading the models after prepare, even before training begins). I get OOM with DeepSpeed ZeRO-3 for this setup, which is strange: how come multi-GPU (which is just DDP) doesn't OOM, while DeepSpeed ZeRO-3 (with model sharding, of course) does?