It seems that the Mixtral model is loaded in full onto every GPU rather than partitioned equally across your A100 GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_model_path,
    torch_dtype=torch.float16,
    device_map="auto",
)
Try loading it as above.
Thanks a lot for the issue!
I second what @maywind23 said - Mixtral is quite large and I don't think it'll fit on a single A100 GPU for inference. You will need to load it with device_map="auto" and simply run python xxx.py
To improve the performance of the 4-bit model, I advise you to look into the LoftQ initialization technique: https://huggingface.co/docs/peft/main/en/developer_guides/lora#loftq. Could you try that out as well?
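For reference, a minimal sketch of wiring LoftQ up through PEFT, following the linked docs (the model id and LoRA hyperparameters here are illustrative assumptions, not values from this thread):

from transformers import AutoModelForCausalLM
from peft import LoftQConfig, LoraConfig, get_peft_model

# LoftQ initializes the LoRA weights so they compensate for quantization
# error; per the PEFT docs the base model is loaded in full precision here.
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1"  # illustrative model id
)
loftq_config = LoftQConfig(loftq_bits=4)  # quantize the backbone to 4 bits
lora_config = LoraConfig(
    init_lora_weights="loftq",
    loftq_config=loftq_config,
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)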
@younesbelkada Thanks for this solution. I am using the accelerate multi-GPU config and it works well for Mixtral with DPO. My GPUs are 8x A100 40GB. However, it goes OOM if the sequence length is larger than 1024, which is small; I need at least 2048. I have enabled gradient checkpointing, decreased the batch size to 1, and switched to paged AdamW 8-bit, but it still goes OOM. Is there anything else I can do? I am not sure whether the multi-GPU config allows CPU offload the way DeepSpeed does. I would really appreciate your help. Thanks
@janphilippfranken Did deepspeed work for you? It does not work for me.
not with mixtral; so i also ended up using device_map="auto" and just running python train.py for mixtral (which seems very inefficient?).
for mistral etc. it does work.
I see, but it becomes very slow. It does not use the full capacity of the GPUs; utilization stays low.
@saeedkhaki92 To decrease the memory footprint of training your model, you might consider using Flash Attention 2: simply pass attn_implementation="flash_attention_2" in from_pretrained. Make sure to use TRL built from main to include some important fixes with respect to DPO + FA2 + Mixtral: https://github.com/huggingface/trl/pull/1290
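For illustration, a minimal loading sketch with that flag (the model id and dtype are assumptions, not values from this thread):

import torch
from transformers import AutoModelForCausalLM

# Flash Attention 2 reduces attention memory from quadratic to roughly linear
# in sequence length; it requires a half-precision dtype (fp16 or bf16).
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",  # illustrative model id
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)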
@younesbelkada Thanks. It still goes OOM even after adding attn_implementation="flash_attention_2" and setting use_cache=False.
This is my training script and how I call it:
accelerate launch --config_file ./accelerate_configs/multi_gpu.yaml --num_processes=8 \
    rlhf_dpo_4bit.py \
    --model_name_or_path="/mnt/efs/workspace/sakhaki/models/Mixtral-8x7B-Instruct-v0.1" \
    --output_dir="/mnt/efs/workspace/sakhaki/models/Mixtral-8x7B-dpo-v5" \
    --data_path="/mnt/efs/workspace/sakhaki/data/mixtral_dpo_12858.json" \
    --use_lamma2_peft_config False \
    --beta 0.1 \
    --optimizer_type adamw_bnb_8bit \
    --learning_rate 2e-5 \
    --warmup_steps 50 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --lora_r 8 \
    --max_prompt_length 1024 \
    --max_length 2048 \
    --num_train_epochs 4 \
    --logging_steps 2 \
    --save_steps 50 \
    --save_total_limit 8 \
    --eval_steps 10 \
    --gradient_checkpointing True \
    --report_to "wandb" \
    --target_modules q_proj k_proj v_proj o_proj
And this is the part of my script (rlhf_dpo_4bit.py) where I load the Mixtral model:
quantization_config = BitsAndBytesConfig(
    load_in_8bit=False, load_in_4bit=True
)
torch_dtype = torch.bfloat16
model = AutoModelForCausalLM.from_pretrained(
    script_args.model_name_or_path,
    quantization_config=quantization_config,
    device_map=get_kbit_device_map(),
    trust_remote_code=True,
    use_cache=False,
    torch_dtype=torch_dtype,
    attn_implementation="flash_attention_2",
    # use_auth_token=script_args.use_auth_token,
)
if script_args.ignore_bias_buffers:
    # torch distributed hack
    model._ddp_params_and_buffers_to_ignore = [
        name for name, buffer in model.named_buffers() if buffer.dtype == torch.bool
    ]
model_ref = None
trl version: 0.7.11.dev0
@younesbelkada Could you please let us know if there is any other way around this, like CPU offloading? As far as I know, accelerate does not have CPU offload options. I tried DeepSpeed but I am getting errors. Thanks a lot
@younesbelkada Update: I tried using the deepspeed_zero2 config with CPU offload options added, but it still goes OOM. Here is my ZeRO config:
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 8
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
My understanding is that with ZeRO-2 + offloading we should not go OOM, because excess memory would be offloaded to the CPU. I would appreciate it if you could comment on this. Thanks
Hi @saeedkhaki92
sadly I can't really tell .. ZeRO-2 could theoretically give OOMs, and the only real solution would be to go for ZeRO-3, but that is not supported by bitsandbytes / QLoRA. The other option would be to restrict the target modules to a smaller set, e.g. by removing o_proj and keeping only the q/k/v layers.
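Concretely, that would mean a LoRA config along these lines (a sketch reusing the hyperparameters from the launch command above):

from peft import LoraConfig

# Dropping o_proj from target_modules shrinks the adapter and its
# gradient/optimizer state; the other values mirror the command above.
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj"],  # o_proj removed
    task_type="CAUSAL_LM",
)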
@younesbelkada Just a quick update: I managed to get it working with ZeRO-3 + offloading and by adding:
from deepspeed.utils import set_z3_leaf_modules
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock
set_z3_leaf_modules(model, [MixtralSparseMoeBlock])
This significantly reduced memory usage. Per the DeepSpeed documentation: set_z3_leaf_modules is particularly useful in the context of Mixture of Experts (MoE) models. In MoE models, the computation order of experts varies across forward passes. This variability can disrupt ZeRO3's functionality, as ZeRO3 relies on tracking the computation order of modules to prefetch parameters efficiently. By designating a module as a 'leaf' node, ZeRO3 will prefetch parameters for all child modules upon entering the module.
Very nice, thanks for sharing! Note also that QLoRA + DS-Zero3 is now compatible if you use the latest transformers / accelerate: https://huggingface.co/docs/peft/accelerate/deepspeed
hi!
i am trying to use the DPO trainer to fine-tune a Mixtral 8x7B model in 16-bit precision (i've already completed fine-tuning a 4-bit model without issues, but unfortunately the quantized adapter performs worse than the 16-bit version of the model i want to compare it to).
my goal is to finish training an adapter in 16-bit precision, then merge and unload the adapter into the model and run inference with vLLM on the merged model.
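a minimal sketch of the merge step i have in mind (paths and model id are placeholders):

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the 16-bit base model, attach the trained DPO adapter, then fold the
# LoRA weights into the base so vLLM can load a plain checkpoint directory.
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "path/to/dpo-adapter")
merged = model.merge_and_unload()
merged.save_pretrained("path/to/merged-model")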
unfortunately, i am running into OOM issues when trying to run dpo_trainer.train() with the following setup (any help would be much appreciated):
- deepspeed config: (from https://huggingface.co/blog/accelerate-deepspeed)
- accelerate config
- training script
- Hardware: 4x 80GB A100 GPUs
- Command: accelerate launch --config_file accelerate_config.yaml train_dpo.py
error:
File "/scr/jphilipp/miniconda3/envs/scai-tuning/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1210, in all_gather_coalesced
    param_buffer = torch.empty(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 2 has a total capacty of 79.15 GiB of which 171.25 MiB is free. Including non-PyTorch memory, this process has 78.87 GiB memory in use. Of the allocated memory 76.47 GiB is allocated by PyTorch, and 1.00 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
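as the error message itself suggests, one thing i can try is capping the allocator's split size to reduce fragmentation before launching (the value below is a guess on my part, not a recommendation from this thread):

# Ask PyTorch's caching allocator to avoid large splittable blocks, which can
# reduce fragmentation at the cost of some allocation overhead.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
accelerate launch --config_file accelerate_config.yaml train_dpo.py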