
More memory consumption than litgpt #30629

Open getao opened 4 months ago

getao commented 4 months ago

System Info

transformers=4.40.1 pytorch=2.2.1 deepspeed=0.14.1 accelerate=0.29.0

Who can help?

@pacman100 Hello, I tried fine-tuning LLMs using transformers' Trainer with DeepSpeed ZeRO-3 or FSDP. While I find it very easy to use, it seems to consume more memory than, for example, litgpt's FSDP.

For example, when I fine-tune a 7B LLM (bf16, FlashAttention, max context length = 2048) with transformers' code (ZeRO-3 or FSDP), a local batch size of 2 leads to OOM on an 8x40GB A100 node, while with litgpt the local batch size can be set to 6 without OOM.
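
For reference, a rough back-of-the-envelope estimate (my own assumed breakdown, not numbers measured in this setup) of the per-GPU memory that model and optimizer states alone should take under ZeRO-3 with mixed-precision Adam:

# Rough per-GPU estimate for a 7B-parameter model under ZeRO-3 on 8 GPUs.
# Assumed breakdown per parameter: bf16 weights (2 B) + bf16 grads (2 B)
# + fp32 master weights, Adam exp_avg and exp_avg_sq (4 B each) = 16 B.
params = 7e9
total_gb = params * 16 / 1e9          # ~112 GB of model/optimizer states in total
per_gpu_gb = total_gb / 8             # ZeRO-3 shards them across 8 GPUs -> ~14 GB
print(f"states per GPU: ~{per_gpu_gb:.0f} GB; "
      f"~{40 - per_gpu_gb:.0f} GB left for activations and buffers")

Under these assumptions most of the 40 GB should still be available for activations, which is why the gap to litgpt surprises me.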

My fine-tuning script is as follows:

deepspeed --num_gpus=$gpu code/train.py --adam_beta2 0.99 --adam_epsilon 1e-8 --num_train_epochs $epoch --per_device_train_batch_size $batch --per_device_eval_batch_size $batch --gradient_accumulation_steps $accumulation_steps --gradient_checkpointing \
          --learning_rate $lr --warmup_steps 30 --max_grad_norm 2.0 --seed $seed --data_seed $seed --logging_steps 10 --save_strategy 'epoch' --evaluation_strategy 'epoch' \
          --bf16 --output_dir $OUT_DIR --logging_dir $OUT_DIR --model_name_or_path $PRETRAINED_MODEL_PATH --tokenizer_name $TOKENIZER_PATH \
          --deepspeed zero3.json
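
The contents of zero3.json are not shown here, but ZeRO-3 memory behavior depends heavily on them (offloading, prefetch and persistence thresholds, etc.). Purely as a sketch, a minimal config of the kind the Trainer expects could look like the following dict (the same content can be saved as zero3.json; all values are assumptions, not the actual file):

# Minimal ZeRO-3 config sketch, expressed as a Python dict. "auto" lets the
# Trainer fill in values from TrainingArguments. Uncommenting the offload
# entries trades training speed for GPU memory.
zero3_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
        # "offload_optimizer": {"device": "cpu"},
        # "offload_param": {"device": "cpu"},
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

Offloading the optimizer states to CPU in particular usually frees a large chunk of GPU memory at the cost of speed.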

The training code is also straightforward:

import os

import torch
from transformers import AutoModelForCausalLM, Trainer
from transformers.trainer_utils import get_last_checkpoint


def train_model(model, train_dataset, eval_dataset, training_args, data_collator=None):
    # Resume from the latest checkpoint in output_dir unless one is given explicitly.
    last_checkpoint = None
    if os.path.isdir(training_args.output_dir):
        last_checkpoint = get_last_checkpoint(training_args.output_dir)
        if last_checkpoint is not None and training_args.resume_from_checkpoint is None:
            print(
                f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
                "the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
            )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=data_collator,
    )

    # Prefer an explicitly requested checkpoint over the auto-detected one.
    checkpoint = None
    if training_args.resume_from_checkpoint is not None:
        checkpoint = training_args.resume_from_checkpoint
    elif last_checkpoint is not None:
        checkpoint = last_checkpoint

    print(f"Loaded from the checkpoint: {checkpoint}")

    train_result = trainer.train(resume_from_checkpoint=checkpoint)

    trainer.save_model()
    trainer.log_metrics("train", train_result.metrics)
    metrics = trainer.evaluate()
    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)


# Load the model in bf16 with FlashAttention-2 and hand it to the Trainer.
model = AutoModelForCausalLM.from_pretrained(
    model_args.model_name_or_path,
    torch_dtype=torch.bfloat16,
    use_flash_attention_2=True,
    resume_download=True,
)
train_model(model, train_dataset, eval_dataset, training_args)
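
One thing worth double-checking (a sketch with my own assumptions, e.g. the placeholder model id): as far as I understand, under ZeRO-3 the Trainer integration only partitions parameters during from_pretrained if the DeepSpeed config has already been registered, i.e. if TrainingArguments is constructed before the model is loaded; otherwise each rank first materializes the full 7B model.

import torch
from transformers import AutoModelForCausalLM, HfArgumentParser, TrainingArguments
from transformers.integrations import is_deepspeed_zero3_enabled

# Parse --deepspeed zero3.json (and the other CLI flags) first, so the
# DeepSpeed config is registered before the model is loaded.
parser = HfArgumentParser(TrainingArguments)
(training_args,) = parser.parse_args_into_dataclasses()

print("ZeRO-3 partitioned init enabled:", is_deepspeed_zero3_enabled())

# With ZeRO-3 active, from_pretrained() loads the weights sharded instead of
# materializing the full model on every rank.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",      # placeholder model id
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

If is_deepspeed_zero3_enabled() prints False at load time, the extra memory may come from the unsharded load rather than from training itself.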

Could you let me know whether there is something wrong with my code that leads to the additional memory cost?

Thank you

Information

Tasks

Reproduction

It can be reproduced by running the script and code provided above.

Expected behavior

The local batch size can be increased to 6 without OOM.

amyeroberts commented 2 months ago

cc @muellerzr @SunMarc

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

ArthurZucker commented 1 week ago

I would recommend trying the Liger kernels! 🤗 #32889
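
A minimal sketch of what that could look like, assuming a recent transformers release where TrainingArguments exposes a use_liger_kernel flag and the liger-kernel package is installed (pip install liger-kernel):

from transformers import TrainingArguments

# Sketch only: enables the fused Liger kernels (RMSNorm, RoPE, SwiGLU,
# cross-entropy) for supported model architectures during training.
training_args = TrainingArguments(
    output_dir="out",                  # placeholder path
    bf16=True,
    gradient_checkpointing=True,
    per_device_train_batch_size=2,
    use_liger_kernel=True,
    deepspeed="zero3.json",
)

The fused cross-entropy in particular avoids materializing the full logits tensor, which is often the largest single activation for long sequences and large vocabularies.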