Here's another question: in the newer versions of the Transformers package, from_pretrained loads safetensors weights by default. How can I make it load pytorch_model.bin instead? Is there a parameter I can specify?
Hi @Excuses123, thanks for raising this issue.
Without knowing the model or dataset, we're unable to reproduce and won't be able to debug this issue. Is there a minimal reproducible snippet, with a public dataset and model checkpoint, where this issue (increased memory footprint) still occurs that you could share?
To force the model to not load safetensors weights, you can pass use_safetensors=False in the from_pretrained call.
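For example, a minimal sketch of that call (the bloomz checkpoint mentioned below is used purely for illustration):

```python
from transformers import AutoModelForCausalLM

# Skip the .safetensors weights and load the pytorch_model.bin checkpoint instead.
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloomz-1b1",
    use_safetensors=False,
)
```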
@amyeroberts Thank you for your response.
I am using the model: bigscience/bloomz-1b1
The data can be found at: https://huggingface.co/datasets/BelleGroup/train_0.5M_CN/blob/main/Belle_open_source_0.5M.json
Below is the execution script:
torchrun --nproc_per_node=4 --master_port=12345 train.py \
--model_name_or_path bigscience/bloomz-1b1 \
--cache_dir /workspace/pretrain_model/bloomz \
--output_dir /workspace/finetune_model/bloomz/bloomz_1b1_sft \
--data_path /workspace/datasets/Belle_train_0.5M_CN/Belle_open_source_0.5M.json \
--fp16 True \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 32 \
--model_max_length 512 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'BloomBlock' \
--report_to "tensorboard"
After testing, the latest version that runs for me is 4.29.2; all versions after that fail to run.
I suspect it might be caused by FSDP (Fully Sharded Data Parallel), but I'm not sure.
@Excuses123 Have you tried running without FSDP? Which version of accelerate are you running?
@amyeroberts I have tried it, and without FSDP, both the new and old versions of transformers throw an OOM error. My accelerate version is 0.20.3.
> both the new and old versions of transformers throw an OOM error.
@Excuses123 Is this including versions <= 4.29.2 ?
@amyeroberts I have tried version 4.29.0 and it works
@Excuses123 OK, thanks for confirming.
Could you:
* make sure the code examples in the issue are properly formatted, i.e. wrapped in three backticks like ``` code goes here ```
* let us know which version of datasets you're running
* try running on the latest version of transformers?
@amyeroberts I have fixed the code formatting, and the version of my datasets is 2.11.0. My machine is currently running a task, and as soon as it is finished, I will try the latest version.
Facing the same issue. Code ran smoothly with transformers==4.28.1 but OOM with transformers==4.30.2
@Excuses123 @larrylawl OK, thanks for the information and updates.
I'm going to cc @pacman100 and @younesbelkada who know more about training in fp16 and torchrun
I can confirm this. It is a bug introduced recently. It can be reproduced by the Vicuna training example. The script works well for 4.28.1 but hits OOM with 4.31.0.
With 4.31.0, the warnings are:
FSDP Warning: When using FSDP, it is efficient and recommended to call prepare for the model before creating the optimizer
FSDP Warning: When using FSDP, several parameter groups will be conflated into a single one due to nested module wrapping and parameter flattening.
To fix it, I followed the guide and changed these lines (https://github.com/huggingface/transformers/blob/e42587f596181396e1c4b63660abf0c736b10dae/src/transformers/trainer.py#L1646-L1661) to
model = self.accelerator.prepare(model)
if delay_optimizer_creation:
    self.create_optimizer_and_scheduler(num_training_steps=max_steps)
self.optimizer = self.accelerator.prepare(self.optimizer)
Then the warnings and OOM disappeared.
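For context, a minimal sketch (not the actual Trainer code) of the ordering that warning asks for: wrap the model with prepare() first, then build and prepare the optimizer from the wrapped model's parameters. The Linear layer below is just a stand-in for the real model, and FSDP is assumed to have been configured via `accelerate config`.

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # assumes FSDP is configured, e.g. via `accelerate config`

model = torch.nn.Linear(1024, 1024)   # stand-in for the real model
model = accelerator.prepare(model)    # shard/wrap with FSDP first

# Build the optimizer only after wrapping, so it sees the flattened FSDP parameters,
# then let Accelerate prepare it as well.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
optimizer = accelerator.prepare(optimizer)
```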
@pacman100 @younesbelkada I think my fix is a hack that only works for my case. Could you do a more complete fix in the main branch?
Hello @Ying1123, Thank you for the detailed info, very helpful. Could you please try out the above PRs for accelerate and transformers and see if it fixes the OOM?
Thanks @pacman100, cherry-picking the PRs onto transformers v4.31.0 and accelerate v0.21.0 works for me.
@pacman100 Hi, I am still getting out-of-memory issues with the latest main. With transformers==4.28.1, the vicuna-7b example runs on 4 x A100 (40GB) without any issues.
Since FSDP handling moved to Accelerate (from v4.30 through the current main), the example hits OOM. Before your fix, the example hit OOM immediately; after your fix, it hits OOM after a few batches.
From these observations, I can confirm that the recent refactoring makes the memory usage higher than in the older version, but I do not know how to debug this because I am not familiar with Accelerate. Could you do more testing and help us fix it? This blocks us from updating transformers to the latest version.
Hello @merrymercy, can you post the VRAM usage with the 4.28 version?
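In case it helps with reporting numbers, here is a small sketch of one way to log peak GPU memory per process using plain torch.cuda APIs (where you call it inside your own training script is up to you):

```python
import torch

def report_peak_memory(tag: str = "") -> None:
    # Peak memory allocated by tensors on the current device since the start
    # of the process (or since the last reset of peak stats).
    if torch.cuda.is_available():
        peak_gb = torch.cuda.max_memory_allocated() / 1024**3
        print(f"[{tag}] peak allocated: {peak_gb:.2f} GB")

# e.g. call report_peak_memory("after step 10") inside the training loop
```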
Hi @pacman100 @Ying1123, I am hitting the same OOM issue. I tried transformers 4.31.0 and 4.30.0 with accelerate==0.21.0, and none of them worked. On 2 x A6000 48GB, fine-tuning LLaMA 7B with transformers==4.31.0 and accelerate==0.22.0.dev0 (latest main), the warnings are:
FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
FSDP Warning: When using FSDP, it is efficient and recommended to call prepare for the model before creating the optimizer.
FSDP Warning: When using FSDP, several parameter groups will be conflated into a single one due to nested module wrapping and parameter flattening.
My FSDP arguments are:
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
@pacman100 @Ying1123 I also found that adding an fsdp_config.json makes this warning disappear:
FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
And the hack above makes these warnings disappear:
FSDP Warning: When using FSDP, it is efficient and recommended to call prepare for the model before creating the optimizer.
FSDP Warning: When using FSDP, several parameter groups will be conflated into a single one due to nested module wrapping and parameter flattening.
But all of these still hit OOM! My fsdp_config.json is:
{
    "fsdp_auto_wrap_policy": "FULL_SHARD",
    "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer"
}
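For reference, a hedged sketch of expressing equivalent settings through TrainingArguments instead of the CLI flags (field names as of transformers ~4.31; paths and the output_dir are placeholders, and this alone is not claimed to resolve the OOM):

```python
from transformers import TrainingArguments

# fsdp_config accepts either a dict or a path to a JSON file such as fsdp_config.json.
training_args = TrainingArguments(
    output_dir="out",
    fp16=True,  # assumes a CUDA-capable multi-GPU setup, as in the thread
    fsdp="full_shard auto_wrap",
    fsdp_config={"fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer"},
)
```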
I think there is a better way to fix this.
I see the same memory usage across versions for the following example:
cd transformers
export TASK_NAME=mrpc
torchrun --nnodes 1 --nproc-per-node 2 ./examples/pytorch/text-classification/run_glue.py --model_name_or_path bert-base-cased --task_name $TASK_NAME --do_train --do_eval --max_seq_length 128 --per_device_train_batch_size 16 --learning_rate 5e-5 --num_train_epochs 3 --output_dir /tmp/$TASK_NAME/ --overwrite_output_dir --fsdp "full_shard auto_wrap" --fsdp_transformer_layer_cls_to_wrap BertLayer --bf16
version 4.28.1: 5.4GB VRAM
latest main branch: 4.8GB VRAM
Please provide a minimal example that I can directly run without having to spend time in getting it to work.
You mean transformers = the latest main branch and accelerate = 0.21.0?
Both Accelerate and Transformers main branch
With the main branches of both Accelerate and Transformers, it works for me.
@Xuekai-Zhu did you fix the problem? I hit the same OOM on 2 x A6000 with both main branches.
I can confirm that @Ying1123's hack does not work for me. I have 4 A100 cards, with transformers==4.31.0 and accelerate==0.21.0. Downgrading to transformers==4.28.1 worked for me instead.
I tried all the solutions and am still getting OOM on an A100 80GB.
If you still have an issue, I suggest you create a new issue, share a reproducer and a traceback, and ping @pacman100; otherwise there is no way we can help you 😓
System Info
transformers version: 4.29.0

Who can help?
No response
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)

Reproduction
Here is my code.
Expected behavior
Has anyone encountered this problem? I used the same instruction fine-tuning code. It runs successfully with transformers package version 4.29.0, but when I upgrade to version 4.30.2, it fails to run and throws an OOM (Out of Memory) error. Does anyone know the reason behind this?
Below is the GPU status during my successful run.