huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Setting fsdp and bf16 doesn't save memory #22821

Closed skye95git closed 1 year ago

skye95git commented 1 year ago

### System Info

### Who can help?

@ArthurZucker @sgu

### Information

### Tasks

### Reproduction

1. download the dataset

```python
lang = "Python"

import subprocess
subprocess.call(["wget", f"https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/{lang}.zip"])
subprocess.call(["unzip", f"/content/{lang}.zip"])

!mkdir "log"
log_dir = "/content/log"
!mkdir "data"
data_dir = "/content/data"
!mkdir "model"
model_dir = "/content/model"
!mkdir "tokenizer"
tokenizer_dir = "/content/tokenizer"
```


2. data preprocess

```python
import os
import json
import torch
from pathlib import Path
from transformers import (Trainer, pipeline, RobertaConfig, TrainingArguments,
                          RobertaForMaskedLM, RobertaTokenizerFast,
                          LineByLineTextDataset, DataCollatorForLanguageModeling)

from tokenizers import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing
from tokenizers.implementations import ByteLevelBPETokenizer

def prepare_text(dir_path):
    for path in os.listdir(dir_path):
        os.system(f"gunzip -k {dir_path}/{path}")

    texts = ""
    for path in os.listdir(dir_path):
        if path.endswith(".jsonl"):
            with open(dir_path + "/" + path, 'r') as f:
                sample_file = f.readlines()
            for sample in sample_file:
                obj = json.loads(sample)
                texts += obj["original_string"].replace("\n", "").replace("\t", "") + "\n"
    return texts

train1_texts = prepare_text(f"/content/{lang}/final/jsonl/train")
train2_texts = prepare_text(f"/content/{lang}/final/jsonl/valid")
train_texts = train1_texts + "\n" + train2_texts
valid_texts = prepare_text(f"/content/{lang}/final/jsonl/test")

for path, text in zip(["train_texts.txt", "valid_texts.txt"], [train_texts, valid_texts]):
    with open(f"{data_dir}/{path}", "w") as f:
        f.write(text)
```


3. Train a tokenizer

```python
paths = [str(x) for x in Path(f"{data_dir}/").glob("**/*.txt")]
tokenizer = ByteLevelBPETokenizer()

tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

tokenizer.save_model(tokenizer_dir)

tokenizer = ByteLevelBPETokenizer(
    "tokenizer/vocab.json",
    "tokenizer/merges.txt",
)

tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)
```


4. Build model

```python
config = RobertaConfig(
    vocab_size=52_000, max_position_embeddings=514,
    num_attention_heads=12, num_hidden_layers=6, type_vocab_size=1,
)

tokenizer = RobertaTokenizerFast.from_pretrained(tokenizer_dir, max_len=512)

model = RobertaForMaskedLM(config=config)
model.num_parameters()

train_dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path=f"{data_dir}/train_texts.txt", block_size=128,
)

test_dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path=f"{data_dir}/valid_texts.txt", block_size=128,
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir=model_dir, overwrite_output_dir=True, num_train_epochs=4,
    per_gpu_train_batch_size=64, save_steps=5000, do_eval=True, logging_dir=log_dir,
)

trainer = Trainer(
    model=model, args=training_args, data_collator=data_collator,
    train_dataset=train_dataset, eval_dataset=test_dataset,
)

trainer.train()

trainer.save_model(model_dir)

tokenizer.save_pretrained(tokenizer_dir)
```


### Expected behavior

Before setting `fsdp` and `bf16`:

```python
training_args = TrainingArguments(
    output_dir=model_dir, overwrite_output_dir=True, num_train_epochs=4,
    per_gpu_train_batch_size=64, save_steps=5000, do_eval=True, logging_dir=log_dir,
)
```

<img width="417" alt="Snipaste_2023-04-18_15-42-22" src="https://user-images.githubusercontent.com/41561936/232707188-2579965b-92fd-4ba6-87de-b82ca948ec54.png">

After setting `fsdp=True` and `bf16=True`:

```python
training_args = TrainingArguments(
    output_dir=model_dir, overwrite_output_dir=True, num_train_epochs=4,
    per_gpu_train_batch_size=64, save_steps=5000, do_eval=True, logging_dir=log_dir,
    fsdp=True, bf16=True,
)
```


<img width="415" alt="Snipaste_2023-04-18_15-42-45" src="https://user-images.githubusercontent.com/41561936/232707483-2b89c658-172d-4a23-a7fc-fe40cd1dfe83.png">

The memory usage is barely different, so enabling these options does not achieve the expected memory savings. Why?

I also tried setting `per_gpu_train_batch_size=4` with `fsdp=True, bf16=True`:
<img width="426" alt="Snipaste_2023-04-18_15-49-23" src="https://user-images.githubusercontent.com/41561936/232708818-efa676d9-4e6b-440a-b0e0-e66e54026da5.png">

Compared with the results of the previous experiment, the increase in memory usage is much greater than the increase in batch size. Why?
amyeroberts commented 1 year ago

cc @younesbelkada

younesbelkada commented 1 year ago

cc @pacman100 as I am not really familiar with FSDP + Trainer yet

pacman100 commented 1 year ago

Hello @skye95git, you are using FSDP incorrectly; just setting `fsdp=True` won't reduce memory usage. Please refer to:

  1. the docs here if you want to use the Trainer's arguments (a short sketch follows this list): https://huggingface.co/docs/transformers/main_classes/trainer#pytorch-fully-sharded-data-parallel
  2. the docs here if you want to use the accelerate launcher with the Trainer: https://huggingface.co/docs/transformers/main/en/main_classes/trainer#using-accelerate-launcher-with-trainer
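For illustration, here is a minimal sketch of the Trainer-argument approach, assuming a multi-GPU machine and a transformers version close to the one in this issue; the exact `fsdp` options and `fsdp_config` key names have changed across versions and should be checked against the docs above, and `RobertaLayer` is simply the transformer block class of the model used here:

```python
# Sketch only: FSDP enabled through Trainer arguments instead of fsdp=True.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=model_dir, overwrite_output_dir=True, num_train_epochs=4,
    per_device_train_batch_size=64, save_steps=5000, do_eval=True, logging_dir=log_dir,
    bf16=True,
    # Shard parameters and gradients across GPUs and auto-wrap each transformer block.
    fsdp="full_shard auto_wrap",
    fsdp_config={"fsdp_transformer_layer_cls_to_wrap": "RobertaLayer"},
)
```

FSDP only shards across processes, so the script also has to be launched with one process per GPU, e.g. `torchrun --nproc_per_node=2 your_script.py`; on a single process the memory footprint will look much like before.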
bastia0321 commented 1 year ago

> Hello @skye95git, you are using FSDP incorrectly; just setting `fsdp=True` won't reduce memory usage. Please refer to:
>
> 1. the docs here if you want to use the Trainer's arguments: https://huggingface.co/docs/transformers/main_classes/trainer#pytorch-fully-sharded-data-parallel
> 2. the docs here if you want to use the accelerate launcher with the Trainer: https://huggingface.co/docs/transformers/main/en/main_classes/trainer#using-accelerate-launcher-with-trainer

Hi @pacman100, thanks for the reply. However, from https://github.com/huggingface/transformers/blob/fe861e578f50dc9c06de33cd361d2f625017e624/src/transformers/trainer.py#L1526C1-L1526C39 it seems that FSDP is only enabled there when XLA is used, is that correct? If `fsdp_config['xla']` is None, how is FSDP applied in this version?
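For context, my understanding is that when `fsdp_config['xla']` is not set, the Trainer in this version falls back to PyTorch's native FSDP wrapper from `torch.distributed`. Roughly, that wrapping amounts to something like the sketch below (this is an illustration, not the Trainer's actual code path; the choice of `RobertaLayer` and the bf16 mixed-precision policy are assumptions to match this issue):

```python
# Illustration of native (non-XLA) PyTorch FSDP wrapping, not the Trainer's actual code.
# Assumes the script is launched with torchrun, which provides the environment variables
# that init_process_group() needs, with one process per GPU.
import functools

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import RobertaConfig, RobertaForMaskedLM
from transformers.models.roberta.modeling_roberta import RobertaLayer

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = RobertaForMaskedLM(RobertaConfig(vocab_size=52_000, max_position_embeddings=514))

# Wrap each RobertaLayer so its parameters (and the matching gradients and optimizer
# state) are sharded across ranks, and run compute in bf16.
auto_wrap = functools.partial(transformer_auto_wrap_policy, transformer_layer_cls={RobertaLayer})
model = FSDP(
    model.cuda(),
    auto_wrap_policy=auto_wrap,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
)
```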

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.