Closed: VRSupriya closed this issue 1 year ago.
The optimizer states and parameters make up the bulk of the model state. Because you have offloaded both to the CPU, the added benefit of training across multiple GPUs is not a memory reduction but the ability to increase the batch size.
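For a concrete way to see this, DeepSpeed ships a ZeRO-3 memory estimator that prints the expected per-GPU and per-CPU model-state footprint for each offload combination. The snippet below is only a sketch (the model name, `trust_remote_code` flag, and node/GPU counts are illustrative; the model is loaded purely to count parameters):

```python
# Rough estimate of ZeRO-3 model-state memory for different topologies.
# Illustrative sketch: the model name and node/GPU counts are assumptions,
# not necessarily the poster's exact setup.
from transformers import AutoModelForCausalLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

# Load on CPU just to count parameters; no training happens here.
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    trust_remote_code=True,  # may be required depending on your transformers version
)

# Single node with one GPU vs. two nodes with one GPU each.
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=2)
```

Note that these estimates cover model states only (parameters, gradients, optimizer states). Activations, temporary buffers, and communication buckets come on top, which is usually where the remaining GPU memory goes.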
This question deserves a better answer. I've been trying to do exactly the same thing, and it appears DeepSpeed just makes the memory requirements larger. I've been trying for months to do distributed training with models like falcon-7b with no luck. I have 6 nodes, each with a 24 GB GPU. I found a model that works, but it's tiny, and even though the DeepSpeed calculator says it should only need 34 GB in total, each of the 6 nodes uses over 20 GB of VRAM, for a total of 120 GB. I also feel like I am missing something obvious. I don't think DeepSpeed ZeRO-3 actually saves any memory.
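One way to check where the memory actually goes on each rank is DeepSpeed's `see_memory_usage` helper (the script later in this issue already imports it). A minimal sketch, with call sites that are purely illustrative:

```python
# Illustrative: print per-rank GPU/CPU memory at a few points during setup.
# Where exactly to place these calls is up to you; these spots are assumptions.
from deepspeed.runtime.utils import see_memory_usage

see_memory_usage("before model init", force=True)
# ... build the model / accelerator.prepare(...) here ...
see_memory_usage("after model init", force=True)
# ... run one forward/backward/optimizer step ...
see_memory_usage("after first training step", force=True)
```

Comparing the "allocated" numbers across ranks right after model initialization shows whether the parameters are really being partitioned; if the jump only happens during the forward/backward pass, activations and communication buckets are the dominant cost rather than model states.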
I am attempting multi-node training of Falcon-7B with PEFT using DeepSpeed and Accelerate. During single-node training it takes up 39 GB of GPU memory, but in multi-node training both machines consume 40 GB each. Shouldn't multi-node training reduce memory usage?
Expected behavior
Reduced memory usage per GPU.
ds_report output
```
[2023-07-31 08:56:06,593] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING] using untested triton version (2.1.0+9e3e10c5ed), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/anaconda3/envs/venv/lib/python3.10/site-packages/torch']
torch version .................... 2.0.0+cu117
deepspeed install path ........... ['/anaconda3/envs/venv/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.10.0, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 12.0
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7
```
System info (please complete the following information):
Launcher context
```bash
accelerate launch --config_file ds_zero3_multinode.yaml run_clm_no_trainer_lora.py \
  --model_name_or_path "tiiuae/falcon-7b" \
  --dataset_name "train.json" \
  --block_size 2048 \
  --learning_rate 3e-5 \
  --per_device_train_batch_size 1 \
  --per_device_eval_batch_size 1 \
  --num_train_epochs 10 \
  --num_warmup_steps 2000 \
  --checkpointing_steps 1000 \
  --preprocessing_num_workers 8 \
  --with_tracking \
  --output_dir "output/lora_test" \
  --report_to "tensorboard"
```
The Accelerate configuration file (ds_zero3_multinode.yaml):
```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: zero_stage3_offload_config.json
  deepspeed_hostfile: /path/to/hostfile
  deepspeed_multinode_launcher: pdsh
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_process_ip:
main_process_port:
main_training_function: main
num_machines: 2
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
Hostfile:

```
machine1 slots=1
machine2 slots=1
```
DeepSpeed config file (zero_stage3_offload_config.json):
{ "fp16": { "enabled": false, "loss_scale": 1024, "loss_scale_window": 1000, "initial_scale_power": 4, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled":true }, "optimizer": { "type": "Adamw", "params": { "lr": "auto", "weight_decay": "auto" } }, "scheduler": { "type": "WarmupLR", "params": { "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto" } }, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "offload_param": { "device": "cpu", "pin_memory": true }, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 1e7, "stage3_max_reuse_distance": 1e7, "stage3_gather_16bit_weights_on_model_save": true }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 1, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": true }
Code (run_clm_no_trainer_lora.py):
```python
import argparse
import json
import logging
import math
import os
import random
from itertools import chain
from pathlib import Path

import datasets
import torch
from accelerate import Accelerator, DistributedType
from accelerate.logging import get_logger
from accelerate.utils import set_seed
from datasets import load_dataset
from huggingface_hub import Repository, create_repo
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
import psutil
import gc
import threading

import transformers
from transformers import (
    CONFIG_MAPPING,
    MODEL_MAPPING,
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    SchedulerType,
    default_data_collator,
    get_scheduler,
)
from transformers.utils import check_min_version, get_full_repo_name, send_example_telemetry
from transformers.utils.versions import require_version
from peft import LoraConfig, TaskType, get_peft_model
from peft.utils.other import fsdp_auto_wrap_policy
from accelerate.utils import DummyOptim, DummyScheduler
from deepspeed.runtime.utils import see_memory_usage

logger = get_logger(__name__)

MODEL_CONFIG_CLASSES = list(MODEL_MAPPING.keys())
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)


def b2mb(x):
    return int(x / 2**20)


def parse_args():
    parser = argparse.ArgumentParser(description="Finetune a transformers model on a causal language modeling task")
    parser.add_argument(
        "--dataset_name", type=str, default=None,
        help="The name of the dataset to use (via the datasets library).",
    )
    parser.add_argument(
        "--dataset_config_name", type=str, default=None,
        help="The configuration name of the dataset to use (via the datasets library).",
    )
    parser.add_argument(
        "--train_file", type=str, default=None, help="A csv or a json file containing the training data."
    )
    parser.add_argument(
        "--validation_file", type=str, default=None, help="A csv or a json file containing the validation data."
    )
    parser.add_argument(
        "--validation_split_percentage", default=1,
        help="The percentage of the train set used as validation set in case there's no validation split",
    )
    parser.add_argument(
        "--model_name_or_path", type=str, required=False,
        help="Path to pretrained model or model identifier from huggingface.co/models.",
    )
    parser.add_argument(
        "--config_name", type=str, default=None,
        help="Pretrained config name or path if not the same as model_name",
    )
    parser.add_argument(
        "--tokenizer_name", type=str, default=None,
        help="Pretrained tokenizer name or path if not the same as model_name",
    )
    parser.add_argument(
        "--use_slow_tokenizer", action="store_true",
        help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).",
    )
    parser.add_argument(
        "--per_device_train_batch_size", type=int, default=8,
        help="Batch size (per device) for the training dataloader.",
    )
    parser.add_argument(
        "--per_device_eval_batch_size", type=int, default=8,
        help="Batch size (per device) for the evaluation dataloader.",
    )
    parser.add_argument(
        "--learning_rate", type=float, default=5e-5,
        help="Initial learning rate (after the potential warmup period) to use.",
    )
    parser.add_argument("--weight_decay", type=float, default=0.0, help="Weight decay to use.")
    parser.add_argument("--num_train_epochs", type=int, default=3, help="Total number of training epochs to perform.")
    parser.add_argument(
        "--max_train_steps", type=int, default=None,
        help="Total number of training steps to perform. If provided, overrides num_train_epochs.",
    )
    parser.add_argument(
        "--gradient_accumulation_steps", type=int, default=1,
        help="Number of updates steps to accumulate before performing a backward/update pass.",
    )
    parser.add_argument(
        "--lr_scheduler_type", type=SchedulerType, default="linear",
        help="The scheduler type to use.",
        choices=["linear", "cosine", "cosine_with_restarts", "polynomial", "constant", "constant_with_warmup"],
    )
    parser.add_argument(
        "--num_warmup_steps", type=int, default=0, help="Number of steps for the warmup in the lr scheduler."
    )
    parser.add_argument("--output_dir", type=str, default=None, help="Where to store the final model.")
    parser.add_argument("--seed", type=int, default=None, help="A seed for reproducible training.")
    parser.add_argument(
        "--model_type", type=str, default=None,
        help="Model type to use if training from scratch.",
        choices=MODEL_TYPES,
    )
    parser.add_argument(
        "--block_size", type=int, default=None,
        help=(
            "Optional input sequence length after tokenization. The training dataset will be truncated in block of"
            " this size for training. Default to the model max input length for single sentence inputs (take into"
            " account special tokens)."
        ),
    )
    parser.add_argument("--use_group_texts", action="store_true")
    parser.add_argument(
        "--preprocessing_num_workers", type=int, default=None,
        help="The number of processes to use for the preprocessing.",
    )
    parser.add_argument(
        "--overwrite_cache", action="store_true", help="Overwrite the cached training and evaluation sets"
    )
    parser.add_argument(
        "--no_keep_linebreaks", action="store_true", help="Do not keep line breaks when using TXT files."
    )
    parser.add_argument("--push_to_hub", action="store_true", help="Whether or not to push the model to the Hub.")
    parser.add_argument(
        "--hub_model_id", type=str, help="The name of the repository to keep in sync with the local `output_dir`."
    )
    parser.add_argument("--hub_token", type=str, help="The token to use to push to the Model Hub.")
    parser.add_argument(
        "--checkpointing_steps", type=str, default=None,
        help="Whether the various states should be saved at the end of every n steps, or 'epoch' for each epoch.",
    )
    parser.add_argument(
        "--resume_from_checkpoint", type=str, default=None,
        help="If the training should continue from a checkpoint folder.",
    )
    parser.add_argument(
        "--with_tracking", action="store_true",
        help="Whether to enable experiment trackers for logging.",
    )
    parser.add_argument(
        "--report_to", type=str, default="all",
        help=(
            'The integration to report the results and logs to. Supported platforms are `"tensorboard"`,'
            ' `"wandb"`, `"comet_ml"` and `"clearml"`. Use `"all"` (default) to report to all integrations.'
            " Only applicable when `--with_tracking` is passed."
        ),
    )
    args = parser.parse_args()

    # Sanity checks
    ...

    return args


def main():
    args = parse_args()
    ...


if __name__ == "__main__":
    main()
```
Log: log.txt