OOM when using Deepspeed ZERO-3 to train a galactica 30b model.

Describe the bug I try to use deepspeed ZERO-3 with huggingface Trainer to finetune a galactica 30b model (gpt-2 like), with 4 nodes, each 4 A100 gpu. I get oom error though the model should fit into 16 A100 with Zero 3 and cpu offload. Previously I have successfully trained a 6.7b model on 1 node, and 2 nodes respectively.

The final part of the error report is (the full log file is TLDR attached at the end of this post):

gpu-q-13: ret = input.softmax(dim, dtype=dtype)
gpu-q-13:     ret = input.softmax(dim, dtype=dtype)
gpu-q-13: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 896.00 MiB (GPU 2; 79.17 GiB total capacity; 76.68 GiB already allocated; 381.31 MiB free; 77.24 GiB reserved in total by PyTorch)

interestingly no matter how many nodes I use (1,2, or 4), the memory report line is always: MA 0.0 GB Max_MA 0.0 GB CA 55.83 GB Max_CA 56 GB i.e. the Max_CA always stays the same

My ds_config.json:

{
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",

    "optimizer": {
       "type": "AdamW",
       "params": {
        "lr": "auto",
        "betas": "auto",
        "eps": "auto",
        "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "fp16": {
        "enabled": true,
        "auto_cast": false,
        "loss_scale": 0,
        "initial_scale_power": 16,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    }
}

My code is very simple:

tokenized_datasets = load_from_disk(dataset_path) # This is pre-tokenized
print(tokenized_datasets)

tokenizer = AutoTokenizer.from_pretrained(tokenizer_path,cache_dir=cache_dir) # This matches model config. Should be the same used in "tokenize_dataset"

model = OPTForCausalLM.from_pretrained(model_path,cache_dir=cache_dir, torch_dtype=torch.float16, low_cpu_mem_usage=True) 

training_args = TrainingArguments(output_dir=save_model_path, evaluation_strategy = "epoch", deepspeed=ds_config_file_path,
    num_train_epochs=Epoch, 
    # per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    adam_beta1=0.9,
    adam_beta2=0.95,
    weight_decay=0.1,
    learning_rate=8e-5,
    max_grad_norm = 1.0,
    warmup_steps = 60,
    fp16=True,
    auto_find_batch_size=True,
    save_strategy = 'epoch',
    save_total_limit = 3,
    load_best_model_at_end = True,
    )
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)
trainer.train()

ds_report output

JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
No CUDA runtime is found, using CUDA_HOME='/usr/local/software/cuda/11.4'
DeepSpeed general environment info:
torch install path ............... ['/rds/project/rds-lSmP1cwRttU/lz429/projects/galactica/venv_transformers/lib64/python3.8/site-packages/torch']
torch version .................... 1.12.1+cu116
deepspeed install path ........... ['/rds/project/rds-lSmP1cwRttU/lz429/projects/galactica/venv_transformers/lib64/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.8.1, unknown, unknown
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.4
deepspeed wheel compiled w. ...... torch 1.12, cuda 10.2

System info:

OS: CentOS8
4x NVIDIA A100-SXM-80GB GPUs per node
Dual-rail Mellanox HDR200 InfiniBand interconnect
python 3.8.12

Launcher context deepspeed train.py

full log file (TLDR)

JobID: 14744604
======
Time: Wed Feb 22 14:20:02 GMT 2023
Running on master node: gpu-q-13
Current directory: /home/lz429/rds/rds-t2-cs151-lSmP1cwRttU/lz429/projects/galactica

Nodes allocated:
================
gpu-q-13 gpu-q-49

numtasks=2, numnodes=2, mpi_tasks_per_node=1 (OMP_NUM_THREADS=1)

Executing command:
==================
python3.8 create_hostfile.py --machine_file machine_files/machine.file.14744604 --hostfile hostfiles/hostfile_14744604; deepspeed --num_nodes=2 --hostfile=hostfiles/hostfile_14744604 train.py

[2023-02-22 14:20:36,598] [INFO] [runner.py:454:main] Using IP address of 10.43.74.19 for node gpu-q-13
[2023-02-22 14:20:36,606] [INFO] [multinode_runner.py:65:get_cmd] Running on the following workers: gpu-q-13,gpu-q-49
[2023-02-22 14:20:36,606] [INFO] [runner.py:548:main] cmd = pdsh -S -f 1024 -w gpu-q-13,gpu-q-49 export PYTHONDONTWRITEBYTECODE=1; export PYTHONPATH=/rds/project/rds-lSmP1cwRttU/lz429/projects/galactica;  cd /rds/project/rds-lSmP1cwRttU/lz429/projects/galactica; /rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/bin/python3.8 -u -m deepspeed.launcher.launch --world_info=eyJncHUtcS0xMyI6IFswLCAxLCAyLCAzXSwgImdwdS1xLTQ5IjogWzAsIDEsIDIsIDNdfQ== --node_rank=%n --master_addr=10.43.74.19 --master_port=29500 train.py
gpu-q-13: [2023-02-22 14:20:44,177] [INFO] [launch.py:142:main] WORLD INFO DICT: {'gpu-q-13': [0, 1, 2, 3], 'gpu-q-49': [0, 1, 2, 3]}
gpu-q-13: [2023-02-22 14:20:44,177] [INFO] [launch.py:148:main] nnodes=2, num_local_procs=4, node_rank=0
gpu-q-13: [2023-02-22 14:20:44,177] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'gpu-q-13': [0, 1, 2, 3], 'gpu-q-49': [4, 5, 6, 7]})
gpu-q-13: [2023-02-22 14:20:44,177] [INFO] [launch.py:162:main] dist_world_size=8
gpu-q-13: [2023-02-22 14:20:44,177] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
gpu-q-49: [2023-02-22 14:20:46,510] [INFO] [launch.py:142:main] WORLD INFO DICT: {'gpu-q-13': [0, 1, 2, 3], 'gpu-q-49': [0, 1, 2, 3]}
gpu-q-49: [2023-02-22 14:20:46,510] [INFO] [launch.py:148:main] nnodes=2, num_local_procs=4, node_rank=1
gpu-q-49: [2023-02-22 14:20:46,510] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'gpu-q-13': [0, 1, 2, 3], 'gpu-q-49': [4, 5, 6, 7]})
gpu-q-49: [2023-02-22 14:20:46,510] [INFO] [launch.py:162:main] dist_world_size=8
gpu-q-49: [2023-02-22 14:20:46,510] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3

gpu-q-13: [2023-02-22 14:30:01,010] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
gpu-q-49: Using cuda_amp half precision backend
gpu-q-13: Using cuda_amp half precision backend
gpu-q-13: The following columns in the training set don't have a corresponding argument in `OPTForCausalLM.forward` and have been ignored: token_type_ids. If token_type_ids are not expected by `OPTForCausalLM.forward`,  you can safely ignore this message.
gpu-q-49: The following columns in the training set don't have a corresponding argument in `OPTForCausalLM.forward` and have been ignored: token_type_ids. If token_type_ids are not expected by `OPTForCausalLM.forward`,  you can safely ignore this message.
gpu-q-13: [2023-02-22 14:30:03,012] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.8.0, git-hash=unknown, git-branch=unknown
gpu-q-13: [2023-02-22 14:30:12,560] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
gpu-q-49: [2023-02-22 14:30:12,848] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
gpu-q-49: [2023-02-22 14:30:12,848] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
gpu-q-49: [2023-02-22 14:30:12,849] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
gpu-q-49: [2023-02-22 14:30:12,853] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
gpu-q-13: [2023-02-22 14:30:12,854] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
gpu-q-13: [2023-02-22 14:30:12,854] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
gpu-q-13: [2023-02-22 14:30:12,855] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
gpu-q-13: [2023-02-22 14:30:12,855] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
gpu-q-49: Installed CUDA version 11.4 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
gpu-q-49: Installed CUDA version 11.4 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
gpu-q-13: Installed CUDA version 11.4 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
gpu-q-13: Installed CUDA version 11.4 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
gpu-q-13: Installed CUDA version 11.4 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
gpu-q-49: Installed CUDA version 11.4 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
gpu-q-49: Installed CUDA version 11.4 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
gpu-q-13: Installed CUDA version 11.4 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
gpu-q-13: Using /home/lz429/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
gpu-q-13: Using /home/lz429/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
gpu-q-49: Using /home/lz429/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
gpu-q-49: Using /home/lz429/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
gpu-q-13: Using /home/lz429/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
gpu-q-49: Using /home/lz429/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
gpu-q-49: Using /home/lz429/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
gpu-q-13: Using /home/lz429/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
gpu-q-13: Detected CUDA files, patching ldflags
gpu-q-13: Emitting ninja build file /home/lz429/.cache/torch_extensions/py38_cu117/cpu_adam/build.ninja...
gpu-q-13: Building extension module cpu_adam...
gpu-q-13: Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
gpu-q-13: ninja: no work to do.
gpu-q-13: Loading extension module cpu_adam...
gpu-q-13: Loading extension module cpu_adam...
gpu-q-13: Time to load cpu_adam op: 0.5527675151824951 seconds
gpu-q-13: Time to load cpu_adam op: 0.5527572631835938 seconds
gpu-q-13: Loading extension module cpu_adam...
gpu-q-13: Time to load cpu_adam op: 0.5637121200561523 seconds
gpu-q-13: Loading extension module cpu_adam...
gpu-q-13: Time to load cpu_adam op: 0.4925413131713867 seconds
gpu-q-49: Loading extension module cpu_adam...Loading extension module cpu_adam...
gpu-q-49: 
gpu-q-49: Time to load cpu_adam op: 3.3906493186950684 seconds
gpu-q-49: Time to load cpu_adam op: 3.3641982078552246 seconds
gpu-q-49: Loading extension module cpu_adam...
gpu-q-49: Time to load cpu_adam op: 3.457759141921997 seconds
gpu-q-49: Loading extension module cpu_adam...
gpu-q-49: Time to load cpu_adam op: 3.4651191234588623 seconds
gpu-q-13: Adam Optimizer #0 is created with AVX2 arithmetic capability.
gpu-q-13: Config: alpha=0.000080, betas=(0.900000, 0.950000), weight_decay=0.100000, adam_w=1
gpu-q-13: [2023-02-22 14:30:17,247] [INFO] [logging.py:68:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
gpu-q-13: [2023-02-22 14:30:17,311] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
gpu-q-13: [2023-02-22 14:30:17,311] [INFO] [utils.py:52:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
gpu-q-13: [2023-02-22 14:30:17,311] [INFO] [logging.py:68:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer
gpu-q-13: Using /home/lz429/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
gpu-q-13: Using /home/lz429/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...Using /home/lz429/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
gpu-q-13: 
gpu-q-13: [2023-02-22 14:30:17,526] [INFO] [utils.py:831:see_memory_usage] Stage 3 initialize beginning
gpu-q-13: [2023-02-22 14:30:17,527] [INFO] [utils.py:832:see_memory_usage] MA 55.83 GB         Max_MA 55.83 GB         CA 55.83 GB         Max_CA 56 GB 
gpu-q-13: [2023-02-22 14:30:17,527] [INFO] [utils.py:840:see_memory_usage] CPU Virtual Memory:  used = 35.36 GB, percent = 3.5%
gpu-q-13: [2023-02-22 14:30:17,530] [INFO] [stage3.py:114:__init__] Reduce bucket size 51380224
gpu-q-13: [2023-02-22 14:30:17,530] [INFO] [stage3.py:115:__init__] Prefetch bucket size 46242201
gpu-q-13: Using /home/lz429/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
gpu-q-13: Emitting ninja build file /home/lz429/.cache/torch_extensions/py38_cu117/utils/build.ninja...
gpu-q-13: Building extension module utils...
gpu-q-13: Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
gpu-q-13: ninja: no work to do.
gpu-q-13: Loading extension module utils...
gpu-q-13: Time to load utils op: 0.1780397891998291 seconds
gpu-q-13: Loading extension module utils...
gpu-q-13: Time to load utils op: 0.10374855995178223 seconds
gpu-q-13: Loading extension module utils...
gpu-q-13: Loading extension module utils...
gpu-q-13: Time to load utils op: 0.22209572792053223 seconds
gpu-q-13: Time to load utils op: 0.22223544120788574 seconds
gpu-q-13: [2023-02-22 14:30:17,692] [INFO] [utils.py:831:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
gpu-q-13: [2023-02-22 14:30:17,692] [INFO] [utils.py:832:see_memory_usage] MA 55.83 GB         Max_MA 55.83 GB         CA 55.83 GB         Max_CA 56 GB 
gpu-q-13: [2023-02-22 14:30:17,693] [INFO] [utils.py:840:see_memory_usage] CPU Virtual Memory:  used = 35.75 GB, percent = 3.5%
gpu-q-49: Adam Optimizer #0 is created with AVX2 arithmetic capability.
gpu-q-49: Config: alpha=0.000080, betas=(0.900000, 0.950000), weight_decay=0.100000, adam_w=1
gpu-q-49: Using /home/lz429/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
gpu-q-49: Using /home/lz429/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
gpu-q-49: Using /home/lz429/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
gpu-q-49: Using /home/lz429/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
gpu-q-49: Emitting ninja build file /home/lz429/.cache/torch_extensions/py38_cu117/utils/build.ninja...
gpu-q-49: Building extension module utils...
gpu-q-49: Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
gpu-q-49: ninja: no work to do.
gpu-q-49: Loading extension module utils...
gpu-q-49: Time to load utils op: 0.18081092834472656 seconds
gpu-q-49: Loading extension module utils...
gpu-q-49: Time to load utils op: 0.20454859733581543 seconds
gpu-q-49: Loading extension module utils...
gpu-q-49: Time to load utils op: 0.20326709747314453 seconds
gpu-q-49: Loading extension module utils...
gpu-q-49: Time to load utils op: 0.2035524845123291 seconds
gpu-q-13: Parameter Offload: Total persistent parameters: 4487168 in 482 params
gpu-q-13: [2023-02-22 14:30:21,079] [INFO] [utils.py:831:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
gpu-q-13: [2023-02-22 14:30:21,080] [INFO] [utils.py:832:see_memory_usage] MA 0.0 GB         Max_MA 55.83 GB         CA 55.83 GB         Max_CA 56 GB 
gpu-q-13: [2023-02-22 14:30:21,080] [INFO] [utils.py:840:see_memory_usage] CPU Virtual Memory:  used = 72.17 GB, percent = 7.2%
gpu-q-13: [2023-02-22 14:30:21,151] [INFO] [utils.py:831:see_memory_usage] Before creating fp16 partitions
gpu-q-13: [2023-02-22 14:30:21,151] [INFO] [utils.py:832:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 55.83 GB         Max_CA 56 GB 
gpu-q-13: [2023-02-22 14:30:21,152] [INFO] [utils.py:840:see_memory_usage] CPU Virtual Memory:  used = 72.56 GB, percent = 7.2%
gpu-q-13: [2023-02-22 14:30:25,938] [INFO] [utils.py:831:see_memory_usage] After creating fp16 partitions: 4
gpu-q-13: [2023-02-22 14:30:25,939] [INFO] [utils.py:832:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 55.83 GB         Max_CA 56 GB 
gpu-q-13: [2023-02-22 14:30:25,939] [INFO] [utils.py:840:see_memory_usage] CPU Virtual Memory:  used = 107.67 GB, percent = 10.7%
gpu-q-13: [2023-02-22 14:30:26,004] [INFO] [utils.py:831:see_memory_usage] Before creating fp32 partitions
gpu-q-13: [2023-02-22 14:30:26,005] [INFO] [utils.py:832:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 55.83 GB         Max_CA 56 GB 
gpu-q-13: [2023-02-22 14:30:26,005] [INFO] [utils.py:840:see_memory_usage] CPU Virtual Memory:  used = 109.26 GB, percent = 10.8%
gpu-q-13: [2023-02-22 14:30:28,440] [INFO] [utils.py:831:see_memory_usage] After creating fp32 partitions
gpu-q-13: [2023-02-22 14:30:28,441] [INFO] [utils.py:832:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 55.83 GB         Max_CA 56 GB 
gpu-q-13: [2023-02-22 14:30:28,441] [INFO] [utils.py:840:see_memory_usage] CPU Virtual Memory:  used = 161.77 GB, percent = 16.1%
gpu-q-13: [2023-02-22 14:30:28,734] [INFO] [utils.py:831:see_memory_usage] Before initializing optimizer states
gpu-q-13: [2023-02-22 14:30:28,735] [INFO] [utils.py:832:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 55.83 GB         Max_CA 56 GB 
gpu-q-13: [2023-02-22 14:30:28,735] [INFO] [utils.py:840:see_memory_usage] CPU Virtual Memory:  used = 163.11 GB, percent = 16.2%
gpu-q-13: [2023-02-22 14:30:41,927] [INFO] [utils.py:831:see_memory_usage] After initializing optimizer states
gpu-q-13: [2023-02-22 14:30:41,928] [INFO] [utils.py:832:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 55.83 GB         Max_CA 56 GB 
gpu-q-13: [2023-02-22 14:30:41,928] [INFO] [utils.py:840:see_memory_usage] CPU Virtual Memory:  used = 338.98 GB, percent = 33.7%
gpu-q-13: [2023-02-22 14:30:41,936] [INFO] [stage3.py:382:_setup_for_real_optimizer] optimizer state initialized
gpu-q-49: Using /home/lz429/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
gpu-q-49: Using /home/lz429/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
gpu-q-13: Using /home/lz429/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
gpu-q-13: Using /home/lz429/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
gpu-q-49: No modifications detected for re-loaded extension module utils, skipping build step...
gpu-q-49: Loading extension module utils...
gpu-q-13: Using /home/lz429/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
gpu-q-49: Time to load utils op: 0.0017540454864501953 seconds
gpu-q-49: No modifications detected for re-loaded extension module utils, skipping build step...
gpu-q-49: Loading extension module utils...
gpu-q-13: No modifications detected for re-loaded extension module utils, skipping build step...
gpu-q-13: Loading extension module utils...
gpu-q-49: Time to load utils op: 0.0019550323486328125 seconds
gpu-q-13: No modifications detected for re-loaded extension module utils, skipping build step...
gpu-q-13: Loading extension module utils...
gpu-q-49: Using /home/lz429/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
gpu-q-13: Time to load utils op: 0.0020689964294433594 seconds
gpu-q-49: Using /home/lz429/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
gpu-q-13: Time to load utils op: 0.0014569759368896484 seconds
gpu-q-49: ***** Running training *****
gpu-q-49:   Num examples = 825
gpu-q-49:   Num Epochs = 10
gpu-q-13: No modifications detected for re-loaded extension module utils, skipping build step...
gpu-q-13: Loading extension module utils...
gpu-q-49:   Instantaneous batch size per device = 1
gpu-q-49:   Total train batch size (w. parallel, distributed & accumulation) = 32
gpu-q-49:   Gradient Accumulation steps = 4
gpu-q-49:   Total optimization steps = 260
gpu-q-13: Time to load utils op: 0.0016238689422607422 seconds
gpu-q-49: No modifications detected for re-loaded extension module utils, skipping build step...
gpu-q-49: Loading extension module utils...
gpu-q-49: No modifications detected for re-loaded extension module utils, skipping build step...
gpu-q-49: Loading extension module utils...
gpu-q-49: Time to load utils op: 0.0014286041259765625 seconds
gpu-q-49: Time to load utils op: 0.00141143798828125 seconds
gpu-q-49:   Number of trainable parameters = 0
gpu-q-13: [2023-02-22 14:30:50,119] [INFO] [utils.py:831:see_memory_usage] After initializing ZeRO optimizer
gpu-q-13: [2023-02-22 14:30:50,121] [INFO] [utils.py:832:see_memory_usage] MA 0.1 GB         Max_MA 1.43 GB         CA 57.17 GB         Max_CA 57 GB 
gpu-q-13: [2023-02-22 14:30:50,121] [INFO] [utils.py:840:see_memory_usage] CPU Virtual Memory:  used = 371.62 GB, percent = 36.9%
gpu-q-13: [2023-02-22 14:30:50,121] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
gpu-q-13: [2023-02-22 14:30:50,121] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = WarmupLR
gpu-q-13: [2023-02-22 14:30:50,121] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupLR object at 0x146d5a85afd0>
gpu-q-13: [2023-02-22 14:30:50,121] [INFO] [logging.py:68:log_dist] [Rank 0] step=0, skipped=0, lr=[8e-05], mom=[[0.9, 0.95]]
gpu-q-13: [2023-02-22 14:30:50,122] [INFO] [config.py:1008:print] DeepSpeedEngine configuration:
gpu-q-13: [2023-02-22 14:30:50,122] [INFO] [config.py:1012:print]   activation_checkpointing_config  {
gpu-q-13:     "partition_activations": false, 
gpu-q-13:     "contiguous_memory_optimization": false, 
gpu-q-13:     "cpu_checkpointing": false, 
gpu-q-13:     "number_checkpoints": null, 
gpu-q-13:     "synchronize_checkpoint_boundary": false, 
gpu-q-13:     "profile": false
gpu-q-13: }
gpu-q-13: [2023-02-22 14:30:50,122] [INFO] [config.py:1012:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
gpu-q-13: [2023-02-22 14:30:50,122] [INFO] [config.py:1012:print]   amp_enabled .................. False
gpu-q-13: [2023-02-22 14:30:50,122] [INFO] [config.py:1012:print]   amp_params ................... False
gpu-q-13: [2023-02-22 14:30:50,122] [INFO] [config.py:1012:print]   autotuning_config ............ {
gpu-q-13:     "enabled": false, 
gpu-q-13:     "start_step": null, 
gpu-q-13:     "end_step": null, 
gpu-q-13:     "metric_path": null, 
gpu-q-13:     "arg_mappings": null, 
gpu-q-13:     "metric": "throughput", 
gpu-q-13:     "model_info": null, 
gpu-q-13:     "results_dir": "autotuning_results", 
gpu-q-13:     "exps_dir": "autotuning_exps", 
gpu-q-13:     "overwrite": true, 
gpu-q-13:     "fast": true, 
gpu-q-13:     "start_profile_step": 3, 
gpu-q-13:     "end_profile_step": 5, 
gpu-q-13:     "tuner_type": "gridsearch", 
gpu-q-13:     "tuner_early_stopping": 5, 
gpu-q-13:     "tuner_num_trials": 50, 
gpu-q-13:     "model_info_path": null, 
gpu-q-13:     "mp_size": 1, 
gpu-q-13:     "max_train_batch_size": null, 
gpu-q-13:     "min_train_batch_size": 1, 
gpu-q-13:     "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
gpu-q-13:     "min_train_micro_batch_size_per_gpu": 1, 
gpu-q-13:     "num_tuning_micro_batch_sizes": 3
gpu-q-13: }
gpu-q-13: [2023-02-22 14:30:50,122] [INFO] [config.py:1012:print]   bfloat16_enabled ............. False
gpu-q-13: [2023-02-22 14:30:50,122] [INFO] [config.py:1012:print]   checkpoint_parallel_write_pipeline  False
gpu-q-13: [2023-02-22 14:30:50,122] [INFO] [config.py:1012:print]   checkpoint_tag_validation_enabled  True
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   checkpoint_tag_validation_fail  False
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x1478d00c3790>
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   communication_data_type ...... None
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   curriculum_enabled_legacy .... False
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   curriculum_params_legacy ..... False
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   data_efficiency_enabled ...... False
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   dataloader_drop_last ......... False
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   disable_allgather ............ False
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   dump_state ................... False
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   eigenvalue_enabled ........... False
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   eigenvalue_gas_boundary_resolution  1
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   eigenvalue_layer_name ........ bert.encoder.layer
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   eigenvalue_layer_num ......... 0
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   eigenvalue_max_iter .......... 100
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   eigenvalue_stability ......... 1e-06
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   eigenvalue_tol ............... 0.01
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   eigenvalue_verbose ........... False
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   elasticity_enabled ........... False
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   flops_profiler_config ........ {
gpu-q-13:     "enabled": false, 
gpu-q-13:     "profile_step": 1, 
gpu-q-13:     "module_depth": -1, 
gpu-q-13:     "top_modules": 1, 
gpu-q-13:     "detailed": true, 
gpu-q-13:     "output_file": null
gpu-q-13: }
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   fp16_auto_cast ............... False
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   fp16_enabled ................. True
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   fp16_master_weights_and_gradients  False
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   global_rank .................. 0
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   grad_accum_dtype ............. None
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   gradient_accumulation_steps .. 4
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   gradient_clipping ............ 1.0
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   gradient_predivide_factor .... 1.0
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   initial_dynamic_scale ........ 65536
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   load_universal_checkpoint .... False
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   loss_scale ................... 0
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   memory_breakdown ............. False
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   monitor_config ............... <deepspeed.monitor.config.DeepSpeedMonitorConfig object at 0x1478d00c3ac0>
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   nebula_config ................ {
gpu-q-13:     "enabled": false, 
gpu-q-13:     "persistent_storage_path": null, 
gpu-q-13:     "persistent_time_interval": 100, 
gpu-q-13:     "num_of_version_in_retention": 2, 
gpu-q-13:     "enable_nebula_load": true, 
gpu-q-13:     "load_path": null
gpu-q-13: }
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   optimizer_legacy_fusion ...... False
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   optimizer_name ............... adamw
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   optimizer_params ............. {'lr': 8e-05, 'betas': [0.9, 0.95], 'eps': 1e-08, 'weight_decay': 0.1}
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   pld_enabled .................. False
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   pld_params ................... False
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   prescale_gradients ........... False
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   scheduler_name ............... WarmupLR
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 8e-05, 'warmup_num_steps': 60}
gpu-q-13: [2023-02-22 14:30:50,123] [INFO] [config.py:1012:print]   sparse_attention ............. None
gpu-q-13: [2023-02-22 14:30:50,124] [INFO] [config.py:1012:print]   sparse_gradients_enabled ..... False
gpu-q-13: [2023-02-22 14:30:50,124] [INFO] [config.py:1012:print]   steps_per_print .............. 10
gpu-q-13: [2023-02-22 14:30:50,124] [INFO] [config.py:1012:print]   train_batch_size ............. 32
gpu-q-13: [2023-02-22 14:30:50,124] [INFO] [config.py:1012:print]   train_micro_batch_size_per_gpu  1
gpu-q-13: [2023-02-22 14:30:50,124] [INFO] [config.py:1012:print]   use_node_local_storage ....... False
gpu-q-13: [2023-02-22 14:30:50,124] [INFO] [config.py:1012:print]   wall_clock_breakdown ......... False
gpu-q-13: [2023-02-22 14:30:50,124] [INFO] [config.py:1012:print]   world_size ................... 8
gpu-q-13: [2023-02-22 14:30:50,124] [INFO] [config.py:1012:print]   zero_allow_untested_optimizer  False
gpu-q-13: [2023-02-22 14:30:50,126] [INFO] [config.py:1012:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=51380224 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=46242201 param_persistence_threshold=71680 model_persistence_threshold=sys.maxsize max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=True stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False
gpu-q-13: [2023-02-22 14:30:50,126] [INFO] [config.py:1012:print]   zero_enabled ................. True
gpu-q-13: [2023-02-22 14:30:50,126] [INFO] [config.py:1012:print]   zero_optimization_stage ...... 3
gpu-q-13: [2023-02-22 14:30:50,126] [INFO] [config.py:997:print_user_config]   json = {
gpu-q-13:     "train_micro_batch_size_per_gpu": 1, 
gpu-q-13:     "gradient_accumulation_steps": 4, 
gpu-q-13:     "gradient_clipping": 1.0, 
gpu-q-13:     "optimizer": {
gpu-q-13:         "type": "AdamW", 
gpu-q-13:         "params": {
gpu-q-13:             "lr": 8e-05, 
gpu-q-13:             "betas": [0.9, 0.95], 
gpu-q-13:             "eps": 1e-08, 
gpu-q-13:             "weight_decay": 0.1
gpu-q-13:         }
gpu-q-13:     }, 
gpu-q-13:     "scheduler": {
gpu-q-13:         "type": "WarmupLR", 
gpu-q-13:         "params": {
gpu-q-13:             "warmup_min_lr": 0, 
gpu-q-13:             "warmup_max_lr": 8e-05, 
gpu-q-13:             "warmup_num_steps": 60
gpu-q-13:         }
gpu-q-13:     }, 
gpu-q-13:     "fp16": {
gpu-q-13:         "enabled": true, 
gpu-q-13:         "auto_cast": false, 
gpu-q-13:         "loss_scale": 0, 
gpu-q-13:         "initial_scale_power": 16, 
gpu-q-13:         "loss_scale_window": 1000, 
gpu-q-13:         "hysteresis": 2, 
gpu-q-13:         "min_loss_scale": 1
gpu-q-13:     }, 
gpu-q-13:     "zero_optimization": {
gpu-q-13:         "stage": 3, 
gpu-q-13:         "offload_optimizer": {
gpu-q-13:             "device": "cpu", 
gpu-q-13:             "pin_memory": true
gpu-q-13:         }, 
gpu-q-13:         "offload_param": {
gpu-q-13:             "device": "cpu", 
gpu-q-13:             "pin_memory": true
gpu-q-13:         }, 
gpu-q-13:         "overlap_comm": true, 
gpu-q-13:         "contiguous_gradients": true, 
gpu-q-13:         "sub_group_size": 1.000000e+09, 
gpu-q-13:         "reduce_bucket_size": 5.138022e+07, 
gpu-q-13:         "stage3_prefetch_bucket_size": 4.624220e+07, 
gpu-q-13:         "stage3_param_persistence_threshold": 7.168000e+04, 
gpu-q-13:         "stage3_max_live_parameters": 1.000000e+09, 
gpu-q-13:         "stage3_max_reuse_distance": 1.000000e+09, 
gpu-q-13:         "stage3_gather_16bit_weights_on_model_save": true
gpu-q-13:     }
gpu-q-13: }
gpu-q-13: Using /home/lz429/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
gpu-q-13: No modifications detected for re-loaded extension module utils, skipping build step...
gpu-q-13: Loading extension module utils...
gpu-q-13: Time to load utils op: 0.0010693073272705078 seconds
gpu-q-13: ***** Running training *****
gpu-q-13:   Num examples = 825
gpu-q-13:   Num Epochs = 10
gpu-q-13:   Instantaneous batch size per device = 1
gpu-q-13:   Total train batch size (w. parallel, distributed & accumulation) = 32
gpu-q-13:   Gradient Accumulation steps = 4
gpu-q-13:   Total optimization steps = 260
gpu-q-13:   Number of trainable parameters = 0
gpu-q-49: 
  0%|          | 0/260 [00:00<?, ?it/s]/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
gpu-q-49:   warnings.warn(
gpu-q-49: /rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
gpu-q-49:   warnings.warn(
gpu-q-49: /rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
gpu-q-49:   warnings.warn(
gpu-q-49: /rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
gpu-q-49:   warnings.warn(
gpu-q-13: 
  0%|          | 0/260 [00:00<?, ?it/s]/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
gpu-q-13:   warnings.warn(
gpu-q-13: /rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
gpu-q-13:   warnings.warn(
gpu-q-13: /rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
gpu-q-13:   warnings.warn(
gpu-q-13: /rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
gpu-q-13:   warnings.warn(
gpu-q-49: Traceback (most recent call last):
gpu-q-49: Traceback (most recent call last):
gpu-q-49:   File "train.py", line 79, in <module>
gpu-q-49:   File "train.py", line 79, in <module>
gpu-q-49: Traceback (most recent call last):
gpu-q-49:   File "train.py", line 79, in <module>
gpu-q-49: Traceback (most recent call last):
gpu-q-49:   File "train.py", line 79, in <module>
gpu-q-49:         trainer.train()trainer.train()
gpu-q-49: 
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/trainer.py", line 1527, in train
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/trainer.py", line 1527, in train
gpu-q-49:     trainer.train()
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/trainer.py", line 1527, in train
gpu-q-49:     trainer.train()
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/trainer.py", line 1527, in train
gpu-q-49:     return inner_training_loop(
gpu-q-49:         return inner_training_loop(return inner_training_loop(
gpu-q-49: 
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/trainer.py", line 1775, in _inner_training_loop
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/trainer.py", line 1775, in _inner_training_loop
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/trainer.py", line 1775, in _inner_training_loop
gpu-q-49:     return inner_training_loop(
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/trainer.py", line 1775, in _inner_training_loop
gpu-q-49:     tr_loss_step = self.training_step(model, inputs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/trainer.py", line 2523, in training_step
gpu-q-49:     tr_loss_step = self.training_step(model, inputs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/trainer.py", line 2523, in training_step
gpu-q-49:     tr_loss_step = self.training_step(model, inputs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/trainer.py", line 2523, in training_step
gpu-q-49:     tr_loss_step = self.training_step(model, inputs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/trainer.py", line 2523, in training_step
gpu-q-49:     loss = self.compute_loss(model, inputs)
gpu-q-49:       File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/trainer.py", line 2555, in compute_loss
gpu-q-49: loss = self.compute_loss(model, inputs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/trainer.py", line 2555, in compute_loss
gpu-q-49:     loss = self.compute_loss(model, inputs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/trainer.py", line 2555, in compute_loss
gpu-q-49:     loss = self.compute_loss(model, inputs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/trainer.py", line 2555, in compute_loss
gpu-q-49:     outputs = model(**inputs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
gpu-q-49:     outputs = model(**inputs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
gpu-q-49:     outputs = model(**inputs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
gpu-q-49:     outputs = model(**inputs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
gpu-q-49:     return forward_call(*input, **kwargs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
gpu-q-49:     return forward_call(*input, **kwargs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
gpu-q-49:     return forward_call(*input, **kwargs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
gpu-q-49:     return forward_call(*input, **kwargs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
gpu-q-49:     return func(*args, **kwargs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/deepspeed/runtime/engine.py", line 1836, in forward
gpu-q-49:         return func(*args, **kwargs)return func(*args, **kwargs)
gpu-q-49: 
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/deepspeed/runtime/engine.py", line 1836, in forward
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/deepspeed/runtime/engine.py", line 1836, in forward
gpu-q-49:     return func(*args, **kwargs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/deepspeed/runtime/engine.py", line 1836, in forward
gpu-q-49:     loss = self.module(*inputs, **kwargs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
gpu-q-49:     loss = self.module(*inputs, **kwargs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
gpu-q-49:     loss = self.module(*inputs, **kwargs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
gpu-q-49:     loss = self.module(*inputs, **kwargs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
gpu-q-49:     result = forward_call(*input, **kwargs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 934, in forward
gpu-q-49:     result = forward_call(*input, **kwargs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 934, in forward
gpu-q-49:     result = forward_call(*input, **kwargs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 934, in forward
gpu-q-49:     result = forward_call(*input, **kwargs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 934, in forward
gpu-q-49:     outputs = self.model.decoder(
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
gpu-q-49:     outputs = self.model.decoder(
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
gpu-q-49:     outputs = self.model.decoder(
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
gpu-q-49:     outputs = self.model.decoder(
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
gpu-q-49:     result = forward_call(*input, **kwargs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 698, in forward
gpu-q-49:     result = forward_call(*input, **kwargs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 698, in forward
gpu-q-49:     result = forward_call(*input, **kwargs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 698, in forward
gpu-q-49:     result = forward_call(*input, **kwargs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 698, in forward
gpu-q-49:     layer_outputs = decoder_layer(
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
gpu-q-49:         layer_outputs = decoder_layer(layer_outputs = decoder_layer(
gpu-q-49: 
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
gpu-q-49:         layer_outputs = decoder_layer(
gpu-q-49: result = forward_call(*input, **kwargs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 327, in forward
gpu-q-49:     result = forward_call(*input, **kwargs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 327, in forward
gpu-q-49:     result = forward_call(*input, **kwargs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 327, in forward
gpu-q-49:     result = forward_call(*input, **kwargs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 327, in forward
gpu-q-49:     hidden_states, self_attn_weights, present_key_value = self.self_attn(
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
gpu-q-49:         hidden_states, self_attn_weights, present_key_value = self.self_attn(hidden_states, self_attn_weights, present_key_value = self.self_attn(
gpu-q-49: 
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
gpu-q-49:     hidden_states, self_attn_weights, present_key_value = self.self_attn(
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
gpu-q-49:     result = forward_call(*input, **kwargs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 228, in forward
gpu-q-49:     result = forward_call(*input, **kwargs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 228, in forward
gpu-q-49:     result = forward_call(*input, **kwargs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 228, in forward
gpu-q-49:     result = forward_call(*input, **kwargs)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 228, in forward
gpu-q-49:     attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(torch.float16)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/nn/functional.py", line 1843, in softmax
gpu-q-49:         attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(torch.float16)attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(torch.float16)
gpu-q-49: 
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/nn/functional.py", line 1843, in softmax
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/nn/functional.py", line 1843, in softmax
gpu-q-49:     attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(torch.float16)
gpu-q-49:   File "/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/lib64/python3.8/site-packages/torch/nn/functional.py", line 1843, in softmax
gpu-q-49:     ret = input.softmax(dim, dtype=dtype)
gpu-q-49:     ret = input.softmax(dim, dtype=dtype)
gpu-q-49: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 896.00 MiB (GPU 1; 79.17 GiB total capacity; 76.68 GiB already allocated; 403.31 MiB free; 77.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
gpu-q-49: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 896.00 MiB (GPU 2; 79.17 GiB total capacity; 76.68 GiB already allocated; 405.31 MiB free; 77.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
gpu-q-49:     ret = input.softmax(dim, dtype=dtype)
gpu-q-49: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 896.00 MiB (GPU 3; 79.17 GiB total capacity; 76.68 GiB already allocated; 405.31 MiB free; 77.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
gpu-q-49:     ret = input.softmax(dim, dtype=dtype)
gpu-q-49: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 896.00 MiB (GPU 0; 79.17 GiB total capacity; 76.68 GiB already allocated; 403.31 MiB free; 77.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

pdsh@gpu-q-13: gpu-q-49: ssh exited with exit code 1
gpu-q-13: [2023-02-22 14:31:09,894] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3592124
gpu-q-13: [2023-02-22 14:31:09,921] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3592125
gpu-q-13: [2023-02-22 14:31:09,921] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3592126
gpu-q-13: [2023-02-22 14:31:09,943] [ERROR] [launch.py:324:sigkill_handler] ['/rds/project/rds-lSmP1cwRttU/lz429/venv_transformers/bin/python3.8', '-u', 'train.py', '--local_rank=3'] exits with return code = 1
pdsh@gpu-q-13: gpu-q-13: ssh exited with exit code 1

microsoft / DeepSpeed

OOM when using Deepspeed ZERO-3 to train a galactica 30b model. #2899