microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

how to set "training_step" during training? #5779

Closed qwerfdsadad closed 4 weeks ago

qwerfdsadad commented 3 months ago

Describe the bug
I use ZeRO stage 1 to train a UNet network with the following DeepSpeed config. I set 10 epochs, and the output during training is as follows:

{
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "local_rank": 0,
    "steps_per_print": 500,
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 0, "betas": [0.9, 0.98], "eps": 1e-9, "weight_decay": 3e-7}
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "total_num_steps": 4000,
            "warmup_min_lr": 0.00001,
            "warmup_max_lr": 0.01,
            "warmup_num_steps": 1000,
            "warmup_type": "linear",
            "last_batch_iteration": -1
        }
    },
    "bf16": {"enabled": false},
    "fp16": {
        "enabled": true,
        "auto_cast": false,
        "loss_scale": 0,
        "initial_scale_power": 16,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "consecutive_hysteresis": false,
        "min_loss_scale": 1
    },
    "zero_optimization": {
        "stage": 1,
        "reduce_bucket_size": 5e8,
        "allgather_bucket_size": 5e8,
        "reduce_scatter": true
    },
    "logging": {"log_level": "INFO"}
}
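
(For context, this is roughly how a config dict like this gets wired into training; a minimal sketch where unet, train_loader, criterion, and ds_config are placeholder names, not my actual code:)

import deepspeed

# Sketch only: unet, train_loader, and criterion are placeholders;
# ds_config is the dict shown above.
engine, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=unet,
    model_parameters=unet.parameters(),
    config=ds_config,
)

for epoch in range(10):
    for batch, target in train_loader:
        # fp16 is enabled in the config, so inputs are cast to half precision here
        batch = batch.to(engine.device).half()
        target = target.to(engine.device)
        loss = criterion(engine(batch), target)
        engine.backward(loss)  # handles fp16 loss scaling internally
        engine.step()          # optimizer step + WarmupDecayLR scheduler step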

output

[2024-07-17 19:44:06,829] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2024-07-17 19:44:06,945] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
[2024-07-17 19:44:07,064] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768, reducing to 16384
[2024-07-17 19:44:07,184] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16384, reducing to 8192
[2024-07-17 19:44:07,303] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8192, reducing to 4096
[2024-07-17 19:44:07,540] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4096, reducing to 2048
[2024-07-17 19:44:09,300] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2048, reducing to 1024
[2024-07-17 19:44:19,048] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024, reducing to 512
[2024-07-17 19:44:50,382] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 512, reducing to 256
[2024-07-17 19:45:07,165] [INFO] [logging.py:96:log_dist] [Rank 0] step=500, skipped=9, lr=[0.0049051], mom=[[0.9, 0.98]]
[2024-07-17 19:45:07,166] [INFO] [timer.py:258:stop] epoch=0/micro_step=500/global_step=500, RunningAvgSamplesPerSec=16.543493968672824, CurrSamplesPerSec=17.400610263292727, MemAllocated=0.02GB, MaxMemAllocated=0.97GB
[2024-07-17 19:46:07,521] [INFO] [logging.py:96:log_dist] [Rank 0] step=1000, skipped=9, lr=[0.0099001], mom=[[0.9, 0.98]]
[2024-07-17 19:46:07,522] [INFO] [timer.py:258:stop] epoch=0/micro_step=1000/global_step=1000, RunningAvgSamplesPerSec=16.558021050068817, CurrSamplesPerSec=17.281347468346606, MemAllocated=0.02GB, MaxMemAllocated=0.97GB
[2024-07-17 19:47:09,909] [INFO] [logging.py:96:log_dist] [Rank 0] step=1500, skipped=9, lr=[0.0083683], mom=[[0.9, 0.98]]
[2024-07-17 19:47:09,910] [INFO] [timer.py:258:stop] epoch=0/micro_step=1500/global_step=1500, RunningAvgSamplesPerSec=16.37900488784748, CurrSamplesPerSec=15.552286788003284, MemAllocated=0.02GB, MaxMemAllocated=0.97GB
[2024-07-17 19:48:14,992] [INFO] [logging.py:96:log_dist] [Rank 0] step=2000, skipped=9, lr=[0.006703300000000001], mom=[[0.9, 0.98]]
[2024-07-17 19:48:14,993] [INFO] [timer.py:258:stop] epoch=0/micro_step=2000/global_step=2000, RunningAvgSamplesPerSec=16.113892734914483, CurrSamplesPerSec=14.6544845971357, MemAllocated=0.02GB, MaxMemAllocated=0.97GB
[2024-07-17 19:49:19,610] [INFO] [logging.py:96:log_dist] [Rank 0] step=2500, skipped=9, lr=[0.0050383], mom=[[0.9, 0.98]]
[2024-07-17 19:49:19,610] [INFO] [timer.py:258:stop] epoch=0/micro_step=2500/global_step=2500, RunningAvgSamplesPerSec=15.982786991454645, CurrSamplesPerSec=16.03645984675853, MemAllocated=0.02GB, MaxMemAllocated=0.97GB
[2024-07-17 19:49:29,863] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024, but hysteresis is 2. Reducing hysteresis to 1
[2024-07-17 19:50:22,010] [INFO] [logging.py:96:log_dist] [Rank 0] step=3000, skipped=10, lr=[0.0033766300000000003], mom=[[0.9, 0.98]]
[2024-07-17 19:50:22,011] [INFO] [timer.py:258:stop] epoch=0/micro_step=3000/global_step=3000, RunningAvgSamplesPerSec=15.99056605536519, CurrSamplesPerSec=15.759469462135302, MemAllocated=0.02GB, MaxMemAllocated=0.97GB
[2024-07-17 19:51:23,297] [INFO] [logging.py:96:log_dist] [Rank 0] step=3500, skipped=10, lr=[0.0017116300000000002], mom=[[0.9, 0.98]]
[2024-07-17 19:51:23,298] [INFO] [timer.py:258:stop] epoch=0/micro_step=3500/global_step=3500, RunningAvgSamplesPerSec=16.036943131294425, CurrSamplesPerSec=17.25923182644907, MemAllocated=0.02GB, MaxMemAllocated=0.97GB
[2024-07-17 19:51:35,403] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2048, but hysteresis is 2. Reducing hysteresis to 1
[2024-07-17 19:51:54,774] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2048, reducing to 1024
[2024-07-17 19:52:23,846] [INFO] [logging.py:96:log_dist] [Rank 0] step=4000, skipped=12, lr=[5.329e-05], mom=[[0.9, 0.98]]
[2024-07-17 19:52:23,847] [INFO] [timer.py:258:stop] epoch=0/micro_step=4000/global_step=4000, RunningAvgSamplesPerSec=16.095795852718908, CurrSamplesPerSec=17.06993117987245, MemAllocated=0.02GB, MaxMemAllocated=0.97GB
[2024-07-17 19:52:53,931] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024, reducing to 512
[2024-07-17 19:53:24,980] [INFO] [logging.py:96:log_dist] [Rank 0] step=4500, skipped=13, lr=[1e-05], mom=[[0.9, 0.98]]
[2024-07-17 19:53:24,981] [INFO] [timer.py:258:stop] epoch=0/micro_step=4500/global_step=4500, RunningAvgSamplesPerSec=16.124955167964504, CurrSamplesPerSec=16.807402094161112, MemAllocated=0.02GB, MaxMemAllocated=0.97GB
/home/lg/miniconda3/envs/jaxtest/lib/python3.9/site-packages/torch/optim/lr_scheduler.py:138: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
/home/lg/miniconda3/envs/jaxtest/lib/python3.9/site-packages/torch/optim/lr_scheduler.py:138: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
epoch: 0, loss: 1.17383, t2-t1: 561.80249, trainL2: 17024.88278, testL2: 10.81152
epoch: 0, loss: 1.17383, t2-t1: 562.04985, trainL2: 17270.87160, testL2: 10.81152

This training run prints a lot of lines like "[2024-07-17 19:53:24,981] [INFO] [timer.py:258:stop] epoch=0/micro_step=4500/global_step=4500, RunningAvgSamplesPerSec=16.124955167964504, CurrSamplesPerSec=16.807402094161112, MemAllocated=0.02GB, MaxMemAllocated=0.97GB".

It looks like the Adam optimizer runs 4,500 steps in every epoch, which makes a single epoch take a long time to train. Why is there so much output?
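
(For reference, a minimal sketch of where these numbers seem to come from, given the config above; the per-rank sample count is an assumed value for illustration, not a measured one:)

# Only steps_per_print, train_micro_batch_size_per_gpu and gradient_accumulation_steps
# come from the config above; samples_per_rank is a hypothetical value.
samples_per_rank = 4500       # assumed number of samples this rank sees per epoch
micro_batch_per_gpu = 1       # train_micro_batch_size_per_gpu
grad_accum_steps = 1          # gradient_accumulation_steps
steps_per_print = 500         # steps_per_print

optimizer_steps_per_epoch = samples_per_rank // (micro_batch_per_gpu * grad_accum_steps)
timer_lines_per_epoch = optimizer_steps_per_epoch // steps_per_print
print(optimizer_steps_per_epoch)  # 4500 optimizer steps per epoch
print(timer_lines_per_epoch)      # 9 timer/log_dist lines per epoch at steps_per_print=500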

ds_report output

[2024-07-17 19:59:25,808] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fp_quantizer ........... [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------

System info (please complete the following information):

FattyFace commented 2 months ago

Did you use a trainer from transformers.trainer? If so, you can add a callback like this in your trainer file:

class GlobalStepUpdaterCallback(TrainerCallback):
    def on_step_end(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
        model = kwargs.get('model', None)
        if model and hasattr(model, "get_global_step"):
            model.get_global_step(state.global_step)
        return control

and add get_global_step() to your model as a method:

def get_global_step(self, global_step):
    self.global_step = global_step

so that you can read the current global step through self.global_step during training.
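
For example, with a standard transformers Trainer (the variable names here are just illustrative, assuming model, training_args, and train_dataset already exist):

from transformers import Trainer

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.add_callback(GlobalStepUpdaterCallback())  # register the callback defined above
trainer.train()
# after each optimizer step, model.global_step mirrors state.global_step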

qwerfdsadad commented 2 months ago

I'm just using a simple model to test the functionality of DeepSpeed, without using the transformers Trainer.

jomayeri commented 1 month ago

I'm unclear on the question. Do you just want to limit the number of printouts displayed for each epoch?