Closed: qwerfdsadad closed this issue 4 weeks ago
Did you use a trainer from `transformers.trainer`? If so, you can add the following callback in your trainer file:

```python
from transformers import TrainerCallback, TrainingArguments, TrainerState, TrainerControl


class GlobalStepUpdaterCallback(TrainerCallback):
    def on_step_end(self, args: TrainingArguments, state: TrainerState,
                    control: TrainerControl, **kwargs):
        # Push the trainer's global step into the model after every optimizer step.
        model = kwargs.get('model', None)
        if model and hasattr(model, "get_global_step"):
            model.get_global_step(state.global_step)
        return control
```

and add `get_global_step()` to your model as a method:

```python
def get_global_step(self, global_step):
    self.global_step = global_step
```

so that you can read the current step via `self.global_step` during training.
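For completeness, here is a minimal sketch of how such a callback could be registered with a `Trainer`. The model, dataset, and argument values (`my_model`, `my_train_dataset`, the output directory) are placeholders, not from the original report:

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(output_dir="out", num_train_epochs=10)
trainer = Trainer(
    model=my_model,                    # placeholder: a model that defines get_global_step()
    args=training_args,
    train_dataset=my_train_dataset,    # placeholder: any dataset the Trainer can consume
    callbacks=[GlobalStepUpdaterCallback()],  # register the callback defined above
)
trainer.train()
```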
I just use a simple model to test the functionality of DeepSpeed, without using transformers.
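In a plain DeepSpeed loop without the transformers `Trainer`, the engine itself tracks the optimizer step count. Below is a minimal sketch with a toy model and synthetic data (none of the values come from the original report); it mirrors the reported ZeRO-1 setup, though depending on the DeepSpeed version you may also need fp16/bf16 enabled for ZeRO, and the `global_steps` attribute is worth verifying against your version:

```python
import torch
import deepspeed

# Toy stand-ins for the real UNet and dataset.
model = torch.nn.Linear(10, 1)
dataset = torch.utils.data.TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
loader = torch.utils.data.DataLoader(dataset, batch_size=8)

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "zero_optimization": {"stage": 1},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

# Assumes the script is started with the deepspeed launcher.
engine, _, _, _ = deepspeed.initialize(model=model,
                                       model_parameters=model.parameters(),
                                       config=ds_config)

for epoch in range(10):
    for x, y in loader:
        x, y = x.to(engine.device), y.to(engine.device)
        loss = torch.nn.functional.mse_loss(engine(x), y)
        engine.backward(loss)
        engine.step()
    # global_steps is the engine's internal optimizer-step counter
    # (verify the attribute name for your DeepSpeed version).
    print(f"epoch {epoch}: global step = {engine.global_steps}")
```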
I'm unclear on the question. Do you just want to limit the number of printouts displayed for each epoch?
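If the goal is just to reduce how often that `timer.py` throughput line appears, my understanding is that its frequency is governed by the `steps_per_print` key in the DeepSpeed config (default 10 steps), so raising it should thin out the log. A sketch of the config fragment, with 2000 as an arbitrary example value:

```python
# DeepSpeed config fragment (sketch): the same key works in a JSON config file.
ds_config_update = {
    "steps_per_print": 2000,  # print the progress/throughput report every 2000 steps
}
```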
**Describe the bug**
I use ZeRO stage 1 to train a UNet network with the following deepspeed_config configuration. I train for 10 epochs, and the output during training is as follows:
This one reports a lot of lines like:

```
[2024-07-17 19:53:24,981] [INFO] [timer.py:258:stop] epoch=0/micro_step=4500/global_step=4500, RunningAvgSamplesPerSec=16.124955167964504, CurrSamplesPerSec=16.807402094161112, MemAllocated=0.02GB, MaxMemAllocated=0.97GB
```
It looks like the Adam optimizer runs 4500 steps per epoch, which makes each epoch take a long time to train. Why is there so much output?
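As a side note on the 4500 steps: the number of optimizer steps per epoch comes from the dataset size divided by the effective batch size (micro batch per GPU × number of GPUs × gradient accumulation steps), not from Adam itself. A rough illustration with made-up numbers, since the actual dataset and batch sizes aren't shown in the report:

```python
import math

# All values below are hypothetical.
num_samples = 36000          # training samples per epoch
micro_batch_per_gpu = 8      # train_micro_batch_size_per_gpu
num_gpus = 1
grad_accum_steps = 1         # micro_step == global_step in the log, so likely 1

effective_batch = micro_batch_per_gpu * num_gpus * grad_accum_steps
steps_per_epoch = math.ceil(num_samples / effective_batch)
print(steps_per_epoch)       # -> 4500 with these made-up numbers
```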
**ds_report output**

**System info (please complete the following information):**