hejing / instance_containize

any issues when using the bitdeer.ai

The terminal does not respond at all while saving a checkpoint #4

Open jianwang-ntu opened 4 months ago

jianwang-ntu commented 4 months ago

Explain the issue

The terminal connected to the BitDeer server hangs every time a checkpoint is saved. My training script saves a checkpoint every 30 minutes, and at each save the terminal freezes for almost a minute until the save completes.

I can guess two possible reasons: (1) my saving method may be memory-hungry and consume too much CPU RAM; (2) the target disk relies on network transfer instead of local file writing, so saving consumes a lot of bandwidth. [I strongly suspect the second point.]
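To test the second hypothesis, one quick check (a Linux-only sketch, not part of the repo; the helper name `fs_type_of` is mine) is to look up which filesystem backs the checkpoint directory by parsing /proc/mounts and see whether it is a network filesystem:

```python
import os

def fs_type_of(path):
    """Return the filesystem type of the mount backing `path` (Linux only)."""
    path = os.path.realpath(path)
    best, fstype = "", None
    with open("/proc/mounts") as f:
        for line in f:
            parts = line.split()
            mnt, typ = parts[1], parts[2]
            # Longest matching mount-point prefix wins.
            if (mnt == "/" or path == mnt or path.startswith(mnt + "/")) \
                    and len(mnt) >= len(best):
                best, fstype = mnt, typ
    return fstype

# Common network filesystem types; extend as needed.
NETWORK_FS = {"nfs", "nfs4", "cifs", "fuse.sshfs", "glusterfs", "ceph"}

if __name__ == "__main__":
    t = fs_type_of(".")
    print(t, "-> network-mounted" if t in NETWORK_FS else "-> local (or unknown)")
```

If the checkpoint directory shows an NFS/CIFS-style type, the one-minute stall is very likely network I/O rather than CPU.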

Script to reproduce (from my training code)

import transformers
# is_deepspeed_zero3_enabled lives in transformers' DeepSpeed integration
# (transformers.integrations.deepspeed in recent versions).
from transformers.integrations import deepspeed


def safe_save_model_for_hf_trainer(
    trainer: transformers.Trainer, output_dir: str, bias="none"
):
    """Collect the state dict and dump it to disk."""
    # Check whether DeepSpeed ZeRO-3 mode is enabled; if so, the 16-bit
    # state dict must first be consolidated across all ranks.
    if deepspeed.is_deepspeed_zero3_enabled():
        state_dict = trainer.model_wrapped._zero3_consolidated_16bit_state_dict()
    else:
        if trainer.args.use_lora:
            # get_peft_state_maybe_zero_3 is a project-local helper (not shown).
            state_dict = get_peft_state_maybe_zero_3(
                trainer.model.named_parameters(), bias
            )
        else:
            state_dict = trainer.model.state_dict()
    # Only rank 0 writes the checkpoint to disk.
    if trainer.args.should_save and trainer.args.local_rank == 0:
        trainer._save(output_dir, state_dict=state_dict)
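As a possible workaround on my side (a sketch under my own assumptions, not code from this repo): write the checkpoint to fast local scratch first, then copy it to the slow network-mounted directory in a background thread, so the foreground process and terminal are not blocked by network I/O:

```python
import os
import shutil
import tempfile
import threading


def save_then_upload(save_fn, output_dir):
    """Run `save_fn(local_dir)` on local scratch, then copy to `output_dir`
    in a background thread. Returns the thread so the caller can join()
    before process exit."""
    scratch = tempfile.mkdtemp(prefix="ckpt_")
    save_fn(scratch)  # fast local write; blocks only briefly

    def upload():
        os.makedirs(output_dir, exist_ok=True)
        shutil.copytree(scratch, output_dir, dirs_exist_ok=True)
        shutil.rmtree(scratch)  # free local scratch after upload

    t = threading.Thread(target=upload, daemon=True)
    t.start()
    return t
```

Usage would be something like `save_then_upload(lambda d: trainer._save(d, state_dict=state_dict), output_dir)`; this assumes enough local disk for one checkpoint copy.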

Suggested resolutions:

1. Is RAM being overloaded during saving? If yes, please warn users about this when they create a small-RAM instance.

2. Is the checkpoint written to a remote disk, e.g. transferred to the Data disk over the network and consuming bandwidth? If yes, please expand the bandwidth or advise users accordingly.
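To confirm the bandwidth hypothesis concretely, a small diagnostic (my own sketch; it assumes sequential write speed roughly approximates checkpoint-save speed) is to time a synced write of a fixed amount of data into the checkpoint directory and compare against a local path:

```python
import os
import tempfile
import time


def write_throughput_mb_s(target_dir, size_mb=64):
    """Write `size_mb` MiB into `target_dir`, fsync, and return MiB/s."""
    chunk = b"\0" * (1 << 20)  # 1 MiB buffer
    fd, path = tempfile.mkstemp(dir=target_dir)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(size_mb):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # force data to the device/network
        return size_mb / (time.perf_counter() - start)
    finally:
        os.remove(path)
```

A large gap between the Data disk and a local path (e.g. tens vs hundreds of MiB/s) would point to network transfer as the bottleneck.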