kohya-ss / sd-scripts

OOM (RAM) when saving checkpoint during full finetuning #1749

Closed aiXander closed 3 weeks ago

aiXander commented 4 weeks ago

On my RTX 3090 + 32 GB RAM machine I'm able to train a FLUX LoRA just fine, and full fine-tuning also trains (using Adafactor); however, the script crashes when trying to save the first full checkpoint due to insufficient CPU RAM. Is there any way to reduce the peak memory usage when saving the transformer checkpoint to disk?

I'm using the following specs:

        cmd = [
            "accelerate", "launch",
            "--num_cpu_threads_per_process", "1",
            "--num_processes", "1",  # run on 1 gpu, remove this line for multi-gpu training
            str(root_dir / "sd-scripts" / "flux_train.py"),
            "--dataset_config", config['dataset_config'],
            "--pretrained_model_name_or_path", config['MODEL_PATH'],
            "--clip_l", config['CLIP_L_PATH'],
            "--t5xxl", config['T5XXL_PATH'],
            "--ae", config['AE_PATH'],
            "--cache_latents_to_disk",
            "--save_model_as", "safetensors",
            "--sdpa",
            "--persistent_data_loader_workers",
            "--max_data_loader_n_workers", "2",
            "--seed", config['seed'],
            "--gradient_checkpointing",
            "--mixed_precision", "bf16",
            "--save_precision", "bf16",
            "--optimizer_type", "adafactor",
            "--optimizer_args", "relative_step=False", "scale_parameter=False", "warmup_init=False",
            "--fused_backward_pass",  
            "--blocks_to_swap", "8", 
            "--full_bf16", 
            "--learning_rate", config['learning_rate'],
            "--lr_scheduler", "cosine",
            #"--lr_scheduler", "constant_with_warmup",
            #"--cache_text_encoder_outputs",
            "--cache_text_encoder_outputs_to_disk",
            "--max_grad_norm", "0.0", 
            "--text_encoder_batch_size", "4",
            "--highvram",
            "--max_train_steps", config['max_train_steps'],
            "--save_every_n_steps", config['save_every_n_steps'],
            "--sample_every_n_steps", config['sample_every_n_steps'],
            "--sample_prompts", config['eval_prompts'],
            "--sample_at_first",
            "--output_dir", str(config["output_dir"]),
            "--output_name", config["output_name"],
            "--timestep_sampling", "shift",
            "--discrete_flow_shift", "3.1582",
            "--model_prediction_type", "raw",
            "--guidance_scale", "1.0"]
kohya-ss commented 3 weeks ago

Please add the --mem_eff_save option. This uses a custom implementation of the model saving function instead of the safetensors library to reduce memory consumption when saving. Please reopen if the issue remains.
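
For anyone landing here with the same problem: a minimal sketch of how the suggested flag would slot into the launch command from the original post, assuming the same cmd list and config as above (the subprocess call is added here only for completeness, it is not part of the original snippet):

    # Sketch: append the memory-efficient save flag suggested above to the
    # existing flux_train.py launch command, then run it. `cmd` is the list
    # built in the original post.
    import subprocess

    cmd.append("--mem_eff_save")  # use sd-scripts' custom low-RAM saving path instead of the safetensors library writer

    subprocess.run(cmd, check=True)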