kohya-ss / sd-scripts

OOM (RAM) when saving checkpoint during full finetuning #1749

Closed aiXander closed 3 weeks ago

aiXander commented 4 weeks ago

On my RTX 3090 + 32 GB RAM machine I'm able to train a FLUX LoRA just fine, and full fine-tuning also trains (using Adafactor); however, the script crashes when trying to save the first full checkpoint due to insufficient CPU RAM. Is there any way to reduce the peak memory usage when saving the transformer checkpoint to disk?

I'm using the following specs:

        cmd = [
            "accelerate", "launch",
            "--num_cpu_threads_per_process", "1",
            "--num_processes", "1",  # run on 1 gpu, remove this line for multi-gpu training
            str(root_dir / "sd-scripts" / "flux_train.py"),
            "--dataset_config", config['dataset_config'],
            "--pretrained_model_name_or_path", config['MODEL_PATH'],
            "--clip_l", config['CLIP_L_PATH'],
            "--t5xxl", config['T5XXL_PATH'],
            "--ae", config['AE_PATH'],
            "--cache_latents_to_disk",
            "--save_model_as", "safetensors",
            "--sdpa",
            "--persistent_data_loader_workers",
            "--max_data_loader_n_workers", "2",
            "--seed", config['seed'],
            "--gradient_checkpointing",
            "--mixed_precision", "bf16",
            "--save_precision", "bf16",
            "--optimizer_type", "adafactor",
            "--optimizer_args", "relative_step=False", "scale_parameter=False", "warmup_init=False",
            "--fused_backward_pass",  
            "--blocks_to_swap", "8", 
            "--full_bf16", 
            "--learning_rate", config['learning_rate'],
            "--lr_scheduler", "cosine",
            #"--lr_scheduler", "constant_with_warmup",
            #"--cache_text_encoder_outputs",
            "--cache_text_encoder_outputs_to_disk",
            "--max_grad_norm", "0.0", 
            "--text_encoder_batch_size", "4",
            "--highvram",
            "--max_train_steps", config['max_train_steps'],
            "--save_every_n_steps", config['save_every_n_steps'],
            "--sample_every_n_steps", config['sample_every_n_steps'],
            "--sample_prompts", config['eval_prompts'],
            "--sample_at_first",
            "--output_dir", str(config["output_dir"]),
            "--output_name", config["output_name"],
            "--timestep_sampling", "shift",
            "--discrete_flow_shift", "3.1582",
            "--model_prediction_type", "raw",
            "--guidance_scale", "1.0"]
kohya-ss commented 3 weeks ago

Please add the --mem_eff_save option. This uses a custom implementation of the model saving function instead of the safetensors library to reduce memory consumption when saving. Please reopen if the issue remains.
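
For anyone landing here with the same problem: a minimal sketch of how the suggested flag would slot into the launch command from the original post, assuming the same cmd list and config as above (the subprocess call is added here only for completeness, it is not part of the original snippet):

    # Sketch: append the memory-efficient save flag suggested above to the
    # existing flux_train.py launch command, then run it. `cmd` is the list
    # built in the original post.
    import subprocess

    cmd.append("--mem_eff_save")  # use sd-scripts' custom low-RAM saving path instead of the safetensors library writer

    subprocess.run(cmd, check=True)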