huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Does DeepSpeed + Accelerate Support Pipeline Parallelism? #2838

sam-h-bean commented 3 weeks ago

I have been trying a number of pipeline parallelism configs in DeepSpeed, like the following:

{
    "fp16": {
        "enabled": true
    },
    "bf16": {
        "enabled": false
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "gather_16bit_weights_on_model_save": true,
        "round_robin_gradients": true,
        "reduce_scatter": true,
        "zero_quantized_weights": true,
        "zero_hpz_partition_size": 8,
        "zero_quantized_gradients": true
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": "auto",
    "steps_per_print": 1,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false,
    "flops_profiler": {
        "enabled": true,
        "profile_step": 10,
        "module_depth": -1,
        "top_modules": 1,
        "detailed": true
    },
    "pipeline": {
        "stages": 8,
        "partition_method": "uniform"
    }
}

I can see the pipeline section being displayed in my training logs when DeepSpeed prints its full configuration. However, the changes I make to the pipeline section appear to have no effect on training. I am wondering if these config options are being silently dropped by Accelerate. I'm also curious whether others have found ways to get some introspection into how PP is working with DeepSpeed + Accelerate.
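
For reference, here is the kind of introspection I have in mind, as a minimal sketch (the model, optimizer, and dataset below are placeholders, not my actual training code): after accelerator.prepare(), look at the resolved config that Accelerate holds and check whether the wrapped model is actually a DeepSpeed PipelineEngine.

import torch
import torch.nn as nn
from accelerate import Accelerator
from deepspeed.runtime.pipe.engine import PipelineEngine

# Launched via `accelerate launch` pointing at the DeepSpeed config above.
# Toy model/data only; the hidden-size-dependent "auto" values in the ZeRO-3 section
# expect a model that exposes a config with hidden_size (e.g. a Transformers model).
accelerator = Accelerator()
model = nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = torch.utils.data.TensorDataset(torch.randn(64, 16), torch.randn(64, 16))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

# As I understand it, this is the config dict Accelerate hands to deepspeed.initialize,
# with "auto" values resolved during prepare().
ds_config = accelerator.state.deepspeed_plugin.deepspeed_config
print("pipeline section:", ds_config.get("pipeline"))

# deepspeed.initialize only builds a PipelineEngine when the model it receives is a
# PipelineModule, so this check shows whether pipeline parallelism is actually in play.
print("is pipeline engine:", isinstance(model, PipelineEngine))

If PP were being set up, I would expect the isinstance check to come back True.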

sam-h-bean commented 3 weeks ago

It seems from the docs that this limitation is called out in a caveat. It might make sense to crash loudly when someone tries to configure PP directly. Also, what is the plan for integrating PP into Accelerate?
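
For comparison, this is roughly what driving PP through DeepSpeed directly (no Accelerate) looks like, going by the DeepSpeed pipeline tutorial. It is only a sketch: the layer stack, sizes, stage count, and data are placeholders. The point is that, as far as I can tell, the stage count and partitioning are set on PipelineModule itself rather than through the JSON config, and PP only combines with ZeRO stage 0/1, not the stage 3 setup above.

import torch
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule

deepspeed.init_distributed()

# Toy layer stack to split across pipeline stages; a real model's layers would go here.
layers = [nn.Linear(512, 512) for _ in range(8)]
net = PipelineModule(
    layers=layers,
    num_stages=2,                  # placeholder; must divide the world size
    partition_method="uniform",
    loss_fn=nn.MSELoss(),
)

optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

ds_config = {
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 4,  # with 2 stages on 2 GPUs: 8 micro-batches per step
}

engine, _, _, _ = deepspeed.initialize(model=net, optimizer=optimizer, config=ds_config)

# The pipeline engine pulls micro-batches itself from an iterator of (input, label) pairs.
data = [(torch.randn(4, 512), torch.randn(4, 512)) for _ in range(64)]
loss = engine.train_batch(data_iter=iter(data))

Having Accelerate either wire something like this up or fail loudly when it sees a "pipeline" section would make the current behavior much less surprising.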