microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

configuration setting problems for parameters partitioning in training #6420

Open Liz178 opened 3 weeks ago

Liz178 commented 3 weeks ago

Hello there, I am a beginner with DeepSpeed. I am currently using DeepSpeed ZeRO-3 to train LLaVA-1.5-13B and profiling the training process.

After setting "stage3_prefetch_bucket_size", "stage3_param_persistence_threshold", and "reduce_bucket_size" in the ZeRO-3 config, I can see that the size of each reduce-scatter is close to "reduce_bucket_size" (which I can confirm from the code). However, it is not clear to me what determines the size of each all_gather operation. Which configuration option is it related to? Setting "allgather_bucket_size" does not seem to have any effect.

Also, what do persistent parameters mean? Does it mean that during parameter partitioning, the parameters of each sub-module are accumulated until they reach the preset persistence threshold, and are only partitioned once they exceed that threshold?
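For reference, here is a minimal sketch of the ZeRO-3 section of my config; the numeric values are illustrative placeholders, not my exact settings:

```python
# Illustrative ZeRO-3 config sketch; bucket/threshold values are placeholders,
# not the exact settings from my run.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        # size (in elements) of each gradient reduce-scatter bucket
        "reduce_bucket_size": 5e8,
        # knobs I expected to influence the all_gather sizes
        "stage3_prefetch_bucket_size": 5e7,
        "stage3_param_persistence_threshold": 1e5,
        # this one does not appear to change the all_gather sizes I observe
        "allgather_bucket_size": 5e8,
    },
}

print(json.dumps(ds_config, indent=2))
```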

tjruwase commented 2 weeks ago

@Liz178, the configuration knobs most related to all-gather are stage3_prefetch_bucket_size, stage3_param_persistence_threshold, stage3_max_live_parameters, and stage3_max_reuse_distance.

Please see below for descriptions of ZeRO-3 configuration: https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training
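For concreteness, here is an illustrative ZeRO-3 snippet with those knobs; the values are placeholders (see the linked docs for the defaults), and the comments summarize how each knob relates to parameter all-gathers:

```python
# Illustrative ZeRO-3 settings; values are placeholders, not recommendations.
zero3_config = {
    "zero_optimization": {
        "stage": 3,
        # Upper bound (in parameter elements) on how much is prefetched ahead
        # of use, so it bounds the size of the prefetch all_gathers.
        "stage3_prefetch_bucket_size": 5e7,
        # Parameters smaller than this (in elements) are not partitioned: they
        # stay "persistent" in full on every rank and need no all_gather.
        "stage3_param_persistence_threshold": 1e5,
        # Maximum number of parameter elements kept live (gathered) per GPU
        # before DeepSpeed starts releasing them again.
        "stage3_max_live_parameters": 1e9,
        # A gathered parameter is not released if it will be reused within
        # this many elements, which avoids re-gathering it.
        "stage3_max_reuse_distance": 1e9,
    }
}
```

This dict can be merged into your existing DeepSpeed config (either passed to deepspeed.initialize or saved as JSON and referenced via --deepspeed_config).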