Liz178 opened 3 weeks ago
@Liz178, the configuration knobs most related to all-gather are `stage3_prefetch_bucket_size`, `stage3_param_persistence_threshold`, `stage3_max_live_parameters`, and `stage3_max_reuse_distance`.
Please see the descriptions of the ZeRO-3 configuration options here: https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training
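For concreteness, these knobs all live in the `zero_optimization` section of the DeepSpeed config JSON. A minimal sketch (the numeric values below are illustrative, not tuned recommendations):

```json
{
  "zero_optimization": {
    "stage": 3,
    "reduce_bucket_size": 5e8,
    "stage3_prefetch_bucket_size": 5e7,
    "stage3_param_persistence_threshold": 1e5,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9
  }
}
```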
Hello there, I am a beginner with DeepSpeed. I am currently using DeepSpeed ZeRO-3 to train LLaVA-1.5-13B and profiling the training process.
After setting `stage3_prefetch_bucket_size`, `stage3_param_persistence_threshold`, and `reduce_bucket_size` in the ZeRO-3 config, I can see that the size of each reduce-scatter is close to `reduce_bucket_size` (which I can verify from the code). However, the size of each all-gather operation is not clear to me. I was wondering which configuration option it is related to; setting `allgather_bucket_size` does not seem to have any effect.
Plus, I was wondering what "persistent parameters" means. Does it mean that, during parameter partitioning, the parameters from each sub-module are accumulated until their total size reaches the preset persistence threshold, and are then partitioned once the threshold is exceeded?
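To make my confusion concrete, here is a small sketch of the other interpretation I can imagine: a per-parameter size check, where each parameter tensor smaller than the threshold stays un-partitioned ("persistent") on every rank, and larger ones are sharded. This is purely my guess, and the function and variable names here are hypothetical, not DeepSpeed APIs:

```python
def classify_params(param_numels, persistence_threshold):
    """Guess at per-parameter semantics: split parameter tensors into
    "persistent" (kept whole on every rank) vs. "partitioned" (sharded
    across data-parallel ranks) by element count."""
    persistent, partitioned = [], []
    for name, numel in param_numels.items():
        if numel < persistence_threshold:
            # Small tensor: kept un-partitioned to avoid many tiny all-gathers?
            persistent.append(name)
        else:
            # Large tensor: sharded across ranks.
            partitioned.append(name)
    return persistent, partitioned
```

Is this per-parameter reading correct, or is it the accumulate-until-threshold behavior I described above?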