microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

configuration setting problems for parameters partitioning in training #6420

Open Liz178 opened 3 weeks ago

Liz178 commented 3 weeks ago

Hello there, I am a beginner with DeepSpeed. I am currently using DeepSpeed ZeRO-3 to train LLaVA-1.5-13B and profiling the training process.

After setting "stage3_prefetch_bucket_size", "stage3_param_persistence_threshold", and "reduce_bucket_size" in the ZeRO-3 config, I can see that the size of each reduce-scatter is close to "reduce_bucket_size" (which I can confirm from the code). However, it is not clear to me what determines the size of each all_gather operation. Which configuration option is it related to? Setting "allgather_bucket_size" does not seem to have any effect.

Also, what do persistent parameters mean? Does it mean that during parameter partitioning, the parameters of each sub-module are accumulated until they reach the preset persistence threshold, and are only partitioned once they exceed that threshold?
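For reference, here is a minimal sketch of the ZeRO-3 section of my config; the numeric values are illustrative placeholders, not my exact settings:

```python
# Illustrative ZeRO-3 config sketch; bucket/threshold values are placeholders,
# not the exact settings from my run.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        # size (in elements) of each gradient reduce-scatter bucket
        "reduce_bucket_size": 5e8,
        # knobs I expected to influence the all_gather sizes
        "stage3_prefetch_bucket_size": 5e7,
        "stage3_param_persistence_threshold": 1e5,
        # this one does not appear to change the all_gather sizes I observe
        "allgather_bucket_size": 5e8,
    },
}

print(json.dumps(ds_config, indent=2))
```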

tjruwase commented 2 weeks ago

@Liz178, the configuration knobs most related to all-gather are stage3_prefetch_bucket_size, stage3_param_persistence_threshold, stage3_max_live_parameters, and stage3_max_reuse_distance.

Please see below for descriptions of ZeRO-3 configuration: https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training
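For concreteness, here is an illustrative ZeRO-3 snippet with those knobs; the values are placeholders (see the linked docs for the defaults), and the comments summarize how each knob relates to parameter all-gathers:

```python
# Illustrative ZeRO-3 settings; values are placeholders, not recommendations.
zero3_config = {
    "zero_optimization": {
        "stage": 3,
        # Upper bound (in parameter elements) on how much is prefetched ahead
        # of use, so it bounds the size of the prefetch all_gathers.
        "stage3_prefetch_bucket_size": 5e7,
        # Parameters smaller than this (in elements) are not partitioned: they
        # stay "persistent" in full on every rank and need no all_gather.
        "stage3_param_persistence_threshold": 1e5,
        # Maximum number of parameter elements kept live (gathered) per GPU
        # before DeepSpeed starts releasing them again.
        "stage3_max_live_parameters": 1e9,
        # A gathered parameter is not released if it will be reused within
        # this many elements, which avoids re-gathering it.
        "stage3_max_reuse_distance": 1e9,
    }
}
```

This dict can be merged into your existing DeepSpeed config (either passed to deepspeed.initialize or saved as JSON and referenced via --deepspeed_config).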