microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

update universal_checkpointing/README.md #395

Closed inkcherry closed 4 months ago

inkcherry commented 5 months ago

Is my understanding correct? Due to a compatibility issue with PP, for example when using DeepSpeed SP, PP needs to be disabled, which means checkpoints that previously relied on PP cannot be used directly.

samadejacobs commented 4 months ago

@inkcherry, UCP supports converting PP checkpoints to/from other parallelism topologies (ZeRO-DP, SP, TP, etc.); however, training with the SP/PP combination, with or without UCP, has not been tested.
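
For reference, the conversion step is typically done with DeepSpeed's ds_to_universal.py script, roughly as sketched below. Paths and step numbers are placeholders, and the exact script location and options may vary by DeepSpeed version, so check the script's --help:

```bash
# Convert a ZeRO/3D-parallel checkpoint into the universal (UCP) format.
# Paths below are placeholders; the script lives under deepspeed/checkpoint/
# in the DeepSpeed repo.
python deepspeed/checkpoint/ds_to_universal.py \
    --input_folder  /path/to/ckpt/global_step1000 \
    --output_folder /path/to/ckpt/global_step1000_universal

# Resume training on a (possibly different) parallel topology by loading the
# converted checkpoint and passing --universal-checkpoint, as in the
# examples_deepspeed/universal_checkpointing scripts.
deepspeed pretrain_gpt.py \
    ... \
    --load /path/to/ckpt \
    --universal-checkpoint
```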

inkcherry commented 4 months ago

Thank you for the explanation, @samadejacobs. Yes, I believe using SP without PP would be more stable, so I tried the following (sketched as commands after the list):

  1. Train the model with PP (without --no-pipeline-parallel, from some past workloads).
  2. Convert the checkpoint with UCP.
  3. Fine-tune the model without PP (with --no-pipeline-parallel and DeepSpeed SP).
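
In commands, the attempted sequence looks roughly like this. Paths, sizes, and elided arguments are placeholders, and --ds-sequence-parallel-size is assumed here to be the DeepSpeed SP flag; only --no-pipeline-parallel and --universal-checkpoint are taken directly from the discussion:

```bash
# Step 1: original training run with pipeline parallelism enabled
# (i.e. --no-pipeline-parallel NOT passed).
deepspeed pretrain_gpt.py ... \
    --pipeline-model-parallel-size 2 \
    --save /path/to/ckpt_pp

# Step 2: convert the PP checkpoint to the universal format
# (this is the step where the crash occurred).
python deepspeed/checkpoint/ds_to_universal.py \
    --input_folder  /path/to/ckpt_pp/global_step1000 \
    --output_folder /path/to/ckpt_pp/global_step1000_universal

# Step 3: fine-tune without PP, with DeepSpeed sequence parallelism.
deepspeed pretrain_gpt.py ... \
    --no-pipeline-parallel \
    --ds-sequence-parallel-size 2 \
    --load /path/to/ckpt_pp \
    --universal-checkpoint
```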

But I encountered a crash at step 2: the weight names have changed, so UCP does not work in this case, as mentioned in the documentation change. :)