volcengine / veScale

A PyTorch Native LLM Training Framework
http://vescale.xyz
Apache License 2.0

[QUESTION] Using ndtimeline-tool to Monitor Megatron-GPT #51

Open zmtttt opened 2 weeks ago

zmtttt commented 2 weeks ago

I want to use the ndtimeline tool to monitor the computation and communication of each rank in Megatron-GPT. I have two concerns:

1. Initialization is required before calling init_ndtimeline. Would this conflict with Megatron's own initialize_megatron function? Both involve operations on process groups, so this could cause communication issues later on (see the sketch after this list).

2. The interfaces of Megatron-LM and veScale are different. How can I integrate the computational interfaces, such as major-metrics, tp-stream-metrics, dp-stream-metrics, pp-batch-stream-metrics, and pp-forward-stream-metrics? Has anyone successfully used the ndtimeline tool with Megatron-GPT before?
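
For concreteness, here is a minimal sketch of the ordering in question. It assumes init_ndtimeline is safe to call after torch.distributed and Megatron's process groups already exist and that it attaches to those groups rather than creating new ones; the import paths and the (omitted) arguments of init_ndtimeline are assumptions and must be checked against vescale/ndtimeline and your Megatron-LM version.

```python
# Hypothetical ordering sketch, not a verified integration.
# Assumptions: init_ndtimeline reuses the process groups Megatron creates
# instead of building its own; its real arguments (omitted here) are
# defined in vescale/ndtimeline. The initialize_megatron import path
# varies across Megatron-LM versions.
from megatron.initialize import initialize_megatron
from vescale.ndtimeline import init_ndtimeline  # assumed import path


def setup_training():
    # 1) Let Megatron initialize torch.distributed and its TP/PP/DP groups.
    initialize_megatron()
    # 2) Only then set up ndtimeline, so it observes the groups Megatron
    #    created instead of racing with initialize_megatron.
    init_ndtimeline()  # arguments omitted; see vescale/ndtimeline
```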

thanks!

zmtttt commented 1 week ago


(1) My progress: I modified the ndtimeline init, as well as p2p_communication.py and schedules.py in Megatron, but failed to get a correct timeline. (2) Why is that? I also wondered why instructions need to be registered in ndtimeline/pipedream_flush.py. I did not register instructions; instead I decorate the communication functions in megatron/core/pipeline_parallel/schedules.py, e.g. @ndtimer(SEND_BACKWARD) on send_backward(input_tensor_grads, tensor_shapes, config), and all interfaces use the same method (sketched below). @vocaltract
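
For reference, a minimal sketch of the decorator approach described above. It assumes ndtimer and the SEND_BACKWARD metric name are importable from veScale's ndtimeline package (the import paths are guesses), and the function body is a simplified stand-in for the real helper in megatron/core/pipeline_parallel/schedules.py.

```python
# Sketch of timing a Megatron-LM p2p helper with the ndtimeline decorator,
# as described in the comment above. Both ndtimeline import paths are
# assumptions; adjust them to wherever your veScale checkout exposes
# ndtimer and the SEND_BACKWARD metric name.
from vescale.ndtimeline import ndtimer, SEND_BACKWARD

from megatron.core.pipeline_parallel import p2p_communication


@ndtimer(SEND_BACKWARD)
def send_backward(input_tensor_grads, tensor_shapes, config):
    # Same signature as the Megatron-LM helper in schedules.py; the body is
    # a simplified stand-in. The decorator wraps the call so ndtimeline can
    # record the send's start/end on this rank.
    if not isinstance(input_tensor_grads, list):
        input_tensor_grads = [input_tensor_grads]
    for input_tensor_grad, tensor_shape in zip(input_tensor_grads, tensor_shapes):
        if tensor_shape is None:
            continue
        p2p_communication.send_backward(input_tensor_grad, config)

# recv_forward, send_forward, and recv_backward in the same file would be
# decorated the same way, each with its own metric name.
```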

[screenshot attached: wrong-megatron]