volcengine / veScale

A PyTorch Native LLM Training Framework
http://vescale.xyz
Apache License 2.0

Using ndtimeline-tool to Monitor Megatron-GPT #53

Open zmtttt opened 2 weeks ago

zmtttt commented 2 weeks ago
> Using ndtimeline-tool to Monitor Megatron-GPT: I want to use the ndtimeline-tool to monitor the computation and communication of each rank in Megatron-GPT. I have two concerns:

1. Before calling init_ndtimeline, some initialization is required. Would this conflict with Megatron's own initialize_megatron function? Both involve operations on process groups, so this could cause communication issues later on.

2. The interfaces of Megatron-LM and veScale are different. How can I integrate the computational interfaces, such as major-metrics, tp-stream-metrics, dp-stream-metrics, pp-batch-stream-metrics, and pp-forward-stream-metrics? Has anyone successfully used ndtimeline-tool with Megatron-GPT before?

Thanks!

My progress so far:

(1) I modified ndtimeline/init, p2p_communication.py, and schedules.py, but failed to get a correct timeline.

(2) Why is it necessary to register instructions in ndtimeline/pipedream_flush.py? I do not register instructions. Instead, I decorate the functions in megatron/core/pipeline_parallel/schedules.py directly, for example @ndtimer(SEND_BACKWARD) def send_backward(input_tensor_grads, tensor_shapes, config), and all interfaces use the same method.
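For readers following along, the decorator approach described above can be sketched without veScale installed. The snippet below is a stand-in and not the ndtimeline API: `timed`, `SPANS`, and the metric string are hypothetical names chosen for illustration. It records host-side wall-clock spans only, so it would not capture asynchronous GPU work the way a CUDA-event-based timer could.

```python
import functools
import time
from collections import defaultdict

# Hypothetical stand-in for a metric registry: metric name -> list of (start, end) spans.
SPANS = defaultdict(list)


def timed(metric):
    """Decorator factory: record one (start, end) span per call under `metric`."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # Record the span even if the wrapped function raises.
                SPANS[metric].append((start, time.perf_counter()))
        return wrapper
    return decorator


# Metric name invented for this sketch; it is not the constant ndtimeline defines.
SEND_BACKWARD = "send-backward"


@timed(SEND_BACKWARD)
def send_backward(input_tensor_grads, tensor_shapes, config):
    # Placeholder body; in Megatron this function would issue the p2p send.
    return len(input_tensor_grads)


send_backward([0.1, 0.2], tensor_shapes=None, config=None)
```

Because the span is keyed only by the metric name, decorating several functions with the same metric merges their spans into one list, which is one way a timeline can come out looking wrong.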

[Attached screenshot: wrong-megatron]

Originally posted by @zmtttt in https://github.com/volcengine/veScale/issues/51#issuecomment-2328217367

mzxcpp commented 4 days ago

I'm also interested in using ndtimeline for my code, but struggling to modify it.

zmtttt commented 11 hours ago

> I'm also interested in using ndtimeline for my code, but struggling to modify it.

I have solved it by adding ndtimeline_init and modifying schedules and p2p-communication with the correct metrics. However, I have only got this working on a single machine, and I wonder how to use it across multiple machines.
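A note for readers on the multi-machine question: this thread does not establish what ndtimeline itself needs across nodes. What is generally true is that a `torchrun` launch exposes `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` to every process, so per-rank output can at least be tagged with a global identity. A minimal sketch under that assumption follows; `rank_info` and `timeline_filename` are hypothetical helpers, not part of ndtimeline or Megatron.

```python
import os


def rank_info(env=None):
    """Read torchrun-style rank variables, defaulting to a single-process run."""
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }


def timeline_filename(info, prefix="timeline"):
    """Hypothetical naming scheme so files written on different nodes never collide."""
    return f"{prefix}_rank{info['rank']:04d}_of{info['world_size']:04d}.json"


# Example: the second GPU on the second node of a 2-node x 8-GPU job.
info = rank_info({"RANK": "9", "LOCAL_RANK": "1", "WORLD_SIZE": "16"})
print(timeline_filename(info))
```

Keying output on the global rank rather than the local rank matters here, because local ranks repeat on every node.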

mzxcpp commented 10 hours ago

> > I'm also interested in using ndtimeline for my code, but struggling to modify it.
>
> I have solved it by adding ndtimeline_init and modifying schedules and p2p-communication with the correct metrics. However, I have only got this working on a single machine, and I wonder how to use it across multiple machines.

Great try! I'm running some experiments on Megatron's and DeepSpeed's PP and am also struggling to monitor the communication. I hope the official team or you can provide an example for custom users. I think a more general, stand-alone toolkit would be better for the development of the community.