Closed: zmtttt closed this issue 1 month ago
I'm also interested in using ndtimeline for my code, but I'm struggling to modify it.
I have solved it by adding ndtimeline_init and modifying the schedules and p2p-communication code with the correct metrics. However, I have only gotten this working on a single machine; I wonder how to use it with multiple machines.
Great try! I'm running some experiments on Megatron and DeepSpeed's PP and am struggling to monitor the communications as well. I hope the official team or you can provide an example for custom users. I think a more general, stand-alone toolkit would be better for the development of the community.
(1) My progress: I modified ndtimeline/init, p2p_communication.py, and schedule.py, but failed to get the right timeline. (2) Why? I wonder why instructions need to be registered in ndtimeline/pipedream_flush.py. I do not register instructions; instead I use @ndtimer(SEND_BACKWARD) def send_backward(input_tensor_grads, tensor_shapes, config) in megatron/core/pipeline_parallel/schedules.py, and all interfaces use the same method.
Originally posted by @zmtttt in https://github.com/volcengine/veScale/issues/51#issuecomment-2328217367
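For readers following the decorator approach described above, here is a minimal stand-alone sketch of what wrapping send_backward with a timing decorator could look like. It is not the ndtimeline API: `simple_timer`, `recorded_spans`, and the `"send-backward"` metric string are hypothetical names made up for illustration, and the body of `send_backward` is a placeholder rather than Megatron's implementation. It also only measures host-side wall-clock time, so asynchronous GPU communication would need device-side timing to be captured accurately.

```python
import functools
import time
from collections import defaultdict

# Hypothetical metric label, mirroring the SEND_BACKWARD name used in the comment above.
SEND_BACKWARD = "send-backward"

# Maps a metric name to the list of (start, end) spans recorded for it, in seconds.
_SPANS = defaultdict(list)


def simple_timer(metric):
    """Hypothetical stand-in for a decorator such as @ndtimer(metric)."""

    def decorator(fn):
        @functools.wraps(fn)  # keep the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # Record the span even if the wrapped call raises.
                _SPANS[metric].append((start, time.perf_counter()))

        return wrapper

    return decorator


@simple_timer(SEND_BACKWARD)
def send_backward(input_tensor_grads, tensor_shapes, config):
    # Placeholder body: the real function would issue the P2P send here.
    return len(input_tensor_grads)


def recorded_spans(metric):
    """Return a copy of the spans recorded so far for one metric."""
    return list(_SPANS[metric])
```

Calling `send_backward(...)` once and then `recorded_spans(SEND_BACKWARD)` should return a single (start, end) pair. Applying the same decorator with a different metric label to the other send/recv helpers would follow the "all interfaces use the same method" pattern mentioned in the comment.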