volcengine / veScale

A PyTorch Native LLM Training Framework
http://vescale.xyz
Apache License 2.0
662 stars 34 forks

[QUESTION] How to use MQHandler for multiple machines? #56

Closed. zmtttt closed this issue 1 month ago.

zmtttt commented 1 month ago

Hello! I was wondering how to use this across multiple machines. The documentation says: “In case you need a tracing file related to ranks on different machines, you can implement an MQHandler by yourself and send all metrics to a central storage. This provides you with a method to filter and generate the tracing file for specified ranks.” Does "MQ" in MQHandler stand for message queue? What is the central storage, and how do I set it up? Have you evaluated the time overhead? @Meteorix @MingjiHan99 @pengyanghua @MackZackA
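
For reference, here is how I currently read the quoted passage, as a minimal sketch and not veScale's actual API. Every name in it (`MetricRecord`, `InMemoryBroker`, `MQHandler`, `build_chrome_trace`) is hypothetical. I am assuming that each rank serializes its metrics and publishes them to a shared queue, and that a separate collector filters by rank and writes a Chrome-trace-style file. The in-process `queue.Queue` only stands in for a real message queue that would be reachable from every machine.

```python
import json
import queue
from dataclasses import dataclass, asdict


@dataclass
class MetricRecord:
    # Hypothetical record layout; the real metric fields may differ.
    rank: int
    name: str
    start_us: float
    duration_us: float


class InMemoryBroker:
    """Stand-in for a real message queue shared by all machines (assumption)."""

    def __init__(self):
        self._q = queue.Queue()

    def publish(self, payload: str) -> None:
        self._q.put(payload)

    def drain(self) -> list:
        # Pull every message currently in the queue without blocking.
        items = []
        while True:
            try:
                items.append(self._q.get_nowait())
            except queue.Empty:
                return items


class MQHandler:
    """Hypothetical per-rank handler: serializes a metric and publishes it."""

    def __init__(self, broker: InMemoryBroker, rank: int):
        self.broker = broker
        self.rank = rank

    def __call__(self, name: str, start_us: float, duration_us: float) -> None:
        record = MetricRecord(self.rank, name, start_us, duration_us)
        self.broker.publish(json.dumps(asdict(record)))


def build_chrome_trace(payloads, ranks) -> dict:
    """Collector side: keep only the requested ranks, emit complete ('X') events."""
    events = []
    for payload in payloads:
        rec = json.loads(payload)
        if rec["rank"] in ranks:
            events.append({
                "name": rec["name"],
                "ph": "X",              # complete event: has both ts and dur
                "ts": rec["start_us"],  # microseconds
                "dur": rec["duration_us"],
                "pid": rec["rank"],     # one trace "process" per rank
                "tid": 0,
            })
    return {"traceEvents": events}


if __name__ == "__main__":
    broker = InMemoryBroker()
    MQHandler(broker, rank=0)("forward", 0.0, 10.0)
    MQHandler(broker, rank=1)("forward", 1.0, 12.0)
    trace = build_chrome_trace(broker.drain(), ranks={0})
    print(json.dumps(trace, indent=2))
```

Is this roughly the intended design, with the broker replaced by whatever message queue and storage are available in the cluster?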

MackZackA commented 1 month ago

Thank you for your interest in veScale! For this question, I would like to refer you to @vocaltract, who is an expert on MQHandler.

MackZackA commented 1 month ago

This question appears to be a duplicate of https://github.com/volcengine/veScale/issues/55. Closed.