Do we have to trace a real system in order to analyze a 10,000-GPU training system?

mlcommons / chakra

Repository for MLCommons Chakra schema and tools

https://mlcommons.org/working-groups/research/chakra/

Apache License 2.0


Do we have to trace a real system in order to analyze a 10,000-GPU training system? #164

Open basicmi opened 1 month ago

basicmi commented 1 month ago

According to the ASTRA-sim 2.0 paper, the simulator runs on Chakra traces in order to "decouple parallelization strategies from the ASTRAsim implementation". Does that mean we have to trace a real 10,000-GPU AI training system before we can simulate and analyze a system at that scale?

Thanks

191220042 commented 5 days ago

I have the same question.