Open boweiliu opened 3 months ago
Thank you very much for the recommendation, @boweiliu
I have it on the list, but didn't have a chance to read it yet.
Your list sounds like a good fit for the content of this repo.
The garbage collection issue outlined in this paper (section 6.3, MFU decreasing) also matches the observation from the Imbue blog (the MFU graph gradually sagged downward over the course of a run, but returned to 100% upon any restart).
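For anyone skimming this thread, here is a rough sketch of one common way to keep garbage collection from drifting between ranks: turn off Python's automatic collector and collect only at fixed step counts, so every rank pauses at the same point. This is an illustration under my own assumptions, not code from the paper or the blog; `train`, `step_fn`, and `gc_every` are hypothetical names.

```python
import gc


def train(num_steps, step_fn, gc_every=100):
    """Run step_fn for num_steps, collecting garbage only at fixed steps.

    step_fn and gc_every are hypothetical placeholders for a real training
    step and a tuned collection interval.
    """
    gc.disable()  # stop automatic collections that fire at unpredictable times
    collected_at = []
    try:
        for step in range(1, num_steps + 1):
            step_fn(step)
            if step % gc_every == 0:
                # Every rank reaches this line at the same step count,
                # so the pause is aligned instead of creating stragglers.
                gc.collect()
                collected_at.append(step)
    finally:
        gc.enable()  # restore the default behaviour when training ends
    return collected_at


if __name__ == "__main__":
    print(train(250, lambda step: None, gc_every=100))
```

The interval is a trade-off: collecting too rarely lets memory grow, collecting too often adds pauses, so it would need tuning per workload.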
Have you read https://arxiv.org/pdf/2402.15627 already?
There are a lot of details in the later sections that deal with ML training in practice -- garbage collection, auto-restarting, IB over ethernet issues, etc.