stas00 / ml-engineering

Machine Learning Engineering Open Book
https://stasosphere.com/machine-learning/
Creative Commons Attribution Share Alike 4.0 International

Adding another logbook (kinda) #52

Open boweiliu opened 3 months ago

boweiliu commented 3 months ago

Have you read https://arxiv.org/pdf/2402.15627 already?

There are a lot of details in the later sections that deal with ML training in practice: garbage collection, auto-restarting, IB over Ethernet issues, etc.

stas00 commented 3 months ago

Thank you very much for the recommendation, @boweiliu

I have it on the list, but haven't had a chance to read it yet.

The topics you list sound like a good fit for the content of this repo.

yaolu commented 1 month ago

The garbage collection issue outlined in this paper (section 6.3, MFU decreasing) also matches the observation from the Imbue blog:

"MFU graph gradually sagged downward over the course of a run, but returned to 100% upon any restart"
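
For anyone landing here, a minimal sketch of one common mitigation for this kind of sag: turn off Python's automatic garbage collector and collect only at a fixed step interval, so that every rank pauses at the same step instead of at random ones. This is an assumption about a typical fix, not code taken from the paper or the Imbue post; `ManualGC` and `interval` are hypothetical names used for illustration.

```python
import gc


class ManualGC:
    """Hypothetical helper: replace automatic GC with collection at fixed steps.

    Assumption: every rank runs the same step counter, so calling
    `step(i)` with the same `interval` everywhere lines up the GC pauses.
    """

    def __init__(self, interval: int = 100) -> None:
        if interval <= 0:
            raise ValueError("interval must be a positive integer")
        self.interval = interval
        gc.disable()   # stop the collector from firing at arbitrary points
        gc.collect()   # start from a clean state

    def step(self, step: int) -> bool:
        """Collect if `step` is a multiple of the interval; report whether we did."""
        if step % self.interval == 0:
            gc.collect()
            return True
        return False

    def close(self) -> None:
        """Restore the default automatic collection."""
        gc.enable()


if __name__ == "__main__":
    manual_gc = ManualGC(interval=50)
    try:
        for i in range(1, 201):
            # forward/backward/optimizer step would go here
            manual_gc.step(i)
    finally:
        manual_gc.close()
```

The interval is a trade-off: too large and garbage builds up between collections, too small and the pauses themselves cost throughput, so it likely needs tuning per workload.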