Closed by LWprogramming 1 year ago
@LWprogramming :pray: we should definitely get the accelerate waits in!
however, i think we should not scale the learning rate by the number of GPUs, and should just leave that up to the researcher to pass in. that paper is a bit dated, and i don't believe the relationship is linear. as an example to think about: if it took 25k GPUs to train GPT-4, what should their learning rate be?
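to make the concern concrete, here is a minimal sketch of the two options. `resolve_learning_rate` and its parameters are hypothetical names for illustration, not part of this PR or of any library. the linear rule from the linked paper multiplies the learning rate by the same factor as the batch size; the sketch assumes a fixed per-GPU batch, so that factor equals the GPU count.

```python
def resolve_learning_rate(base_lr, num_processes=1, scale_with_num_processes=False):
    """Return the learning rate to hand to the optimizer.

    Hypothetical helper, for illustration only.
    - Default: use exactly what the researcher passed in.
    - Opt-in: apply the linear scaling rule (lr * number of processes),
      assuming each process sees a fixed-size batch.
    """
    if base_lr <= 0:
        raise ValueError("base_lr must be positive")
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    if scale_with_num_processes:
        return base_lr * num_processes
    return base_lr


# Researcher-controlled: the value is used untouched on 8 GPUs.
lr_plain = resolve_learning_rate(3e-4, num_processes=8)

# Linear rule on 8 GPUs: 8x the base value.
lr_scaled_8 = resolve_learning_rate(3e-4, num_processes=8, scale_with_num_processes=True)

# Linear rule on 25k GPUs: roughly 7.5, far outside any usual range.
lr_scaled_25k = resolve_learning_rate(3e-4, num_processes=25_000, scale_with_num_processes=True)
```

keeping the scaling opt-in (or dropping it entirely) means the trainer never silently changes a value the researcher chose.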
@LWprogramming decided to get the wait invocations in; thanks for the PR again!
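for readers unfamiliar with the wait invocations, a rough sketch of the usual pattern follows. it assumes `accelerator` behaves like a Hugging Face `accelerate.Accelerator`, which provides `wait_for_everyone()` and `is_main_process`. the helper name `save_on_main_then_sync` and the `save_fn` callback are hypothetical, and where the waits actually go in this repo is not shown here.

```python
def save_on_main_then_sync(accelerator, save_fn):
    """Run save_fn on the main process only, with barriers on both sides.

    Hypothetical helper: `accelerator` is assumed to expose
    `wait_for_everyone()` and `is_main_process`, as accelerate.Accelerator does.
    """
    # Barrier 1: make sure every process has finished its current step
    # before the main process starts writing.
    accelerator.wait_for_everyone()

    if accelerator.is_main_process:
        save_fn()

    # Barrier 2: keep the other processes from running ahead (for example,
    # trying to read the checkpoint) before the write completes.
    accelerator.wait_for_everyone()
```

without the second barrier, non-main processes could continue while the main process is still saving.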
Based on https://arxiv.org/abs/1706.02677