jzhang38 / TinyLlama

The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
Apache License 2.0

A question on learning rate decay schedule #172

Closed · zyushun closed this 3 months ago

zyushun commented 3 months ago

For the learning rate decay schedule, why do you use "lr = get_lr(state["iter_num"]) if decay_lr else learning_rate"? Here, state["iter_num"] is the number of processed minibatches, not the number of training (optimizer) steps.

Going by the definition of the decay schedule, shouldn't we pass state["step_count"] instead of state["iter_num"]?

I am referring to line 213 in TinyLlama/pretrain/tinyllama.py.
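
To make the distinction concrete, here is a minimal sketch (not the actual TinyLlama code). It assumes a nanoGPT-style warmup-plus-cosine get_lr and a hypothetical gradient_accumulation_steps of 8; all schedule numbers are made up for illustration. iter_num advances once per processed minibatch, while step_count advances only when the optimizer actually steps, so get_lr(iter_num) traverses the schedule gradient_accumulation_steps times faster than get_lr(step_count) would.

```python
import math

# Illustrative numbers only; these are not the TinyLlama config values.
learning_rate = 4e-4
min_lr = 4e-5
warmup_iters = 2000
lr_decay_iters = 100_000
gradient_accumulation_steps = 8  # assumed accumulation factor

def get_lr(it):
    # Linear warmup followed by cosine decay, indexed by whichever counter is passed in.
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)

iter_num = 0    # +1 per processed minibatch
step_count = 0  # +1 per optimizer step

for _ in range(40_000):
    iter_num += 1
    if iter_num % gradient_accumulation_steps == 0:
        step_count += 1
    # Line 213 passes iter_num, so warmup finishes after warmup_iters minibatches;
    # passing step_count instead would stretch every phase by gradient_accumulation_steps.
    lr_from_iter_num = get_lr(iter_num)
    lr_from_step_count = get_lr(step_count)
```

With these toy numbers, get_lr(iter_num) finishes warmup after 2,000 minibatches, whereas get_lr(step_count) would take 16,000 minibatches to reach the same point.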

jzhang38 commented 3 months ago

Yeah, we indeed use iter_num, the minibatch count, to compute the learning rate.

I believe it has minimal effect on the training loss.
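
One plausible reading of the "minimal effect" (an interpretation, not confirmed by the author): if the warmup and decay horizons are themselves expressed in minibatch iterations, the schedule still sweeps the intended curve over the full run; compared with holding the learning rate fixed across each accumulation window, indexing by iter_num only lets it drift slightly within the window. Continuing the toy sketch above, that drift is tiny relative to the learning rate itself:

```python
# Within-window LR drift at the midpoint of the cosine decay in the sketch above.
mid = (warmup_iters + lr_decay_iters) // 2
drift = get_lr(mid) - get_lr(mid + gradient_accumulation_steps)
print(f"LR at midpoint: {get_lr(mid):.2e}, drift over one window: {drift:.2e}")
# With the toy numbers this prints roughly 2.2e-04 and 5e-08, i.e. a ~0.02% change.
```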