Open mattf1n opened 1 month ago
Tuning the learning rate for hidden-state inputs to reduce grad-norm spikes.

- Current LR: 2e-4
- To try: 1e-4, 2e-5
- Run on only 100,000 examples, divide warmup steps by 10 as well, and set eval steps to 2500
Wait for Llama-scale experiments
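The sweep described above can be sketched as a small config generator. This is a hedged sketch only: the base values (original example count, warmup steps, eval interval) are assumptions for illustration, since the issue only states the current LR and the new sweep settings.

```python
# Sketch of the LR sweep configs for the hidden-state-input runs.
# BASE values other than learning_rate are ASSUMED for illustration;
# only the 2e-4 current LR comes from the issue text.
BASE = {
    "learning_rate": 2e-4,
    "max_examples": 1_000_000,  # assumed original run size
    "warmup_steps": 1_000,      # assumed original warmup
    "eval_steps": 5_000,        # assumed original eval interval
}

def sweep_config(lr: float) -> dict:
    """Build one sweep run: shorter run, warmup / 10, eval every 2500 steps."""
    cfg = dict(BASE)
    cfg["learning_rate"] = lr
    cfg["max_examples"] = 100_000                      # only 100k examples
    cfg["warmup_steps"] = BASE["warmup_steps"] // 10   # warmup divided by 10
    cfg["eval_steps"] = 2_500                          # eval steps to 2500
    return cfg

# The two candidate learning rates from the issue.
configs = [sweep_config(lr) for lr in (1e-4, 2e-5)]
```

Each resulting dict can then be passed to whatever trainer the repo uses; the point is just that warmup shrinks proportionally with the run length so the LR schedule stays comparable.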