Hello, thank you for providing the implementation of the paper. When I ran the code, I found that the first call to optimizer.step() takes an extremely long time.
In my case, pretraining the llama_1b model on one A100 with batch_size == 1, the first optimizer.step() call took 70 seconds, but the time became normal (30 ms) after the first step. Is this caused by some tensor registration step?
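
For reference, here is a minimal sketch of how per-call timings like these could be measured. The names `time_calls` and `FakeStep` are hypothetical and not part of the repository; `FakeStep` only stands in for optimizer.step() so that the snippet runs without a GPU. In a real PyTorch run, I assume one would pass `optimizer.step` as `step_fn` and `torch.cuda.synchronize` as `sync_fn`, since CUDA kernels launch asynchronously and the clock would otherwise stop too early.

```python
import time


def time_calls(step_fn, num_calls, sync_fn=None):
    """Return the wall-clock duration in seconds of each call to step_fn."""
    durations = []
    for _ in range(num_calls):
        if sync_fn is not None:
            sync_fn()  # drain pending asynchronous work before starting the clock
        start = time.perf_counter()
        step_fn()
        if sync_fn is not None:
            sync_fn()  # wait for the work launched by step_fn to finish
        durations.append(time.perf_counter() - start)
    return durations


class FakeStep:
    """Hypothetical stand-in for optimizer.step(): slow on the first call only."""

    def __init__(self):
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls == 1:
            time.sleep(0.05)  # simulate one-time setup cost


durations = time_calls(FakeStep(), 3)
print([f"{d * 1000:.1f} ms" for d in durations])
```

Timing each call separately, rather than averaging over all steps, is what makes a one-time cost on the first call visible.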