ZeguanXiao opened this issue 1 week ago
Yes, that's just a sample script provided for running an experiment; for full reproducibility please use the paper's hyperparameters. What is the specific issue you are facing that led to this opinion?
My issue is that the training time is longer than the paper states. When I run the code on an L40 48G GPU, it takes 37.8 seconds to run 100 iterations, which corresponds to 100 × 4096 = 409,600 tokens. So I think 1B tokens × 8 epochs = 8B tokens will take much longer than 8 hours. Am I misunderstanding anything?
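For concreteness, a minimal sketch of the throughput arithmetic above (the 4096 tokens-per-iteration figure is the one quoted in this comment; the actual per-iteration token count depends on the script's batch size):

```python
# Observed L40 throughput, assuming 4096 tokens are processed per iteration
# (this per-iteration token count is taken from the comment above, not from the script).
seconds_per_100_iters = 37.8
tokens_per_iter = 4096

tokens_per_100_iters = 100 * tokens_per_iter               # 409,600 tokens
tokens_per_second = tokens_per_100_iters / seconds_per_100_iters
print(f"{tokens_per_second:,.0f} tokens/s")                # ~10,836 tokens/s
```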
The training time is a function of the GPUs (and other variables). I see you are using an L40 whereas we used an A6000; I guess this could be the reason. But I would still like to know your total GPU hours.
The estimated cost is (8,000,000,000 tokens / 409,600 tokens per 100 iterations) × 37.8 seconds / 3600 ≈ 205 GPU hours. Below is a screenshot that shows the running time.
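Written out as a small sketch under the same assumptions (409,600 tokens and 37.8 s per 100 iterations on a single L40):

```python
# Projected single-GPU wall-clock time for 8B tokens at the measured L40 rate.
total_tokens = 8_000_000_000            # 1B tokens x 8 epochs
tokens_per_100_iters = 409_600
seconds_per_100_iters = 37.8

gpu_hours = total_tokens / tokens_per_100_iters * seconds_per_100_iters / 3600
print(f"{gpu_hours:.0f} GPU hours")     # ~205 GPU hours
```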
Hi, @sanyalsunny111, I would appreciate it if you could rerun the example code and time it. I'm surprised at the large difference in GPU hours between L40 and A6000, and I wonder if I'm doing something wrong.
Hi @ZeguanXiao, sure, I am going to do that. But I am currently on leave from my university for a summer internship, so I don't have access to my lab resources (the A6000s I used). Once I go back I will rerun this.
The hyper-parameters in the training script are inconsistent with the paper, including the batch size, LR scheduler, and weight decay. The number of training epochs is also missing; the script uses max_iters instead. In my experience, it is hard to train a 1.5B model on 1B tokens in 8 hours. Is there something wrong in the code or the paper?
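For illustration, a sketch of how an epoch target could be mapped to `max_iters`; the `global_batch_size` and other values here are assumptions for the example, not the paper's or the script's settings:

```python
# Hypothetical helper: derive max_iters from a token budget instead of an epoch count.
# All values below are illustrative assumptions, not the paper's hyperparameters.
dataset_tokens = 1_000_000_000   # 1B-token corpus
epochs = 8                       # 8 passes -> 8B tokens total
seq_len = 4096                   # tokens per sequence
global_batch_size = 64           # sequences per optimizer step (assumed)

tokens_per_iter = seq_len * global_batch_size
max_iters = dataset_tokens * epochs // tokens_per_iter
print(max_iters)                 # ~30,517 iterations under these assumptions
```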