sanyalsunny111 / LLM-Inheritune

This is the official repository for Inheritune.
https://arxiv.org/abs/2404.08634

Hyper-parameters are inconsistent with Table 8 #4

Open ZeguanXiao opened 1 week ago

ZeguanXiao commented 1 week ago

The hyper-parameters in the training script are inconsistent with the paper, including the batch size, LR scheduler, and weight decay. The number of training epochs is also missing; there is only a max_iters setting. In my experience, it is hard to train a 1.5B model on 1B tokens in 8 hours. Is something wrong in the code or the paper?

[Screenshot 2024-06-24 14:05:08]
sanyalsunny111 commented 6 days ago

Yes, that's just a sample script provided for running an experiment; for full reproducibility please use the paper's hyperparameters. What specific issue are you facing that led to this opinion?

ZeguanXiao commented 6 days ago

My issue is that the training time is longer than the paper states. When I run the code on an L40 48GB GPU, it takes 37.8 seconds to run 100 iterations, which corresponds to 100 × 4096 = 409,600 tokens. So I think 1B tokens × 8 epochs = 8B tokens will take much longer than 8 hours to run. Am I misunderstanding anything?

sanyalsunny111 commented 5 days ago

The training time is a function of the GPUs (and other variables). I see you are using an L40 whereas we used an A6000; I guess this could be the reason. But I would still like to know your GPU hours.

ZeguanXiao commented 5 days ago

The estimated cost is 8,000,000,000 tokens / 409,600 tokens per 100 iterations × 37.8 seconds / 3600 ≈ 205 GPU hours. Below is a screenshot that shows the running time.

[Screenshot 2024-06-26 12:22:16 showing the measured iteration time]
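For reference, here is a minimal sketch of that estimate. The throughput and token counts are taken from this thread; the assumption of one 4,096-token sequence processed per iteration is mine, not something confirmed by the repo's training script.

```python
# Back-of-the-envelope GPU-hours estimate from the numbers reported above.
# Assumptions (from this thread, not from the official training config):
#   - 37.8 s per 100 iterations on a single L40
#   - 4,096 tokens processed per iteration
#   - 1B-token subset trained for 8 epochs = 8B tokens total

seconds_per_iter = 37.8 / 100      # measured wall-clock time per iteration
tokens_per_iter = 4096             # assumed tokens processed per iteration
total_tokens = 8_000_000_000       # 1B tokens x 8 epochs

total_iters = total_tokens / tokens_per_iter
gpu_hours = total_iters * seconds_per_iter / 3600
print(f"estimated cost: {gpu_hours:.0f} GPU hours")  # ~205 GPU hours
```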
ZeguanXiao commented 2 days ago

Hi, @sanyalsunny111, I would appreciate it if you could rerun the example code and time it. I'm surprised at the large difference in GPU hours between L40 and A6000, and I wonder if I'm doing something wrong.

sanyalsunny111 commented 2 days ago

Hi @ZeguanXiao, sure, I will do that. However, I am currently away from my university for a summer internship, so I don't have access to my lab resources (the A6000s I used). Once I'm back I will rerun this.