sangmichaelxie / doremi

PyTorch implementation of DoReMi, a method for optimizing data mixture weights in language modeling datasets
https://arxiv.org/abs/2305.10429
MIT License

step 1 baseline_280M loss large #1

Closed · gawei1995 closed 1 year ago

gawei1995 commented 1 year ago

The 280M baseline model loss is hovering around 5, with all training hyperparameters at their default values. In the preprocessing file, the sampler is set to 100k (10w) samples.


gawei1995 commented 1 year ago

I found the issue: the trainer reports 40M trainable parameters, not 280M. Why is that?


sangmichaelxie commented 1 year ago

The loss value around 5 is expected, and things should be working normally for you. The number of trainable parameters being smaller is due to sharding across the GPUs. If you try with 1 GPU it should say the full number.
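For context, a minimal sketch of why the per-GPU count looks smaller under sharding (assuming a standard PyTorch distributed setup; the exact sharding wrapper used in this repo may differ):

```python
import torch
import torch.distributed as dist

# Count trainable parameters visible to this process, the way training logs
# typically report them.
def count_trainable(model: torch.nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def report_param_count(model: torch.nn.Module) -> None:
    local = count_trainable(model)
    if dist.is_available() and dist.is_initialized():
        # Under FSDP/ZeRO-style sharding, each rank only holds roughly
        # 1/world_size of the flattened parameters, so the local count is
        # about total / num_gpus.
        world = dist.get_world_size()
        print(f"rank {dist.get_rank()}: ~{local / 1e6:.0f}M params held locally "
              f"(~{local * world / 1e6:.0f}M total across {world} ranks)")
    else:
        # On a single GPU the whole model lives in one process, so this should
        # match the advertised model size (~280M here).
        print(f"single process: ~{local / 1e6:.0f}M params")
```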

gawei1995 commented 1 year ago

> The loss value around 5 is expected, and things should be working normally for you. The number of trainable parameters being smaller is due to sharding across the GPUs. If you try with 1 GPU it should say the full number.

Thank you for replying, but when I use the official GPT-2 implementation instead of your gpt2fast version, the loss reaches 2.4, so I think it's a model architecture issue.
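For anyone reproducing this comparison, a hedged sketch of computing the loss with the stock Hugging Face GPT-2 implementation; the default `GPT2Config()` (124M architecture) and the sample text below are placeholders, not this repo's 280M config or data:

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

# Stock Hugging Face GPT-2 (no flash-attention fork), randomly initialized.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel(GPT2Config())

text = "DoReMi optimizes domain weights for language model pretraining."
batch = tokenizer(text, return_tensors="pt")

# For causal LM training, labels are the input ids; the model shifts them
# internally to compute next-token cross-entropy.
with torch.no_grad():
    loss = model(**batch, labels=batch["input_ids"]).loss
print(f"cross-entropy: {loss.item():.2f}")  # random init gives roughly ln(50257) ≈ 10.8
```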

gawei1995 commented 1 year ago

> The loss value around 5 is expected, and things should be working normally for you. The number of trainable parameters being smaller is due to sharding across the GPUs. If you try with 1 GPU it should say the full number.

Step 1 loss when using the official GPT-2 version:


sangmichaelxie commented 1 year ago

Actually, you're right, thanks for pointing this out. I've replaced the model in the current HEAD; you'll need to run `bash scripts/setup_flash.sh` before running `run_pile_baseline280M.sh` again. In preliminary tests the loss goes much lower than before.