jzhang38 / TinyLlama

The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
Apache License 2.0
7.61k stars 444 forks

Notes on chat fine-tuning and datacontent #55

Closed RonanKMcGovern closed 11 months ago

RonanKMcGovern commented 11 months ago

I adapted Tim Dettmers' filtered OpenAssistant dataset to use the Llama 2 prompt format (e.g. with [INST] tags), see here.
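To make the target format concrete, here is a minimal sketch of how a list of (user, assistant) turns could be rendered in the [INST] style shown in the examples in this thread. The function name `to_llama2_prompt` is hypothetical, the spacing follows the samples quoted here rather than any official template, and the actual conversion used for the dataset is not shown in this thread. In practice BOS/EOS are often added as token IDs by the tokenizer rather than as literal text.

```python
def to_llama2_prompt(turns, bos="<s>", eos="</s>"):
    """Render (user, assistant) pairs in a Llama-2-style [INST] format.

    Assumption: spacing mirrors the examples quoted in this thread.
    """
    parts = []
    for user, assistant in turns:
        # Each turn is wrapped in BOS ... EOS so the model sees an explicit end marker.
        parts.append(f"{bos} [INST] {user.strip()} [/INST] {assistant.strip()}{eos}")
    return "".join(parts)


if __name__ == "__main__":
    print(to_llama2_prompt([("What is 1+1?", "2")]))
```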

I then fine-tuned TinyLlama (using a full fine-tune of all LoRA modules) at the 1T token checkpoint, see here.

Observations: A. TinyLlama seems to have issues emitting an EOS token (</s>). For example:

<s> [INST] What planets are in our solar system? [/INST] 1. Mercury

2. Venus

3. Earth

4. Mars

5. Jupiter

6. Saturn

7. Uranus

8. Neptune

9. Pluto

10. Ceres

11. Callisto

12. ...

This leads me to wonder: are BOS and, particularly, EOS tokens (i.e. <s> and </s>) being used in pre-training?

B. I notice that when inferencing the raw 1T checkpoint (i.e. not chat fine-tuned), it is common to see ### in the response:

<s> [INST] Generate a python code snippet to add two numbers. [/INST] 

### [INST] Generate a python code snippet to add two numbers.

### [INST] Generate a python code snippet to add two numbers.

...

I'm somewhat surprised to see this '###'. Does this mean there are some chat fine-tuning or instruct fine-tuning datasets in the pre-training datasets?

jzhang38 commented 11 months ago

Thanks for the question.

BOS is used during pretraining to separate documents. In hindsight, I should have used EOS.
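A small sketch of why the choice of separator matters: if packed documents are separated only by BOS, the EOS ID never appears as a next-token target, so the model gets no pretraining signal for emitting it. The token IDs below follow the usual Llama tokenizer convention (BOS = 1, EOS = 2), which is an assumption here, and the helper names are hypothetical; this is not the project's actual data pipeline.

```python
BOS_ID = 1  # assumed: <s> in the Llama tokenizer
EOS_ID = 2  # assumed: </s> in the Llama tokenizer


def pack_with_bos(docs):
    """Concatenate tokenized documents, prefixing each with BOS."""
    stream = []
    for doc in docs:
        stream.append(BOS_ID)
        stream.extend(doc)
    return stream


def pack_with_eos(docs):
    """Concatenate tokenized documents, terminating each with EOS."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(EOS_ID)
    return stream


def count_as_target(stream, token_id):
    """Count how often token_id is a next-token prediction target.

    Position 0 has no preceding context, so it is never a target.
    """
    return sum(1 for tok in stream[1:] if tok == token_id)


if __name__ == "__main__":
    docs = [[10, 11, 12], [20, 21]]
    print(count_as_target(pack_with_bos(docs), EOS_ID))
    print(count_as_target(pack_with_eos(docs), EOS_ID))
```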

But I am surprised that the model has trouble emitting EOS if you added such tokens in the fine-tuning stage. Are you sure you calculated the loss on the EOS token?

While the pretraining data used is strictly Slimpajama and Starcodercode, I cannot say whether text similar to chat data or ### exists in those datasets.

RonanKMcGovern commented 11 months ago

Ok, it's interesting that you just used BOS. That probably explains why things are working for 7B but not TinyLlama, see below.

I think you may be right that my trainer may be setting the loss mask to zero for special tokens, so I need to take a look at that.
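For reference, a minimal sketch of what "loss on the EOS token" means at the label level. Labels set to -100 are skipped by PyTorch's `CrossEntropyLoss` (its default `ignore_index`), so if the final EOS label is masked, the model is never trained to end its response. The helper `build_labels` and the token IDs are hypothetical; whether any particular trainer masks EOS is not established here. One way it could happen, for instance, is when padding positions are masked and the pad token shares the EOS ID.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(prompt_ids, response_ids, eos_id, mask_eos=False):
    """Build input_ids and labels for one prompt/response pair.

    Prompt positions are ignored in the loss; the response and, unless
    mask_eos is True, the trailing EOS are trained on.
    """
    input_ids = prompt_ids + response_ids + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids + [eos_id]
    if mask_eos:
        # Simulates a collator that drops EOS from the loss.
        labels[-1] = IGNORE_INDEX
    return input_ids, labels


if __name__ == "__main__":
    ids, labels = build_labels([1, 5, 6], [7, 8], eos_id=2)
    print(ids, labels)
    # A quick sanity check on a real batch would be the same idea:
    # confirm the label at the EOS position is not IGNORE_INDEX.
    print(labels[-1] != IGNORE_INDEX)
```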

I've run follow-up fine-tuning on TinyLlama and Llama 7B and can confirm that 7B fine-tunes well using the SFTTrainer (which indeed may be setting the loss mask to zero on the EOS token):

<s> [INST] What planets are in our solar system? [/INST] 1. The Sun 2. Mercury 3. Venus 4. Earth 5. Mars 6. Jupiter 7. Saturn 8. Uranus 9. Neptune</s>

<s> [INST] What are the first five numbers in the Fibonacci series? [/INST] 1, 1, 2, 3, 5</s>

<s> [INST] Generate a python code snippet to add two numbers. [/INST] 

python
def add(a, b):
   return a + b
</s>

By contrast, TinyLlama does not emit </s>, which makes sense to me if it was not trained as such:

<s> [INST] What planets are in our solar system? [/INST] 1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune 9. Pluto 10. Ceres 11. Ceres 12. Ceres 13. Ceres 14. Ceres 15. Ceres 16. Ceres 17. Ceres 18. Ceres 19. Ceres 20. Ceres 21. Ceres 22. Ceres 23. Ceres 24. Ceres 25. Ceres 26. Ceres 27. Ceres 28. Ceres 29. Ceres 30. Ceres 31. Ceres 32. Ceres 33. Ceres 34. Ceres 35. Ceres

<s> [INST] What are the first five numbers in the Fibonacci series? [/INST] 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 888, 1397, 2196, 3495, 5594, 8993, 14492, 23391, 37780, 61073, 88862, 139751, 219642, 349533, 559424, 899325, 1449216, 2339107, 3778008, 6107309,

<s> [INST] Generate a python code snippet to add two numbers. [/INST] 
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
...
#
#
#
#

RonanKMcGovern commented 11 months ago

@jzhang38 I ran another fine-tune today, this time using a custom trainer where I know that the EOS token is included in attention and the loss calculation.

The results were the same - i.e. TinyLlama does not emit an EOS token.

My sense from here is that ChatML is then the best/only way to go, as it explicitly defines new tokens to serve as BOS and EOS.
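As an illustration, a minimal sketch of the ChatML layout, where each turn is wrapped in `<|im_start|>` and `<|im_end|>`. The function name `to_chatml` and its `add_generation_prompt` flag are hypothetical choices for this sketch. Using it for fine-tuning would also require adding both markers to the tokenizer as special tokens and giving them embeddings, which is not shown here.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def to_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts in ChatML layout."""
    text = "".join(
        f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages
    )
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply
        # and ends it with <|im_end|>.
        text += f"{IM_START}assistant\n"
    return text


if __name__ == "__main__":
    msgs = [{"role": "user", "content": "What planets are in our solar system?"}]
    print(to_chatml(msgs, add_generation_prompt=True))
```

At inference time, generation would then be stopped on `<|im_end|>` rather than on `</s>`.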