I have trained two models from scratch using StarCoderData. Both models have the same Transformer-Decoder architecture and parameters. The only differences between the two models are the tokenizer and vocabulary used. One model utilizes the Huggingface BPE Tokenizer with StarCoder Vocab, while the other model uses the tiktoken tokenizer (details provided below).
However, I have noticed that the two models yield different loss values during training:
Huggingface Tokenizer: The loss follows a normal pattern, decreasing from 5 to 1.09 as training progresses.
Tiktoken Tokenizer: The loss exhibits an unusual pattern, starting at 20 and then becoming negative infinity (-inf), NaN (not a number), or extremely small values such as 1e-21 and 1e-23.
Has anyone encountered a similar issue before, or could you offer some suggestions to help solve it? Thank you very much.
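For context on why the starting loss looks suspicious: at random initialization, a language model's cross-entropy loss should be roughly ln(vocab_size), and out-of-range token ids from a mismatched tokenizer/embedding pair are a common cause of NaN or -inf losses. Below is a minimal diagnostic sketch; the vocab size of 49152 and the `check_ids` helper are assumptions for illustration, not part of the original setup.

```python
import math

# Check 1: at random init, cross-entropy loss ≈ ln(vocab_size).
# For a ~49k vocab that is ~10.8, so a starting loss of ~20 hints that the
# loss is computed over the wrong number of classes or that labels/logits
# are misaligned.
vocab_size = 49152  # assumed StarCoder-style vocab size; adjust to your config
print(f"expected initial loss ≈ {math.log(vocab_size):.2f}")

# Check 2: every token id emitted by the tokenizer must be strictly less
# than the model's embedding size; out-of-range ids can silently corrupt
# training and produce NaN/-inf losses on some stacks.
def check_ids(token_ids, model_vocab_size):
    """Return any token ids outside [0, model_vocab_size); empty means safe."""
    return [t for t in token_ids if t < 0 or t >= model_vocab_size]

ids = [0, 5, 49151]                # hypothetical batch of token ids
print(check_ids(ids, vocab_size))  # → []
```

Running a check like this over a few batches from each tokenizer would quickly show whether the tiktoken ids ever exceed the embedding table size.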
Details of the Huggingface Tokenizer and tiktoken
Huggingface Tokenizer:
```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
```
Tiktoken Tokenizer:
```
import base64
import tiktoken


class Tokenizer:
    ENDOFTEXT = "<|endoftext|>"
    FIM_PREFIX = "<|fim_prefix|>"
    FIM_MIDDLE = "<|fim_middle|>"
    FIM_SUFFIX = "<|fim_suffix|>"
    FIM_PAD = "