openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

An unusual loss pattern when training a model from scratch #149

Closed · ericxsun closed this issue 1 year ago

ericxsun commented 1 year ago

I have trained two models from scratch on StarCoderData. Both models have the same Transformer-decoder architecture and parameters; the only differences between them are the tokenizer and the vocabulary. One model uses the Huggingface BPE tokenizer with the StarCoder vocab, while the other uses the tiktoken tokenizer (details provided below).

However, I have noticed that the two models yield different loss values during training:

[screenshot: training loss curves of the two models]

Has anyone encountered a similar problem before, or could you give me some suggestions to help solve this issue? Thank you very much.

Details: Huggingface tokenizer and tiktoken tokenizer

Huggingface tokenizer:

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
```

tiktoken tokenizer:

```
import base64

import tiktoken


class Tokenizer:
    ENDOFTEXT = "<|endoftext|>"
    FIM_PREFIX = "<|fim_prefix|>"
    FIM_MIDDLE = "<|fim_middle|>"
    FIM_SUFFIX = "<|fim_suffix|>"
    FIM_PAD = "<fim_pad>"
    FILENAME = "<filename>"
    GH_STARS = "<gh_stars>"
    ISSUE_START = "<issue_start>"
    ISSUE_COMMENT = "<issue_comment>"
    ISSUE_CLOSED = "<issue_closed>"
    JUPYTER_START = "<jupyter_start>"
    JUPYTER_TEXT = "<jupyter_text>"
    JUPYTER_CODE = "<jupyter_code>"
    JUPYTER_OUTPUT = "<jupyter_output>"
    EMPTY_OUTPUT = "<empty_output>"
    COMMIT_BEFORE = "<commit_before>"
    COMMIT_MSG = "<commit_msg>"
    COMMIT_AFTER = "<commit_after>"
    REPONAME = "<reponame>"
    ENDOFPROMPT = "<|endofprompt|>"

    def __init__(self, vocab_name):
        # download from https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken
        filename = "cl100k_base.tiktoken"
        ranks = {
            base64.b64decode(token): int(rank)
            for token, rank in (line.split() for line in open(filename) if line.strip())
        }
        special_tokens = {
            self.ENDOFTEXT: 100257,
            self.FIM_PREFIX: 100258,
            self.FIM_MIDDLE: 100259,
            self.FIM_SUFFIX: 100260,
            self.ENDOFPROMPT: 100276,
        }
        specials = [
            self.FIM_PAD, self.FILENAME, self.GH_STARS, self.ISSUE_START,
            self.ISSUE_COMMENT, self.ISSUE_CLOSED, self.JUPYTER_START,
            self.JUPYTER_TEXT, self.JUPYTER_CODE, self.JUPYTER_OUTPUT,
            self.EMPTY_OUTPUT, self.COMMIT_BEFORE, self.COMMIT_MSG,
            self.COMMIT_AFTER, self.REPONAME,
        ]
        # assign ids 100261..100275 to the StarCoder-style special tokens
        for token, token_id in zip(specials, range(100261, 100276)):
            special_tokens[token] = token_id
        self.tokenizer = tiktoken.Encoding(
            name="cl100k_base_enhanced",
            pat_str=r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+""",
            mergeable_ranks=ranks,
            special_tokens=special_tokens,
        )

    def encode(self, text):
        return self.tokenizer.encode(text, disallowed_special=())

    def decode(self, token_ids):
        return self.tokenizer.decode(token_ids)
```
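For illustration, a minimal sketch of how the two tokenizers could be compared on the same snippet (assuming the `Tokenizer` class above; the sample string is purely illustrative):

```
from transformers import AutoTokenizer

hf_tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
tk_tokenizer = Tokenizer("cl100k_base_enhanced")

sample = "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)\n"

# The two tokenizers segment the same text differently, so the two models
# are trained on different target sequences over different vocabularies.
print(len(hf_tokenizer(sample)["input_ids"]))  # token count under StarCoder BPE
print(len(tk_tokenizer.encode(sample)))        # token count under cl100k_base
print(len(hf_tokenizer))                       # StarCoder vocab size
print(tk_tokenizer.tokenizer.n_vocab)          # cl100k_base + specials vocab size
```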
hauntsaninja commented 1 year ago

Does your model have the right vocab size? Are you sure you ran the right ablation?
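A minimal sketch of that check (the `vocab_size` value is a hypothetical model setting; `Tokenizer` is the class from the issue above):

```
vocab_size = 100352  # hypothetical: the embedding size configured for the model

tok = Tokenizer("cl100k_base_enhanced")

# The embedding table (and output softmax) must cover every id the tokenizer
# can emit, including the special tokens at the top of the id range.
assert vocab_size > tok.tokenizer.max_token_value, (
    "embedding table too small for the tokenizer's id range"
)
```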

This isn't really meant to be a help forum; you'll probably get better help at an ML or Python Discord server.