I have been carefully debugging the loss function (for my own modifications) and found that the labels were skipping the first word. Tracing back through the code, the comment in datasets.py explicitly assumes that a BOS token will be added and its label ignored.
I'm using the default tokenizer/model as provided in the code:
xiuyul/mamba-2.8b-zephyr
GPTNeoXTokenizerFast
# We treat image patches as "tokens = [p1 p2 p3, ...]"; we need to specify ordering of text/patch tokens.
# => Critically, we find that inserting *after* the BOS token leads to the strongest performance!
# - input_ids = "<s> p1 p2 p3 ... <caption_text> \n"
# - labels = "IGNORE IGNORE ..." (copy `input_ids` replacing <s> and p{1...K} with IGNORE)
#
# IMPORTANT => IF WE'RE USING HF LLM.forward(... labels=labels), SHIFTING HAPPENS _INSIDE_ MODEL!
input_ids = self.tokenizer(caption, truncation=True, return_tensors="pt").input_ids[0]
labels = copy.deepcopy(input_ids)
# Set the <BOS> token's label to IGNORE_INDEX (since we're inserting the image patches right after)
print(labels[0]) # SHOULD PRINT 0
labels[0] = IGNORE_INDEX
But printing labels[0] before setting it does not give the expected bos_token_id (0 in our case); instead it prints the first word's token (often the word 'the'). Here's an example of what it prints across a few samples: 783, 1288, 783, 78, 69, 11707, 66. The finetuning version of the dataset loader prints all -100s, but I assume that's not correct either.
Can you confirm this is not expected behavior? I think your cached tokenizer is correctly adding BOS tokens, but the one I've downloaded from Hugging Face is not adding them.
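For reference, this is the quick check I ran outside the dataset code to see whether the tokenizer prepends a BOS token at all (the caption string here is just a made-up example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xiuyul/mamba-2.8b-zephyr")
print(tokenizer.bos_token, tokenizer.bos_token_id)  # BOS is defined in the tokenizer config
caption = "the quick brown fox"  # made-up example caption, not from the dataset
input_ids = tokenizer(caption, truncation=True, return_tensors="pt").input_ids[0]
print(input_ids[0].item())  # prints the first word's token id, not bos_token_id -> no BOS was prepended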
Thank you for your careful observation! We overlooked this issue because our training process does not include the pre-alignment phase. GPTNeoX's tokenizer does not prepend a BOS token, so there is no need to mask the first label. We have fixed this bug in the latest version of the code.
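For anyone reading along, the change is essentially to mask the first position only when the tokenizer actually prepends a BOS token. A rough sketch of the guarded version (not the exact diff, and the actual fix may simply drop the masking line) looks like this:

import copy

IGNORE_INDEX = -100  # same sentinel value used for masked label positions

input_ids = self.tokenizer(caption, truncation=True, return_tensors="pt").input_ids[0]
labels = copy.deepcopy(input_ids)
# Mask the first label only if a BOS token was actually prepended;
# GPTNeoXTokenizerFast does not prepend one, so nothing is masked here.
if self.tokenizer.bos_token_id is not None and input_ids[0] == self.tokenizer.bos_token_id:
    labels[0] = IGNORE_INDEX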