I have been carefully debugging the loss function (for my own modifications) and found that the labels were skipping the first word. Tracing back through the code, the comment in datasets.py explicitly assumes that a BOS token will be added and its label ignored.
I'm using the default tokenizer/model as provided in the code:
xiuyul/mamba-2.8b-zephyr
GPTNeoXTokenizerFast
# We treat image patches as "tokens = [p1 p2 p3, ...]"; we need to specify ordering of text/patch tokens.
# => Critically, we find that inserting *after* the BOS token leads to the strongest performance!
# - input_ids = "<s> p1 p2 p3 ... <caption_text> \n"
# - labels = "IGNORE IGNORE ..." (copy `input_ids` replacing <s> and p{1...K} with IGNORE)
#
# IMPORTANT => IF WE'RE USING HF LLM.forward(... labels=labels), SHIFTING HAPPENS _INSIDE_ MODEL!
input_ids = self.tokenizer(caption, truncation=True, return_tensors="pt").input_ids[0]
labels = copy.deepcopy(input_ids)
# Set the <BOS> token's label to IGNORE_INDEX (since we're inserting the image patches right after)
print(labels[0]) # SHOULD PRINT 0
labels[0] = IGNORE_INDEX
But printing labels[0] before setting it does not give the expected bos_token_id (0 in our case); instead it prints the first word's token (often the word 'the'). Here's an example of what it prints across a few samples: 783, 1288, 783, 78, 69, 11707, 66. The finetuning version of the dataset loader prints all -100s, but I assume that's not correct either.
Can you confirm this is not expected behavior? I think your cached tokenizer is correctly adding BOS tokens, but the one I've downloaded from Hugging Face is not adding them.
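For reference, this is the quick check I ran outside the dataset code to see whether the tokenizer prepends a BOS token at all (the caption string here is just a made-up example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xiuyul/mamba-2.8b-zephyr")
print(tokenizer.bos_token, tokenizer.bos_token_id)  # BOS is defined in the tokenizer config
caption = "the quick brown fox"  # made-up example caption, not from the dataset
input_ids = tokenizer(caption, truncation=True, return_tensors="pt").input_ids[0]
print(input_ids[0].item())  # prints the first word's token id, not bos_token_id -> no BOS was prepended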
Thank you for your careful observation! We overlooked this issue because our training process does not include the pre-alignment phase. GPTNeoX's tokenizer does not prepend a BOS token, so there is no need to mask the first label. We have fixed this bug in the latest version of the code.
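For anyone reading along, the change is essentially to mask the first position only when the tokenizer actually prepends a BOS token. A rough sketch of the guarded version (not the exact diff, and the actual fix may simply drop the masking line) looks like this:

import copy

IGNORE_INDEX = -100  # same sentinel value used for masked label positions

input_ids = self.tokenizer(caption, truncation=True, return_tensors="pt").input_ids[0]
labels = copy.deepcopy(input_ids)
# Mask the first label only if a BOS token was actually prepended;
# GPTNeoXTokenizerFast does not prepend one, so nothing is masked here.
if self.tokenizer.bos_token_id is not None and input_ids[0] == self.tokenizer.bos_token_id:
    labels[0] = IGNORE_INDEX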