h-zhao1997 / cobra

Cobra: Extending Mamba to Multi-modal Large Language Model for Efficient Inference

<BOS> token not added, loss calculation instead skips the first word #13

Closed mattolson93 closed 6 months ago

mattolson93 commented 6 months ago

I have been carefully debugging the loss function (for my own modifications) and found that the labels were masking the first word of the caption rather than the <BOS> token. Tracing back through the code, the comment in datasets.py explicitly assumes that a <BOS> token will be added and then ignored.

I'm using the default tokenizer/model as provided in the code:

        # We treat image patches as "tokens = [p1 p2 p3, ...]"; we need to specify ordering of text/patch tokens.
        #   => Critically, we find that inserting *after* the BOS token leads to the strongest performance!
        #       - input_ids = "<s> p1 p2 p3 ... <caption_text> \n"
        #       - labels = "IGNORE IGNORE ..." (copy `input_ids` replacing <s> and p{1...K} with IGNORE)
        #
        # IMPORTANT => IF WE'RE USING HF LLM.forward(... labels=labels), SHIFTING HAPPENS _INSIDE_ MODEL!
        input_ids = self.tokenizer(caption, truncation=True, return_tensors="pt").input_ids[0]
        labels = copy.deepcopy(input_ids)

        # Set the <BOS> token's label to IGNORE_INDEX (since we're inserting the image patches right after)
        print(labels[0]) # SHOULD PRINT 0
        labels[0] = IGNORE_INDEX

But printing labels[0] before it is overwritten does not give the expected bos_token_id (0 in our case); instead it prints the first word's token (often the word 'the'). Here's an example list of what it prints across a few samples: 783, 1288, 783, 78, 69, 11707, 66. The finetuning version of the dataset loader prints -100 every time, but I assume that's not correct either.
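
A minimal way to reproduce the check (the model ID below is a placeholder for whatever the default config actually loads; swap in yours):

    # Sanity check: does the tokenizer actually prepend a <BOS> token?
    # (Assumes a GPT-NeoX-style tokenizer from the Hugging Face Hub; substitute your model ID.)
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

    ids = tokenizer("a photo of the beach", return_tensors="pt").input_ids[0]
    print("bos_token_id:", tokenizer.bos_token_id)    # 0 in my case
    print("first input id:", ids[0].item())           # first word's token if no <BOS> is added
    print("bos prepended:", ids[0].item() == tokenizer.bos_token_id)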

Can you confirm whether this is expected behavior? I think your cached tokenizer correctly adds <BOS> tokens, but the ones I've downloaded from Hugging Face do not.

h-zhao1997 commented 6 months ago

Thank you for your careful observation! We overlooked this issue because our training process does not include the pre-alignment phase. GPTNeoX's tokenizer does not add a <BOS> token, so there is no need to mask the first label. We have fixed this bug in the latest version of the code.
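
For reference, the guard this implies looks roughly like the sketch below (not the literal patch in the repo; the tokenizer ID and IGNORE_INDEX value are placeholders):

    # Only mask the first label when the tokenizer actually prepends <BOS>;
    # otherwise the first word of the caption would be dropped from the loss.
    import copy
    from transformers import AutoTokenizer

    IGNORE_INDEX = -100  # standard ignore index for CrossEntropyLoss
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")  # placeholder model ID
    caption = "a photo of the beach"

    input_ids = tokenizer(caption, truncation=True, return_tensors="pt").input_ids[0]
    labels = copy.deepcopy(input_ids)

    if tokenizer.bos_token_id is not None and input_ids[0].item() == tokenizer.bos_token_id:
        labels[0] = IGNORE_INDEX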