huggingface / audio-transformers-course

The Hugging Face Course on Transformers for Audio
Apache License 2.0

Error in DataCollatorSpeechSeq2SeqWithPadding (Unit 5) #85

Open tonywu71 opened 1 year ago

tonywu71 commented 1 year ago

In Unit 5 of the audio course, the following code is used:

import torch
from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(
        self, features: List[Dict[str, Union[List[int], torch.Tensor]]]
    ) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [
            {"input_features": feature["input_features"][0]} for feature in features
        ]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100
        )

        # if bos token is appended in previous tokenization step,
        # cut bos token here as it's appended later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch
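
For reference, the collator is instantiated with the course's WhisperProcessor and handed to Seq2SeqTrainer. Here is a minimal sketch; the checkpoint name and the commented trainer arguments are placeholders, not the exact notebook code:

from transformers import WhisperProcessor

# assumed checkpoint; any Whisper checkpoint used in the course would do
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="english", task="transcribe"
)
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

# The collator is then passed to Seq2SeqTrainer via its data_collator argument, e.g.:
# trainer = Seq2SeqTrainer(
#     model=model,                  # WhisperForConditionalGeneration
#     args=training_args,           # Seq2SeqTrainingArguments
#     train_dataset=train_dataset,  # dataset already mapped to input_features / labels
#     data_collator=data_collator,
#     tokenizer=processor,
# )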

However, according to the following issue, bos_token_id shouldn't be used (@ArthurZucker). In my opinion, this should be replaced with self.processor.tokenizer.convert_tokens_to_ids("<|startoftranscript|>") or with model.config.decoder_start_token_id. What do you think?
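
Concretely, the replacement I have in mind for the bos_token_id check inside __call__ would look something like this (just a sketch of the proposed change, not the current course code):

        # use the actual start-of-transcript id instead of bos_token_id
        start_token_id = self.processor.tokenizer.convert_tokens_to_ids("<|startoftranscript|>")
        # (equivalently, model.config.decoder_start_token_id could be passed into the collator and used here)
        if (labels[:, 0] == start_token_id).all().cpu().item():
            labels = labels[:, 1:]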

Note that if this is true, there would be a similar error in @sanchit-gandhi's fine-tuning tutorial too.

Thanks for your attention.

Regards, Tony

ArthurZucker commented 1 year ago

If self.processor.tokenizer.bos_token_id is correctly set (it should not be used in the sense that it is not used: if forced_decoder_ids is set, it will be taken instead of this one), then I don't really see a problem.

tonywu71 commented 1 year ago

If you look at the issue I mentioned, it seems that you stated the opposite. Here's what you wrote for context:

I'll update the documentation to make it less confusing. The token used to store the "<|startoftranscript|>" token is decoder_start_token_id. The bos_token is pretty much unused, which is why it was set to the same as eos_token.

I might be wrong, but forced_decoder_ids is only used with generate, not with forward. And I checked the forward source code: with the current collator, batch["labels"] would always start with <|startoftranscript|>. However, a redundant <|startoftranscript|> would also be prepended when calling forward with the labels argument while decoder_input_ids is None, since the labels are shifted right with decoder_start_token_id to build the decoder inputs (see source code). Let me know if you think I'm wrong.
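
To make the redundancy concrete, here is a small sketch, assuming openai/whisper-small and the shift_tokens_right helper in transformers.models.whisper.modeling_whisper (the helper forward uses to build decoder_input_ids from labels when decoder_input_ids is None):

import torch
from transformers import WhisperTokenizer
from transformers.models.whisper.modeling_whisper import shift_tokens_right

tokenizer = WhisperTokenizer.from_pretrained(
    "openai/whisper-small", language="english", task="transcribe"
)

# the tokenized labels already start with <|startoftranscript|> (then language/task tokens)
labels = torch.tensor([tokenizer("hello world").input_ids])
print(tokenizer.convert_ids_to_tokens(labels[0].tolist())[:2])
# ['<|startoftranscript|>', '<|en|>']

# shifting the labels to build decoder_input_ids prepends
# decoder_start_token_id (= <|startoftranscript|>) a second time
start_id = tokenizer.convert_tokens_to_ids("<|startoftranscript|>")
# eos used as the pad id here just for the sketch (no padding in this toy batch)
decoder_input_ids = shift_tokens_right(labels, tokenizer.eos_token_id, start_id)
print(tokenizer.convert_ids_to_tokens(decoder_input_ids[0].tolist())[:2])
# ['<|startoftranscript|>', '<|startoftranscript|>']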

MKhalusova commented 1 year ago

cc @sanchit-gandhi