huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.
MIT License

Why do we need to tokenize file_id? #82

Closed macabdul9 closed 3 months ago

macabdul9 commented 4 months ago

Here

record the id of the sample as token ids

batch["file_id"] = tokenizer(batch[id_column_name], add_special_tokens=False).input_ids

In the data preparation for pseudo-labelling:

def prepare_dataset(batch):
    # process audio
    sample = batch[audio_column_name]
    inputs = feature_extractor(sample["array"], sampling_rate=sample["sampling_rate"])
    # process audio length
    batch[model_input_name] = inputs.get(model_input_name)[0]

    # process targets
    input_str = batch[text_column_name]
    batch["labels"] = tokenizer(input_str, max_length=max_label_length, truncation=True).input_ids

    # record the id of the sample as token ids
    batch["file_id"] = tokenizer(batch[id_column_name], add_special_tokens=False).input_ids
    return batch
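For context on what tokenizing the id buys: data collators stack fixed-shape integer tensors, not Python strings, so a string id cannot travel through batching as-is. Encoding it to token ids lets it ride along with the features and be decoded back to the original string after the batch is processed. The sketch below illustrates the round trip with a toy byte-level stand-in for the real Whisper tokenizer (the function names `encode_id`, `decode_id`, and `collate` are illustrative, not from the repo):

```python
# Toy illustration of the pattern: a string id is encoded to integer
# token ids, padded so it can be stacked into a batch tensor, and
# decoded back to the original string afterwards.

def encode_id(file_id: str) -> list[int]:
    # stand-in for tokenizer(file_id, add_special_tokens=False).input_ids
    return list(file_id.encode("utf-8"))

def decode_id(token_ids: list[int], pad_id: int = 0) -> str:
    # stand-in for tokenizer.batch_decode(..., skip_special_tokens=True):
    # drop padding, recover the original string
    return bytes(t for t in token_ids if t != pad_id).decode("utf-8")

def collate(batch_ids: list[list[int]], pad_id: int = 0) -> list[list[int]]:
    # pad every id sequence to the same length so the batch can be
    # stacked into one rectangular tensor
    width = max(len(ids) for ids in batch_ids)
    return [ids + [pad_id] * (width - len(ids)) for ids in batch_ids]

encoded = [encode_id("clip_0001"), encode_id("clip_42")]
padded = collate(encoded)
recovered = [decode_id(ids) for ids in padded]
# the round trip recovers the original string ids
```

The same idea applies with the real tokenizer: `add_special_tokens=False` keeps the encoded id free of BOS/EOS markers, so decoding yields exactly the original id string.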
sanchit-gandhi commented 3 months ago

Fixed in #101!