Closed (macabdul9 closed this issue 3 months ago)
In the data preparation for pseudo labelling, here:
    def prepare_dataset(batch):
        # process audio
        sample = batch[audio_column_name]
        inputs = feature_extractor(sample["array"], sampling_rate=sample["sampling_rate"])
        # process audio length
        batch[model_input_name] = inputs.get(model_input_name)[0]

        # process targets
        input_str = batch[text_column_name]
        batch["labels"] = tokenizer(input_str, max_length=max_label_length, truncation=True).input_ids

        # record the id of the sample as token ids
        batch["file_id"] = tokenizer(batch[id_column_name], add_special_tokens=False).input_ids
        return batch
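
For anyone reading along, here is a minimal sketch of what the last step of `prepare_dataset` does: it stores a string id as a list of integers and can later recover the string by decoding. `ToyTokenizer`, `Encoding`, the special-token ids, and the id `sample-0042` are hypothetical stand-ins so the snippet runs without downloading a model. The real tokenizer uses a learned vocabulary and may not round-trip every string exactly.

```python
from dataclasses import dataclass
from typing import List

# Made-up special-token ids for this toy example (assumption, not real values).
BOS_ID, EOS_ID = 256, 257


@dataclass
class Encoding:
    """Mimics the `.input_ids` attribute used in the snippet above."""
    input_ids: List[int]


class ToyTokenizer:
    """Hypothetical byte-level stand-in for the real tokenizer."""

    def __call__(self, text: str, add_special_tokens: bool = True) -> Encoding:
        # One token id per UTF-8 byte of the input string.
        ids = list(text.encode("utf-8"))
        if add_special_tokens:
            ids = [BOS_ID] + ids + [EOS_ID]
        return Encoding(input_ids=ids)

    def decode(self, ids: List[int]) -> str:
        # Drop the made-up special tokens, then turn the bytes back into text.
        return bytes(i for i in ids if i < 256).decode("utf-8")


tokenizer = ToyTokenizer()
batch = {"id": "sample-0042"}

# Same pattern as in prepare_dataset: no special tokens around the id.
batch["file_id"] = tokenizer(batch["id"], add_special_tokens=False).input_ids

# Decoding the stored token ids gives the original id back.
recovered = tokenizer.decode(batch["file_id"])
print(batch["file_id"], recovered)
```

With `add_special_tokens=False`, the stored list contains only the tokens of the id itself, so nothing has to be stripped before comparing the decoded id with the original.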
Fixed in #101!