huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.
MIT License

Pseudo-labelling librispeech_asr (train.360): KeyError `train-360` when not streaming. #96

Open guynich opened 3 months ago

guynich commented 3 months ago

When not streaming, this line raises a KeyError for `train-360`. As a result, the pseudo-labelled dataset was not saved after hours of compute.

I think this KeyError might be caused by this code line that changes the split name.

My bash script uses the librispeech_asr split name `train.360` as defined here.

accelerate launch distil-whisper/training/run_pseudo_labelling.py \
  --model_name_or_path "openai/whisper-large-v2" \
  --dataset_name "librispeech_asr" \
  --dataset_config_name "clean" \
  --dataset_split_name "train.360+validation+test" \
  --text_column_name "text" \
  --id_column_name "id" \
  --output_dir "./datasets_distil_whisper/librispeech_asr_clean_en_medium_en_pseudo_labelled" \
  --per_device_eval_batch_size 64 \
  --dtype "bfloat16" \
  --dataloader_num_workers 16 \
  --preprocessing_num_workers 16 \
  --logging_steps 2000 \
  --max_label_length 128 \
  --task "transcribe" \
  --return_timestamps \
  --attn_type "flash_attn" \
  --streaming False \
  --generation_num_beams 1 \
  --decode_token_ids False \
  --push_to_hub False

guynich commented 3 months ago

I commented out that code line and my bash script ran to completion.

e.g.:

# make the split name pretty for librispeech etc
# split = split.replace(".", "-").split("/")[-1]
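
A minimal sketch of why that rename could produce the KeyError, and one possible alternative to commenting the line out. This is not the code from run_pseudo_labelling.py: the plain dict `raw_datasets` stands in for the loaded splits, and the helper names `pretty_split_name`, `lookup_with_renamed_key` and `lookup_with_original_key` are hypothetical. It assumes the splits are stored under their original names (e.g. `train.360`) and looked up again after the rename.

```python
def pretty_split_name(split: str) -> str:
    # Same transformation as the line quoted above.
    return split.replace(".", "-").split("/")[-1]


# Assumption: a plain dict standing in for the loaded dataset splits,
# keyed by the original split names passed on the command line.
raw_datasets = {
    "train.360": ["sample"],
    "validation": ["sample"],
    "test": ["sample"],
}


def lookup_with_renamed_key(split: str):
    # Overwrites `split` with the pretty name, then uses it as the key.
    # "train.360" becomes "train-360", which is not a key in raw_datasets.
    split = pretty_split_name(split)
    return raw_datasets[split]


def lookup_with_original_key(split: str):
    # Keeps the original name for the lookup and uses the pretty name
    # only for output purposes such as file or folder names.
    output_name = pretty_split_name(split)
    return output_name, raw_datasets[split]


if __name__ == "__main__":
    try:
        lookup_with_renamed_key("train.360")
    except KeyError as err:
        print("KeyError:", err)

    print(lookup_with_original_key("train.360"))
```

Split names without a `.` or `/`, such as `validation` and `test`, are unchanged by the rename, which would explain why only `train.360` fails.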