sanchit-gandhi / seq2seq-speech

Repository for fine-tuning Transformers 🤗 based seq2seq speech models in JAX/Flax.

Make CTC tokenizer `do_lower_case` attribute water-tight #25

Closed sanchit-gandhi closed 2 years ago

sanchit-gandhi commented 2 years ago
  1. When the tokenizer is created, set its `do_lower_case` attribute to the value of `do_upper` (True/False): https://github.com/sanchit-gandhi/seq2seq-speech/blob/0ff54665154a476bcd741603250453709cc480c1/get_ctc_tokenizer.py#L268
  2. When instantiating the tokenizer in the train script, do not pass `do_lower_case`, so that the attribute keeps the bool value assigned when the tokenizer was created.
  3. Add an if statement that checks `tokenizer.do_lower_case` is set correctly.
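
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code from this repository: `StubTokenizer`, `create_tokenizer`, `load_tokenizer_for_training` and the `expected_do_lower_case` argument are all hypothetical names, and the stub only mimics the save/load round trip of a tokenizer config so that the example runs without any dependencies. The check in step 3 is one possible reading of "set correctly", assuming the train script knows which value it expects.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class StubTokenizer:
    """Hypothetical stand-in for the CTC tokenizer; it models only the flag discussed here."""

    do_lower_case: bool = False

    def save_pretrained(self, save_dir):
        # Persist the flag so that later loads pick it up from the config file.
        config_path = Path(save_dir) / "tokenizer_config.json"
        config_path.write_text(json.dumps({"do_lower_case": self.do_lower_case}))

    @classmethod
    def from_pretrained(cls, save_dir, **overrides):
        config = json.loads((Path(save_dir) / "tokenizer_config.json").read_text())
        # Explicit keyword arguments win over the stored config, which is
        # why step 2 avoids passing do_lower_case at load time.
        config.update(overrides)
        return cls(**config)


def create_tokenizer(save_dir, do_upper):
    # Step 1: do_lower_case takes the value of do_upper at creation time.
    tokenizer = StubTokenizer(do_lower_case=do_upper)
    tokenizer.save_pretrained(save_dir)
    return tokenizer


def load_tokenizer_for_training(save_dir, expected_do_lower_case=None):
    # Step 2: load without specifying do_lower_case, so the stored value is used.
    tokenizer = StubTokenizer.from_pretrained(save_dir)

    # Step 3: an if statement guarding against a mismatched flag (assumed form).
    if expected_do_lower_case is not None:
        if tokenizer.do_lower_case != expected_do_lower_case:
            raise ValueError(
                f"tokenizer.do_lower_case is {tokenizer.do_lower_case}, "
                f"expected {expected_do_lower_case}"
            )
    return tokenizer
```

With this layout the flag has a single source of truth, the saved tokenizer config, and the train script only verifies it rather than setting it a second time.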

The Wav2Vec2 Librispeech tokenizer config has been updated accordingly: https://huggingface.co/speech-seq2seq/flax-wav2vec2-large-lv60-scan/commit/e9904676455f659b34ce9bb5f3c6f1c64eb4bcf3