mesolitica / malaya-speech

Speech Toolkit for Malaysian language, https://malaya-speech.readthedocs.io/
MIT License

parse dataset #33

Closed huseinzol05 closed 1 year ago

huseinzol05 commented 1 year ago

Directory: https://github.com/huseinzol05/malaya-speech/tree/master/data/semisupervised-whisper-stt

The technique is very simple:

  1. Filter audio samples by detected language, keeping only {en, ms}.
  2. Calculate the logprob score for en and ms only, and choose the best.
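
A minimal sketch of the two steps, for illustration only; the actual code is in the notebook linked in this issue. It assumes each audio sample already has a dict of per-language detection probabilities and one candidate transcript per language with an average log-probability, which is one reading of "choose the best". The names `pick_best`, `lang_probs`, `candidates` and the dict layout are hypothetical and are not taken from the notebook.

```python
from typing import Dict, Optional, Tuple

# Only these languages are kept, as in step 1.
ALLOWED_LANGUAGES = {"en", "ms"}


def pick_best(
    lang_probs: Dict[str, float],
    candidates: Dict[str, dict],
) -> Optional[Tuple[str, str, float]]:
    """Return (language, text, avg_logprob), or None if the sample is filtered out.

    lang_probs: hypothetical language -> detection probability mapping.
    candidates: hypothetical language -> {"text": str, "avg_logprob": float} mapping.
    """
    # Step 1: drop the sample unless the most likely language is en or ms.
    detected = max(lang_probs, key=lang_probs.get)
    if detected not in ALLOWED_LANGUAGES:
        return None

    # Step 2: compare logprob scores for en and ms only, keep the highest.
    scored = {
        lang: cand for lang, cand in candidates.items()
        if lang in ALLOWED_LANGUAGES
    }
    if not scored:
        return None
    best = max(scored, key=lambda lang: scored[lang]["avg_logprob"])
    return best, scored[best]["text"], scored[best]["avg_logprob"]


if __name__ == "__main__":
    # Made-up values, purely to show the call shape.
    sample_probs = {"en": 0.2, "ms": 0.7, "id": 0.1}
    sample_candidates = {
        "en": {"text": "example english transcript", "avg_logprob": -0.9},
        "ms": {"text": "contoh transkrip melayu", "avg_logprob": -0.3},
    }
    print(pick_best(sample_probs, sample_candidates))
```

Comparing the average log-probability rather than the summed one avoids favouring shorter transcripts, but that choice is an assumption of this sketch.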

Read more at https://github.com/huseinzol05/malaya-speech/blob/master/data/semisupervised-whisper-stt/prepare-whisper-stt-part1.ipynb

The dataset is uploaded at https://huggingface.co/datasets/mesolitica/semisupervised-whisper-stt