Hi, I have a text where the audio includes numbers (e.g. 16, 29, 32). whisperx loads the audio and transcribes it perfectly, but when I run the word alignment I hit an issue: the numbers are split out as separate words, and they end up with empty start time and end time values. For the wav2vec models I tried, the metadata only includes non-numerical characters [a-z], so I assume the digits cannot be matched.
Has anyone had a similar issue, and does anyone know of an English wav2vec model (from Hugging Face) that would solve it?
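To illustrate what I think is happening, here is a minimal sketch in plain Python. It is not whisperx's actual code: the function name `unalignable_words` and the character set `ASSUMED_VOCAB` are my own assumptions, standing in for whatever characters the alignment model's metadata really contains.

```python
import string

# ASSUMPTION: the alignment model only knows lowercase letters and the
# apostrophe. Check the real model's vocabulary before relying on this.
ASSUMED_VOCAB = set(string.ascii_lowercase) | {"'"}


def unalignable_words(transcript, vocab=ASSUMED_VOCAB):
    """Return the words that contain no character from `vocab`.

    Such words have nothing the aligner could match against the audio,
    which would explain the missing start and end times.
    """
    missing = []
    for word in transcript.split():
        known_chars = [c for c in word.lower() if c in vocab]
        if not known_chars:
            missing.append(word)
    return missing


print(unalignable_words("I counted 16 apples, 29 pears and 32 plums"))
# → ['16', '29', '32']
```

If that is indeed the cause, one possible workaround might be to spell the numbers out (e.g. "sixteen") before alignment, but I have not tested this.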
Thanks in advance for any help,