ivrit-ai / ivrit.ai

ivrit.ai codebase
MIT License

improved text cleaning #34

Open MetroCat69 opened 5 months ago

MetroCat69 commented 5 months ago

To use the old text cleaning function, switch normalizer = clean_text to normalizer = whisper.normalizers.BasicTextNormalizer() in the function evaluate_model.

yairl commented 5 months ago

> To use the old text cleaning function, switch normalizer = clean_text to normalizer = whisper.normalizers.BasicTextNormalizer() in the function evaluate_model.

  1. We chose Whisper's BasicTextNormalizer to have an apples-to-apples comparison when analyzing Whisper's results on our datasets. At a minimum, it makes sense to keep both evaluation methods, having BasicTextNormalizer as the default but allowing a switch to the second method.

  2. What is the actual difference between both methods? Which cases did you find where BasicTextNormalizer was subpar? It would be good to know what type of improvements we're expecting.
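
One way to keep both evaluation methods, sketched under assumptions: the names resolve_normalizer and normalize_pair are hypothetical and do not come from the ivrit.ai codebase. The sketch falls back to Whisper's BasicTextNormalizer when no normalizer is passed, and imports it only in that case, so a custom cleaner can be used without the whisper package installed.

```python
from typing import Callable, Optional, Tuple

# A normalizer is any callable that maps raw text to cleaned text.
Normalizer = Callable[[str], str]


def resolve_normalizer(custom: Optional[Normalizer] = None) -> Normalizer:
    """Return `custom` if given, otherwise Whisper's BasicTextNormalizer.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if custom is not None:
        return custom
    # Imported lazily so the default path is the only one needing whisper.
    from whisper.normalizers import BasicTextNormalizer

    return BasicTextNormalizer()


def normalize_pair(
    reference: str, hypothesis: str, normalizer: Optional[Normalizer] = None
) -> Tuple[str, str]:
    """Apply the same normalizer to both texts before they are scored."""
    norm = resolve_normalizer(normalizer)
    return norm(reference), norm(hypothesis)
```

With this shape, an evaluation function could accept an optional normalizer argument: leaving it unset keeps the apples-to-apples default, and passing clean_text switches to the second method.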

MetroCat69 commented 5 months ago

  1. The improved cleaner removes punctuation and collapses multiple spaces.
  2. One case is when the two texts differ only in punctuation. For example, with the old algorithm, "I love trains!" differs from "I love trains." and you get a WER of 0.33. With the improved text cleaner, the WER is 0.
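
The arithmetic in the example can be checked with a small sketch. The clean_text below is a hypothetical stand-in that only strips punctuation and collapses whitespace, not the function from this pull request, and word_error_rate is a plain word-level edit distance written for illustration. The uncleaned comparison uses raw whitespace-split tokens and does not involve Whisper's normalizer.

```python
import re


def clean_text(text: str) -> str:
    """Hypothetical cleaner: drop punctuation, collapse runs of whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", text)  # removes every non-word, non-space character
    return re.sub(r"\s+", " ", no_punct).strip()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


reference = "I love trains!"
hypothesis = "I love trains."

# Raw tokens: "trains!" and "trains." differ, so 1 substitution out of 3 words.
raw_wer = word_error_rate(reference, hypothesis)

# After cleaning, both strings become "I love trains", so no edits are needed.
cleaned_wer = word_error_rate(clean_text(reference), clean_text(hypothesis))
```

Here raw_wer comes out to one third (one substituted word out of three) and cleaned_wer to zero, matching the figures quoted above.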

MetroCat69 commented 5 months ago

Maybe we should run one test with the original text cleaner and one test with the improved text cleaner.