Inverse text normalization of numbers etc.

The restorer in speechbox is very good at restoring punctuation and capitalization in sentences that contain no spelled-out numbers.

However, CTC-based ASR models like Wav2Vec2 spell numbers out, because their vocabulary usually contains only alphabetical characters. To use the restorer on the output of a Wav2Vec2 model, one would therefore first need to apply inverse text normalization to the numerals, e.g. with language-specific FSTs.

I wonder whether the output of Whisper (which is already alphanumeric) could be used within this method to do inverse normalization at the same time as punctuation and capitalization restoration. That would create a universal post-processor for CTC models that supports all languages included in Whisper.

Here are some examples (in Danish), using the Whisper-medium multilingual model:

An audio clip saying "seksogtres tusind fem hundrede og enoghalvfjerds" (66.571) produces the following output with the restorer: SEKSOGTRES TUSIND' FEM HUNDREDE OG enoghalvfjerds

An audio clip saying "otteogtres tusind tre hundrede og toogtyve" (68.322) becomes: Otteogtres-tusind-trehundredeog-toogtyve

How feasible would it be to have inverse text normalization (ITN) as a first step in the method?
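
To make the question concrete, here is a minimal sketch of what an ITN pre-step could look like for the two Danish phrases in this issue. It is a plain rule-based parser, not an FST, and it is not part of speechbox or Whisper. The lookup tables and the function names `danish_to_int` and `format_danish` are hypothetical, and the sketch assumes space-separated input and covers only numbers below one million.

```python
# Hypothetical lookup tables, written by hand for this sketch.
UNITS = {
    "nul": 0, "en": 1, "et": 1, "to": 2, "tre": 3, "fire": 4, "fem": 5,
    "seks": 6, "syv": 7, "otte": 8, "ni": 9, "ti": 10, "elleve": 11,
    "tolv": 12, "tretten": 13, "fjorten": 14, "femten": 15, "seksten": 16,
    "sytten": 17, "atten": 18, "nitten": 19,
}
TENS = {
    "tyve": 20, "tredive": 30, "fyrre": 40, "halvtreds": 50, "tres": 60,
    "halvfjerds": 70, "firs": 80, "halvfems": 90,
}


def danish_to_int(phrase: str) -> int:
    """Convert a spelled-out Danish number (below one million) to an int."""
    total = 0    # value of the completed thousands group
    current = 0  # value of the group currently being built (below 1000)
    for word in phrase.lower().split():
        if word == "og":
            continue  # standalone connective, e.g. "hundrede og ..."
        if word == "hundrede":
            current = max(current, 1) * 100
        elif word == "tusind":
            total += max(current, 1) * 1000
            current = 0
        elif word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        else:
            # Compound of the form <unit>og<tens>, e.g. "seksogtres" = 6 + 60.
            unit, sep, tens = word.partition("og")
            if not sep or unit not in UNITS or tens not in TENS:
                raise ValueError(f"unsupported number word: {word!r}")
            current += UNITS[unit] + TENS[tens]
    return total + current


def format_danish(number: int) -> str:
    """Render an int with '.' as the thousands separator, as in '66.571'."""
    return f"{number:,}".replace(",", ".")


if __name__ == "__main__":
    for phrase in (
        "seksogtres tusind fem hundrede og enoghalvfjerds",
        "otteogtres tusind tre hundrede og toogtyve",
    ):
        print(phrase, "->", format_danish(danish_to_int(phrase)))
```

A step like this would have to run on the CTC transcript before the restorer sees it, and it would need a separate grammar per language, which is the cost the Whisper-based idea above is trying to avoid.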
