Capitalization on output text

daniel-dona commented 1 year ago

Maybe this is a silly question, but why is the output of the pre-trained models always uppercase?

Is this some limitation/optimization or just the way the models were trained?

csukuangfj commented 1 year ago

> why is the output of the pre-trained models always uppercase?

Not always. You keep getting uppercase output because the models you are using were all trained to output uppercase.

During training, we normalize the transcripts to a single case, either all uppercase or all lowercase. So during inference, a model trained on uppercase text outputs uppercase, and a model trained on lowercase text outputs lowercase.
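For illustration only, here is a minimal sketch of this kind of case normalization. The helper `normalize_transcript` is a hypothetical name, not a function from sherpa or any training recipe; real recipes may also strip punctuation or handle other details.

```python
def normalize_transcript(text: str, to_upper: bool = True) -> str:
    """Fold a transcript to a single case (hypothetical helper for illustration)."""
    # Collapse runs of whitespace so the transcript is a clean, single-spaced string.
    text = " ".join(text.split())
    # Fold everything to one case; the model can then only learn that case.
    return text.upper() if to_upper else text.lower()


if __name__ == "__main__":
    print(normalize_transcript("Hello  world"))
    print(normalize_transcript("Hello  world", to_upper=False))
```

A model trained only on transcripts processed this way never sees the other case, which is why its output is in a single case.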

You can check this by looking at the model's tokens.txt. If every token in it is uppercase, then the output will also be all uppercase.
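If you want to automate that check, something like the sketch below could work. It assumes that each line of tokens.txt has the form `token id` and that special symbols are wrapped in angle brackets (for example `<blk>`); both are assumptions about the file layout, and `token_case` is a hypothetical helper, not part of sherpa.

```python
from pathlib import Path
from typing import Iterable


def token_case(lines: Iterable[str]) -> str:
    """Return 'upper', 'lower', 'mixed', or 'none' for the tokens in the given lines."""
    has_upper = False
    has_lower = False
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        token = fields[0]  # assumed layout: "<token> <id>"
        # Assumed convention: special symbols look like <blk>, <unk>, ...
        if token.startswith("<") and token.endswith(">"):
            continue
        has_upper = has_upper or any(ch.isupper() for ch in token)
        has_lower = has_lower or any(ch.islower() for ch in token)
    if has_upper and has_lower:
        return "mixed"
    if has_upper:
        return "upper"
    if has_lower:
        return "lower"
    return "none"  # no cased characters found


def token_case_of_file(path: str) -> str:
    """Convenience wrapper that reads a tokens.txt file from disk."""
    return token_case(Path(path).read_text(encoding="utf-8").splitlines())


if __name__ == "__main__":
    # Made-up sample lines, not taken from any real model.
    print(token_case(["<blk> 0", "▁THE 1", "S 2"]))
```

For a real model you would call `token_case_of_file("tokens.txt")` with the path to that model's token file.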

If you don't normalize your transcripts during training, then you will get both lowercase and uppercase output during inference.

daniel-dona commented 1 year ago

That makes sense, thank you @csukuangfj.

k2-fsa / sherpa

Capitalization on output text #395