This is a placeholder pull request. I will clean it up and amend it as appropriate once we have the new voice is-steinn-xs.onnx, which is based on our own phonemization.
Implement a preliminary runtime for the VITS voice is-steinn-medium.onnx.
The voice 'is-steinn-medium.onnx' uses phonemization based on the ancient eSpeak IPA dialect and was trained purely on eSpeak phonemization.
As we are not using eSpeak inside Símarómur, this change takes a naive approach: it emulates the eSpeak IPA dialect and adapts the model inputs with the appropriate phoneme conversions, such as padding every symbol with 0, adding BOS and EOS, etc.
The resulting voice quality is quite acceptable for demo purposes, and the runtime performance looks promising.
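For reviewers, the input adaptation mentioned above (padding with 0, adding BOS and EOS) can be sketched roughly as follows. This is only an illustration, not the code in this PR: the class name `VitsInputSketch`, the id values for BOS and EOS, and the exact layout (a pad directly after BOS and after every phoneme id) are assumptions. The real ids come from the model configuration file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of this PR: builds the id sequence fed to the VITS model. */
final class VitsInputSketch {
    // Assumed special ids; the real values are defined by the model configuration file.
    static final long PAD_ID = 0;
    static final long BOS_ID = 1;
    static final long EOS_ID = 2;

    /** Returns BOS, PAD, then every phoneme id followed by PAD, then EOS. */
    static long[] toModelInput(List<Long> phonemeIds) {
        List<Long> out = new ArrayList<>();
        out.add(BOS_ID);
        out.add(PAD_ID);
        for (long id : phonemeIds) {
            out.add(id);
            out.add(PAD_ID); // intersperse the pad symbol after each phoneme
        }
        out.add(EOS_ID);

        // Convert to a primitive array, the usual shape for an int64 model input.
        long[] result = new long[out.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = out.get(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // The phoneme ids 25, 14, 31 are made up for the example.
        long[] input = toModelInput(List.of(25L, 14L, 31L));
        System.out.println(Arrays.toString(input)); // prints [1, 0, 25, 0, 14, 0, 31, 0, 2]
    }
}
```

The padded sequence is roughly twice as long as the phoneme list, which is worth keeping in mind when sizing input buffers.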
Add onnxruntime for the voice model and add a new TTSEngineOnnx class, which handles all ONNX model loading and inference.
Add pronunciation for VITS via the class PronunciationVits, and add corresponding classes for the other pronunciation formats in use: PronuncationFP2 and PronunciationFlite.
Add a pronunciation dictionary mapping Word -> IPA symbols. These symbols use a compressed format without any padding or spaces in between, so each IPA pronunciation has to be retokenized to split the dictionary entry into single symbols before these can be converted to the input ids for the VITS model.
Add a new POJO class VitsConfig that interprets the VITS model configuration file so that the phonetic alphabet -> phoneme id mapping can be read.
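To illustrate the retokenization step for the compressed dictionary entries, here is a minimal sketch that assumes a greedy longest-match strategy against the symbols known from the phoneme id mapping. The class `IpaRetokenizerSketch`, the symbol table, and the ids are hypothetical and only serve the example. The actual implementation in this PR may split entries differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not part of this PR: splits a compressed IPA string into known symbols. */
final class IpaRetokenizerSketch {

    /** Greedy longest match: at each position take the longest symbol present in the mapping. */
    static List<String> tokenize(String ipa, Map<String, Long> symbolToId) {
        int maxLen = 1;
        for (String symbol : symbolToId.keySet()) {
            maxLen = Math.max(maxLen, symbol.length());
        }
        List<String> symbols = new ArrayList<>();
        int pos = 0;
        while (pos < ipa.length()) {
            String match = null;
            for (int len = Math.min(maxLen, ipa.length() - pos); len >= 1; len--) {
                String candidate = ipa.substring(pos, pos + len);
                if (symbolToId.containsKey(candidate)) {
                    match = candidate;
                    break;
                }
            }
            if (match == null) {
                throw new IllegalArgumentException("Unknown symbol at index " + pos + " in \"" + ipa + "\"");
            }
            symbols.add(match);
            pos += match.length();
        }
        return symbols;
    }

    /** Tokenizes the entry and looks up the id of every symbol. */
    static List<Long> toIds(String ipa, Map<String, Long> symbolToId) {
        List<Long> ids = new ArrayList<>();
        for (String symbol : tokenize(ipa, symbolToId)) {
            ids.add(symbolToId.get(symbol));
        }
        return ids;
    }

    public static void main(String[] args) {
        // Made-up symbol table; the real one would be read from the model configuration.
        Map<String, Long> symbolToId = new HashMap<>();
        symbolToId.put("t", 3L);
        symbolToId.put("t\u02B0", 4L); // aspirated t
        symbolToId.put("a", 5L);
        symbolToId.put("i", 6L);
        symbolToId.put("ai", 7L);      // diphthong kept as one symbol
        symbolToId.put("\u02D0", 8L);  // length mark

        // Compressed entry without spaces between the symbols.
        System.out.println(toIds("t\u02B0ai\u02D0", symbolToId)); // prints [4, 7, 8]
    }
}
```

Greedy matching keeps multi-character symbols together, but it can pick the wrong split when the symbol set is ambiguous, so failing loudly on unknown symbols helps to spot dictionary entries that need attention.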