WIP: Feat/experimental vits

This is a placeholder pull request. I will clean it up and amend it as appropriate once we have the new voice `is-steinn-xs.onnx`, which is based on our own phonemization.

Implement a preliminary runtime for VITS voice `is-steinn-medium.onnx`

The voice `is-steinn-medium.onnx` uses phonemization based on the ancient eSpeak IPA dialect and was trained purely on eSpeak phonemization. As we are not using eSpeak inside Símarómur, this PR tries a naive approach: it emulates the eSpeak IPA dialect and adapts the model inputs with the appropriate phoneme conversions, such as padding every symbol with 0, adding BOS and EOS, etc.
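To illustrate the input adaptation described above, here is a minimal sketch of one plausible layout: BOS, then a pad, then every symbol followed by a pad, then EOS. The class name `PhonemeIdSketch`, the special symbols `_`, `^` and `$`, and the example IDs are assumptions for illustration only; the actual symbols and IDs come from the model configuration, and the code in this PR may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not the actual PronunciationVits code.
final class PhonemeIdSketch {
    // Assumed special symbols; the real ones are defined by the model config.
    static final String PAD = "_";
    static final String BOS = "^";
    static final String EOS = "$";

    /** Builds the id sequence BOS, PAD, (symbol, PAD)*, EOS. Unknown symbols are skipped. */
    static long[] toInputIds(List<String> symbols, Map<String, Integer> idMap) {
        int pad = idMap.get(PAD);
        List<Integer> ids = new ArrayList<>();
        ids.add(idMap.get(BOS));
        ids.add(pad);
        for (String symbol : symbols) {
            Integer id = idMap.get(symbol);
            if (id == null) {
                continue; // naive choice: silently drop symbols the model does not know
            }
            ids.add(id);
            ids.add(pad);
        }
        ids.add(idMap.get(EOS));

        // Copy into a long[]; int64 model inputs are assumed here.
        long[] result = new long[ids.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = ids.get(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // Example ids are made up for this sketch.
        Map<String, Integer> idMap = Map.of("_", 0, "^", 1, "$", 2, "a", 5, "b", 6);
        System.out.println(java.util.Arrays.toString(toInputIds(List.of("a", "b"), idMap)));
    }
}
```

Skipping unknown symbols keeps the sketch short; logging them or mapping them to a fallback symbol would be equally reasonable.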

The resulting voice quality is quite acceptable for demo purposes, and the runtime performance looks promising.

- Add onnxruntime for the voice model and a new TTSEngineOnnx class, which handles all ONNX model loading and inference
- Add pronunciation for VITS via the class PronunciationVits, and add corresponding classes for the other pronunciation formats in use: PronunciationFP2 and PronunciationFlite
- Add a pronunciation dictionary mapping word -> IPA symbols. The entries use a compressed format without any padding or spaces in between, so each IPA pronunciation has to be retokenized into single symbols before these can be converted to the input ids for the VITS model
- Add a new POJO class VitsConfig that interprets the VITS model configuration file, so that the phonetic alphabet -> phoneme id mapping can be read
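The retokenization step for compressed dictionary entries could be sketched as a greedy longest-match split against the model's symbol set. This is only an illustration under assumptions: the class name `IpaRetokenizerSketch` is hypothetical, the symbol set is made up, and the sketch assumes that some symbols may span more than one character; the implementation in this PR may work differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not the code from this PR.
final class IpaRetokenizerSketch {
    /**
     * Splits a compressed IPA string into symbols by always taking the
     * longest known symbol at the current position.
     */
    static List<String> tokenize(String ipa, Set<String> symbols) {
        int maxLen = 1;
        for (String s : symbols) {
            maxLen = Math.max(maxLen, s.length());
        }
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < ipa.length()) {
            int matched = 0;
            // Try the longest candidate first, then shorter ones.
            for (int len = Math.min(maxLen, ipa.length() - i); len >= 1; len--) {
                if (symbols.contains(ipa.substring(i, i + len))) {
                    matched = len;
                    break;
                }
            }
            if (matched == 0) {
                throw new IllegalArgumentException(
                        "Unknown symbol at index " + i + " in \"" + ipa + "\"");
            }
            out.add(ipa.substring(i, i + matched));
            i += matched;
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up symbol set; \u02b0 is the aspiration mark.
        Set<String> symbols = Set.of("t", "t\u02b0", "a", "i", "ai");
        System.out.println(tokenize("t\u02b0ai", symbols));
    }
}
```

Greedy matching is a simplification: it can pick the wrong split when a long symbol overlaps two shorter ones, so real dictionary entries would need to be checked against it.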

grammatek / simaromur