rendchevi / nix-tts

🐤 Nix-TTS: Lightweight and End-to-end Text-to-Speech via Module-wise Distillation
MIT License

Tuning and compilation with Apache TVM. #8

Open pawelurbanski opened 2 years ago

pawelurbanski commented 2 years ago

Dear Developers, I spent some time trying to figure out how to compile Nix-TTS using Apache TVM. The idea is similar in concept to your compilation for Raspberry Pi. I discovered the following:

  1. Apache TVM requires inputs to have static, not dynamic, shapes.
  2. The Nix-TTS models differ in how the shapes of their inputs and outputs are constructed, depending on whether the model is deterministic or stochastic.

It is possible to change an input shape to static using, for example, the sclblonnx package. Unfortunately, I am not familiar enough with neural network frameworks to work out the fixes required to allow compilation.
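To illustrate what I mean by changing an input shape to static, here is a minimal sketch that uses the plain `onnx` Python package rather than sclblonnx. The file names and the sizes in the usage note are hypothetical, and overwriting the declared input dimensions does not by itself guarantee that every shape-dependent operator inside the graph will accept them.

```python
def pin_input_shapes(model, static_shapes):
    """Overwrite the declared dims of named graph inputs with fixed sizes.

    `model` is expected to be an onnx.ModelProto. `static_shapes` maps an
    input name to a tuple of ints, e.g. {"c": (1, 128)} (sizes are assumptions).
    Returns the names of the inputs that were changed.
    """
    changed = []
    for graph_input in model.graph.input:
        shape = static_shapes.get(graph_input.name)
        if shape is None:
            continue  # leave inputs we were not asked about untouched
        dims = graph_input.type.tensor_type.shape.dim
        if len(dims) != len(shape):
            raise ValueError(
                f"{graph_input.name}: rank {len(dims)} != requested rank {len(shape)}"
            )
        for dim, size in zip(dims, shape):
            # On a real protobuf Dimension, setting dim_value also clears
            # any symbolic dim_param, because the two share a oneof.
            dim.dim_value = size
        changed.append(graph_input.name)
    return changed


def rewrite_file(src_path, dst_path, static_shapes):
    """Load an ONNX file, pin its input shapes, validate, and save a copy."""
    import onnx  # imported lazily so the helper above has no hard dependency

    model = onnx.load(src_path)
    pin_input_shapes(model, static_shapes)
    onnx.checker.check_model(model)
    onnx.save(model, dst_path)
```

A call could look like `rewrite_file("encoder.onnx", "encoder_static.onnx", {"c": (1, 128), "c_lengths": (1,)})`, where both paths and the length 128 are placeholders and not values taken from the Nix-TTS repository.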

Since the deterministic model is faster and its voice quality is still fantastic, I had a look at its inputs:

  1. The "encoder" network's most important inputs are both dynamic:
     * Input: "c" as INT64 with dimensions: [0, 0]
     * Input: "c_lengths" as INT64 with dimension: [0]
  2. The "decoder" network's only input is as follows:
     * Input: "z" as FLOAT with dimensions: [0, 0, 0]
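If the encoder inputs were fixed to, say, [1, MAX_TOKENS] for "c" and [1] for "c_lengths", the caller would have to pad every token sequence to that length. The sketch below shows only that padding step in plain Python. `MAX_TOKENS` and `PAD_ID` are my assumptions, and I have not verified that the encoder ignores padded positions based on "c_lengths".

```python
MAX_TOKENS = 128  # assumed fixed sequence length, not taken from Nix-TTS
PAD_ID = 0        # assumed padding token id, not taken from Nix-TTS


def pad_tokens(token_ids, max_tokens=MAX_TOKENS, pad_id=PAD_ID):
    """Pad one token sequence to a fixed length.

    Returns (c, c_lengths) as nested lists shaped [1, max_tokens] and [1],
    matching the ranks of the encoder inputs "c" and "c_lengths".
    """
    n = len(token_ids)
    if n > max_tokens:
        raise ValueError(f"sequence of {n} tokens exceeds fixed length {max_tokens}")
    c = [list(token_ids) + [pad_id] * (max_tokens - n)]
    c_lengths = [n]  # the true length, so the model could mask the padding
    return c, c_lengths


c, c_lengths = pad_tokens([5, 7, 9], max_tokens=6)
print(c)          # [[5, 7, 9, 0, 0, 0]]
print(c_lengths)  # [3]
```

Both lists would still need converting to INT64 arrays before being fed to a runtime.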

What would be the procedure to make the inputs static for the deterministic model, if that is possible? And what would it be for the stochastic model? Last but not least: since the deterministic model's encoder output has a dynamic shape and serves as the decoder input, would it be possible to merge both graphs into a single model file?

While I don't know whether this can be implemented, I will be more than happy to describe the procedure or publish the compiled model.

Thank you in advance for any hints and feedback...