toverainc / willow-inference-server

Open source, local, and self-hosted highly optimized language inference server supporting ASR/STT, TTS, and LLM across WebRTC, REST, and WS
Apache License 2.0

T2S - text longer than 600 characters causes "RuntimeError: The size of tensor a () must match the size of tensor b (600) at non-singleton dimension 1" #127

Open ivan-homoliak-sutd opened 1 year ago

ivan-homoliak-sutd commented 1 year ago

Using /api/tts with text longer than 600 characters, WIS outputs the following error:

RuntimeError: The size of tensor a (607) must match the size of tensor b (600) at non-singleton dimension 1

I am using a Tesla T4. I am not sure whether this is intentional, e.g. to encourage users to split the text across more GET requests. The URL length might still allow a few more characters (though it has its own limits). Maybe an alternative POST API method (for longer texts) would be a nice option.
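
For reference, a minimal reproduction sketch - the /api/tts path comes from this report, but the `text` query parameter name, host, and port are assumptions:

```python
# Reproduction sketch, not an official client: the "text" query parameter name,
# host, port, and TLS settings are assumptions; only the /api/tts path is from this report.
import requests

WIS_TTS_URL = "https://localhost:19000/api/tts"  # assumed local WIS instance

long_text = "This sentence is repeated a few times to push past the limit. " * 10

# With more than ~600 characters the request fails and the server logs:
# RuntimeError: The size of tensor a (607) must match the size of tensor b (600) ...
resp = requests.get(WIS_TTS_URL, params={"text": long_text}, verify=False)
print(resp.status_code)
```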

kristiankielhofner commented 1 year ago

Based on that output, the issue here isn't the URL (although you're correct - that is a very long URL). The issue is the input limit of the model. Regardless of how the text gets to WIS, the model is going to be the limiting factor.

In this case, you're just past the character limit of the SpeechT5 model we currently use.
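
One possible client-side workaround - a sketch, not something WIS provides - is to split long input at sentence boundaries so each request stays under that limit, then synthesize the chunks separately:

```python
# Workaround sketch (client-side, not part of WIS): greedily pack whole sentences
# into chunks that each fit under the model limit reported in this issue.
import re

def chunk_text(text: str, limit: int = 500) -> list[str]:
    # limit kept below the reported 600 to leave headroom, since the model's limit
    # applies to the processed input rather than raw characters (an assumption).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that is itself longer than the limit.
        pieces = [sentence[i:i + limit] for i in range(0, len(sentence), limit)] or [sentence]
        for piece in pieces:
            candidate = f"{current} {piece}".strip()
            if len(candidate) <= limit:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks

# Each chunk can then be sent as its own /api/tts request and the returned
# audio concatenated client-side.
```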

Generally speaking, Willow isn't currently designed for TTS output that long, but we realize it can be frustrating for users of ChatGPT or similar services that routinely provide output longer than what our current TTS model supports. In the six months of Willow development (as of today!) our primary focus has been scenarios where the TTS input doesn't come close to 600 characters. For our current focus on home automation platforms, even things like weather reports end up around 50-100 characters at most.

As this and your other issues demonstrate, TTS is currently something of a weak spot for us. Generally speaking, the open models people can self-host are quite a bit behind what you get from hosted commercial providers, and people are often surprised by this. Needless to say, the commercially available TTS models have had untold resources dedicated to their development, hosting, etc. - resources you just don't see in the self-hosted model ecosystem.

All of that said, we've been evaluating a variety of other self-hostable models and implementations. Unfortunately, everything we've tried so far has issues ranging from voice quality to execution speed, where even on the latest and greatest GPUs it can take a VERY long time to produce TTS output - especially for cases like this where you have a lot of input text. With a Tesla T4 and these alternate implementations you would likely be waiting well over a minute for TTS feedback, which we view as completely unacceptable. As I was saying, even my RTX 3090 at home is far too slow with these implementations for the level of interactivity we aim for.

It's something we're well aware of, but as of today there aren't any better options for self-hosted use cases like this.