rhasspy / piper

A fast, local neural text to speech system
https://rhasspy.github.io/piper-samples/
MIT License
4.38k stars 297 forks

Kernel size can't be greater than actual input size #482

Closed ShawnHymel closed 3 weeks ago

ShawnHymel commented 3 weeks ago

I'm trying to convert LLM replies to speech via Piper TTS. The call sometimes fails with the following traceback:

Traceback (most recent call last):
  File "/home/pi/Projects/GitHub/hopper-chat/hopper-llama.py", line 285, in <module>
    wav = tts.tts(text=f"Here. {reply}")
  File "/home/pi/.local/lib/python3.9/site-packages/TTS/api.py", line 276, in tts
    wav = self.synthesizer.tts(
  File "/home/pi/.local/lib/python3.9/site-packages/TTS/utils/synthesizer.py", line 398, in tts
    outputs = synthesis(
  File "/home/pi/.local/lib/python3.9/site-packages/TTS/tts/utils/synthesis.py", line 221, in synthesis
    outputs = run_model_torch(
  File "/home/pi/.local/lib/python3.9/site-packages/TTS/tts/utils/synthesis.py", line 53, in run_model_torch
    outputs = _func(
  File "/home/pi/.local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/pi/.local/lib/python3.9/site-packages/TTS/tts/models/forward_tts.py", line 689, in inference
    o_en, x_mask, g, _ = self._forward_encoder(x, x_mask, g)
  File "/home/pi/.local/lib/python3.9/site-packages/TTS/tts/models/forward_tts.py", line 408, in _forward_encoder
    o_en = self.encoder(torch.transpose(x_emb, 1, -1), x_mask, g)
  File "/home/pi/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/pi/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/pi/.local/lib/python3.9/site-packages/TTS/tts/layers/feed_forward/encoder.py", line 161, in forward
    o = self.encoder(x, x_mask)
  File "/home/pi/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/pi/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/pi/.local/lib/python3.9/site-packages/TTS/tts/layers/feed_forward/encoder.py", line 71, in forward
    o = self.res_conv_block(o, x_mask)
  File "/home/pi/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/pi/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/pi/.local/lib/python3.9/site-packages/TTS/tts/layers/generic/res_conv_bn.py", line 123, in forward
    o = block(o)
  File "/home/pi/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/pi/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/pi/.local/lib/python3.9/site-packages/TTS/tts/layers/generic/res_conv_bn.py", line 79, in forward
    return self.conv_bn_blocks(x)
  File "/home/pi/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/pi/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/pi/.local/lib/python3.9/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/home/pi/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/pi/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/pi/.local/lib/python3.9/site-packages/TTS/tts/layers/generic/res_conv_bn.py", line 42, in forward
    o = self.conv1d(x)
  File "/home/pi/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/pi/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/pi/.local/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 310, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/pi/.local/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 306, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Calculated padded input size per channel: (4). Kernel size: (7). Kernel size can't be greater than actual input size

As best I can tell, the model needs sentences longer than 2 words; otherwise it throws this error. For example, when I ask the LLM for a joke, it returns the following reply, which gets split into 3 sentences:

 > Text splitted to sentences.
['Here.', "Why don't scientists trust atoms?", 'Because they make up everything!']

I'm using the en_US-lessac-medium model.

Any ideas on how to fix this? I assume I could prompt the LLM to produce sentences of 3+ words and/or check the sentences manually in Python, but I'd expect a TTS engine to be able to handle 1-word sentences.
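In case it helps anyone else, here is a rough sketch of the "check the sentences manually in Python" workaround: merge any sentence shorter than a word threshold into its neighbor before handing the chunks to the TTS call. `MIN_WORDS = 3` is an arbitrary guess based on the error above, not a documented limit.

```python
MIN_WORDS = 3  # arbitrary threshold; not a documented model limit

def merge_short_sentences(sentences, min_words=MIN_WORDS):
    """Merge sentences shorter than min_words into the following
    sentence so every chunk sent to the TTS engine is long enough."""
    merged = []
    buffer = ""
    for sent in sentences:
        candidate = f"{buffer} {sent}".strip() if buffer else sent
        if len(candidate.split()) < min_words:
            buffer = candidate  # still too short; keep accumulating
        else:
            merged.append(candidate)
            buffer = ""
    if buffer:  # trailing short fragment: attach it to the last chunk
        if merged:
            merged[-1] = f"{merged[-1]} {buffer}"
        else:
            merged.append(buffer)
    return merged

print(merge_short_sentences(
    ['Here.', "Why don't scientists trust atoms?",
     'Because they make up everything!']
))
# → ["Here. Why don't scientists trust atoms?",
#    'Because they make up everything!']
```

With the example reply above, the 1-word sentence `'Here.'` gets folded into the joke's setup, so nothing shorter than the threshold ever reaches the synthesizer.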

ShawnHymel commented 3 weeks ago

Wow... oops. I was mixing my TTS engines. I used Coqui TTS before transitioning to Piper, and my code was still calling the Coqui engine. Single-word sentences work fine with Piper. Closing the issue.
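For anyone landing here from the same error: a minimal sketch of calling Piper directly, assuming the `piper` CLI from this repo is installed and the en_US-lessac-medium voice files have been downloaded (filenames and paths are placeholders).

```shell
# Synthesize a single-word sentence with the Piper CLI.
# Assumes en_US-lessac-medium.onnx (and its .onnx.json config)
# sit in the current directory.
echo 'Here.' | piper \
  --model en_US-lessac-medium.onnx \
  --output_file here.wav
```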