oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

Silero TTS Extension Unable to Handle Long Text Inputs #3653

Closed. 2kbits closed this issue 10 months ago.

2kbits commented 1 year ago

Describe the bug

While using the Silero TTS extension, I encountered an error when providing long text inputs. The model seems to have a limitation on the length of the input text it can handle.

Expected Behavior: The extension should be able to handle longer text (maybe break it up into batches?) and generate corresponding audio without any errors.

Actual Behavior: I received an error indicating that the text string is longer than 1000 symbols, and the chat output gets stuck at "Is recording a voice message..."

Possible Solution: Maybe not the best approach, but one way could be to break the input text into smaller chunks, process each chunk with Silero, and then concatenate the audio outputs.
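As a rough sketch of that idea (the 800-character limit and the helper names are arbitrary assumptions; the synthesis call itself is left out, since the extension's existing save_wav would be used per chunk):

import wave

def split_into_chunks(text, limit=800):
    # Group whole sentences into chunks that stay under the assumed character limit.
    chunks, current = [], ''
    for sentence in text.split('. '):
        if current and len(current) + len(sentence) + 2 > limit:
            chunks.append(current)
            current = sentence
        else:
            current = f'{current}. {sentence}' if current else sentence
    if current:
        chunks.append(current)
    return chunks

def concatenate_wavs(paths, output_path):
    # Stitch the per-chunk WAV files back into a single file.
    data = []
    for p in paths:
        with wave.open(str(p), 'rb') as w:
            data.append((w.getparams(), w.readframes(w.getnframes())))
    with wave.open(str(output_path), 'wb') as out:
        out.setparams(data[0][0])
        for _, frames in data:
            out.writeframes(frames)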

Additional Information:

Would appreciate any guidance or fixes for this issue. Thank you!

Is there an existing issue for this?

Reproduction

Steps to Reproduce:

  1. Clone and set up the text-generation-webui repository.
  2. Run the application with the Silero extension enabled and increase max_new_tokens to over 999 tokens.
  3. Load any model and ask it something that requires it to generate some long text (ex: "explain to me how caffeine works").
  4. Silero TTS should try to generate audio and crash with the error "Exception: Model couldn't generate your text, probably it's too long".

Screenshot

[Screenshot: Capture]

Logs

Output generated in 23.69 seconds (16.55 tokens/s, 392 tokens, context 49, seed 1579299594)
<torch_package_0>.multi_acc_v3_package.py:68: UserWarning: Text string is longer than 1000 symbols.
  warnings.warn('Text string is longer than 1000 symbols.')
Traceback (most recent call last):
  File "<torch_package_0>.multi_acc_v3_package.py", line 338, in apply_tts
    out, out_lens = self.model(**model_kwargs)
  File "C:\Users\tkbits\Desktop\AI\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File ".data/ts_code/code/__torch__/silero_vocoder/jit_model/___torch_mangle_138.py", line 29, in forward
    _3 = torch.to(durs_rate, torch.device(device))
    _4 = torch.to(pitch_coefs, torch.device(device))
    _5 = (tts_model).forward(_1, _2, sr, symb_durs, _3, _4, gt_durs, gt_pitch, device, )
          ~~~~~~~~~~~~~~~~~~ <--- HERE
    audio, audio_lengths, = _5
    return (audio, audio_lengths)
  File ".data/ts_code/code/__torch__/silero_vocoder/jit_model/___torch_mangle_137.py", line 153, in forward
      pitch_hat = unchecked_cast(Tensor, gt_pitch)
    tacotron = self.tacotron
    mel_outputs = (tacotron).forward(sequence, speaker_ids, orig_mask, dur_hat, pitch_hat, )
                   ~~~~~~~~~~~~~~~~~ <--- HERE
    if torch.__isnot__(symb_durs4, None):
      symb_durs20 = unchecked_cast(Dict[int, int], symb_durs4)
  File ".data/ts_code/code/__torch__/jit_forward_model/___torch_mangle_132.py", line 43, in forward
    encoder_outputs_expanded = (len_reg).forward(cond_encoder_outputs0, dur_hat, )
    decoder = self.decoder
    outputs_expanded = (decoder).forward(encoder_outputs_expanded, None, )
                        ~~~~~~~~~~~~~~~~ <--- HERE
    lin = self.lin
    mel_outputs = (lin).forward(outputs_expanded, )
  File ".data/ts_code/code/__torch__/tacotron2/fastpitch_layers.py", line 118, in forward
    x8 = torch.transpose(x, 0, 1)
    pos_encoder = self.pos_encoder
    x9 = (pos_encoder).forward(x8, )
          ~~~~~~~~~~~~~~~~~~~~ <--- HERE
    x10 = torch.transpose(x9, 0, 1)
    pre_vanilla_layers = self.pre_vanilla_layers
  File ".data/ts_code/code/__torch__/tacotron2/fastpitch_layers.py", line 42, in forward
    _0 = torch.slice(pe, 0, None, torch.size(x, 0))
    _1 = torch.mul(scale, torch.slice(_0, 1))
    x7 = torch.add(x, _1)
         ~~~~~~~~~ <--- HERE
    dropout = self.dropout
    return (dropout).forward(x7, )

Traceback of TorchScript, original code (most recent call last):
  File "../../silero_vocoder/jit_model.py", line 69, in forward
        sequence, symb_durs, durs_rate, pitch_coefs = self.merge_batch_model(sentences, break_lens, prosody_rates, prosody_pitches)

        audio, audio_lengths = self.tts_model(sequence=sequence.to(device),
                               ~~~~~~~~~~~~~~ <--- HERE
                                              speaker_ids=speaker_ids.to(device),
                                              sr=sr,
  File "../../silero_vocoder/jit_model.py", line 457, in forward
            pitch_hat = gt_pitch

        mel_outputs = self.tacotron(sequence, speaker_ids, orig_mask, dur_hat, pitch_hat)
                      ~~~~~~~~~~~~~ <--- HERE
        if symb_durs is not None and len(symb_durs) > 0:
            mel_outputs = self.fx_pauses(mel_outputs, dur_hat, symb_durs)
  File "/home/keras/notebook/nvme2/islanna/silero_vocoder/tacotron2/jit_forward_model.py", line 110, in forward

        # [B, Lexp, Denc]
        outputs_expanded = self.decoder(encoder_outputs_expanded,
                           ~~~~~~~~~~~~ <--- HERE
                                        src_pad_mask=None)

  File "../tacotron2/fastpitch_layers.py", line 335, in forward
        x = x.transpose(0, 1)
        # [L, B, d_m]
        x = self.pos_encoder(x)
            ~~~~~~~~~~~~~~~~ <--- HERE
        # [B, L, d_m]
        x = x.transpose(0, 1)
  File "../tacotron2/fastpitch_layers.py", line 53, in forward
    def forward(self, x: torch.Tensor) -> torch.Tensor:         # shape: [T, N]
        x = x + self.scale * self.pe[:x.size(0), :]
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        return self.dropout(x)
RuntimeError: The size of tensor a (7094) must match the size of tensor b (5000) at non-singleton dimension 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\tkbits\Desktop\AI\oobabooga_windows\installer_files\env\lib\site-packages\gradio\routes.py", line 427, in run_predict
    output = await app.get_blocks().process_api(
  File "C:\Users\tkbits\Desktop\AI\oobabooga_windows\installer_files\env\lib\site-packages\gradio\blocks.py", line 1323, in process_api
    result = await self.call_function(
  File "C:\Users\tkbits\Desktop\AI\oobabooga_windows\installer_files\env\lib\site-packages\gradio\blocks.py", line 1067, in call_function
    prediction = await utils.async_iteration(iterator)
  File "C:\Users\tkbits\Desktop\AI\oobabooga_windows\installer_files\env\lib\site-packages\gradio\utils.py", line 336, in async_iteration
    return await iterator.__anext__()
  File "C:\Users\tkbits\Desktop\AI\oobabooga_windows\installer_files\env\lib\site-packages\gradio\utils.py", line 329, in __anext__
    return await anyio.to_thread.run_sync(
  File "C:\Users\tkbits\Desktop\AI\oobabooga_windows\installer_files\env\lib\site-packages\anyio\to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "C:\Users\tkbits\Desktop\AI\oobabooga_windows\installer_files\env\lib\site-packages\anyio\_backends\_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "C:\Users\tkbits\Desktop\AI\oobabooga_windows\installer_files\env\lib\site-packages\anyio\_backends\_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "C:\Users\tkbits\Desktop\AI\oobabooga_windows\installer_files\env\lib\site-packages\gradio\utils.py", line 312, in run_sync_iterator_async
    return next(iterator)
  File "C:\Users\tkbits\Desktop\AI\oobabooga_windows\text-generation-webui\modules\chat.py", line 305, in generate_chat_reply_wrapper
    for i, history in enumerate(generate_chat_reply(text, state, regenerate, _continue, loading_message=True)):
  File "C:\Users\tkbits\Desktop\AI\oobabooga_windows\text-generation-webui\modules\chat.py", line 290, in generate_chat_reply
    for history in chatbot_wrapper(text, state, regenerate=regenerate, _continue=_continue, loading_message=loading_message):
  File "C:\Users\tkbits\Desktop\AI\oobabooga_windows\text-generation-webui\modules\chat.py", line 261, in chatbot_wrapper
    output['visible'][-1][1] = apply_extensions('output', output['visible'][-1][1], state, is_chat=True)
  File "C:\Users\tkbits\Desktop\AI\oobabooga_windows\text-generation-webui\modules\extensions.py", line 223, in apply_extensions
    return EXTENSION_MAP[typ](*args, **kwargs)
  File "C:\Users\tkbits\Desktop\AI\oobabooga_windows\text-generation-webui\modules\extensions.py", line 81, in _apply_string_extensions
    text = func(*args, **kwargs)
  File "C:\Users\tkbits\Desktop\AI\oobabooga_windows\text-generation-webui\extensions\silero_tts\script.py", line 129, in output_modifier
    model.save_wav(ssml_text=silero_input, speaker=params['speaker'], sample_rate=int(params['sample_rate']), audio_path=str(output_file))
  File "<torch_package_0>.multi_acc_v3_package.py", line 366, in save_wav
    audio = self.apply_tts(text=text,
  File "<torch_package_0>.multi_acc_v3_package.py", line 340, in apply_tts
    raise Exception("Model couldn't generate your text, probably it's too long")
Exception: Model couldn't generate your text, probably it's too long

System Info

OS Name:                Microsoft Windows 10 Pro
OS Version:             10.0.19045 N/A Build 19045
CPU:                    AMD Ryzen 5 3600 6-Core Processor
RAM:                    32 GB
GPU:                    NVIDIA GeForce RTX 3090
GPU Driver version:     31.0.15.3623

RandomInternetPreson commented 1 year ago

I don't think this is necessarily a bug; this happens to me too, and I think it's just an intrinsic property of the extension.

2kbits commented 1 year ago

I'm not entirely sure how it works, as I'm still new to working with public projects, but even so I was hoping we could break the text into sizable chunks for the extension instead of it just breaking randomly. I'm not sure how feasible that is, though.

Quidam2k commented 1 year ago

I'm half tempted to throw the extension's code into GPT and see if it has any ideas on how to modify it to generate the speech in batches every time it sees a newline character, but I'd love it if someone beat me to it.
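(For what it's worth, the split itself would be simple, something like the hypothetical helper below; the harder part, worked out in the later comments, is feeding each piece to save_wav and joining the audio afterwards.)

def split_on_newlines(text):
    # Hypothetical helper: one TTS batch per non-empty line of the reply.
    return [line.strip() for line in text.split('\n') if line.strip()]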

Upcycle-Electronics commented 1 year ago

Here is a hack to use in the interim (just replace the output_modifier method in script.py with this one). I arbitrarily check the raw string length; if it is too large, I split the output string into sentences, cut the output off at an arbitrary length, drop the partial sentence that the cut lands in, and rebuild the string at the smaller length. The part of the string that was truncated is printed in the terminal (but not the portion of the cut sentence from before the split). This only affects the Silero audio, so you'll still see the whole text output in the WebUI.

Ideally, this link has a better potential fix (https://github.com/snakers4/silero-models/pull/174), but I have not been able to get it fully integrated yet. I have not managed to save the output from the Textgen Silero extension as a tensor so I can pass it to the methods in the BigTextToAudio class and use torch cat. I had something working standalone a couple of weeks ago that used a temp file on the system to generate a series of wave files and then concatenated them afterwards, but I can't seem to find where I put it, and my sloppy mess was not portable or worth sharing. I'm just giving ideas. The hack below could be done better, but it works to stop the freezing issue.

def output_modifier(string, state):
    global model, current_params, streaming_state

    for i in params:
        if params[i] != current_params[i]:
            model = load_model()
            current_params = params.copy()
            break

    if not params['activate']:
        return string

    original_string = string
    string = tts_preprocessor.preprocess(html.unescape(string))

    if string == '':
        string = '*Empty reply, try regenerating*'
    # elif len(string) > 800:
        # print("Debug Test 1: calling generate_audio(string)")
        # generate_audio(string)
    else:
        output_file = Path(f'extensions/silero_tts/outputs/{state["character_menu"]}_{int(time.time())}.wav')
        prosody = '<prosody rate="{}" pitch="{}">'.format(params['voice_speed'], params['voice_pitch'])
        silero_input = f'<speak>{prosody}{xmlesc(string)}</prosody></speak>'
        if len(silero_input) > 900:    # added the following few lines to truncate long failing outputs
            print("truncating the following + the last split sentence = ", silero_input[900:])
            sentences = silero_input[:900].rstrip().split('.')
            silero_input = ''.join([sentence + '.' for sentence in sentences[:-1]])+"</prosody></speak>"
        model.save_wav(ssml_text=silero_input, speaker=params['speaker'], sample_rate=int(params['sample_rate']), audio_path=str(output_file))

        autoplay = 'autoplay' if params['autoplay'] else ''
        string = f'<audio src="file/{output_file.as_posix()}" controls {autoplay}></audio>'
        if params['show_text']:
            string += f'\n\n{original_string}'

    shared.processing_message = "*Is typing...*"
    return string

AndrewBenavides commented 1 year ago

@Upcycle-Electronics I hope you don't mind, but I fiddled around with your hack and was able to generate slices of the original message and concatenate the wave files into one complete output of the message. I shortened the slice length to 800 characters because I was still occasionally hitting 'too long' errors at 900. Also, apologies if my code seems janky -- Python isn't my primary language and I didn't have a proper debug/test cycle set up.

import wave

def output_modifier(string, state):
    global model, current_params, streaming_state

    for i in params:
        if params[i] != current_params[i]:
            model = load_model()
            current_params = params.copy()
            break

    if not params['activate']:
        return string

    original_string = string
    string = tts_preprocessor.preprocess(html.unescape(string))

    if string == '':
        string = '*Empty reply, try regenerating*'
    else:
        xmlesc_string = xmlesc(string)
        prosody = '<prosody rate="{}" pitch="{}">'.format(params['voice_speed'], params['voice_pitch'])
        output_file = Path(f'extensions/silero_tts/outputs/{state["character_menu"]}_{int(time.time())}.wav')
        wave_data = []
        slice = 800
        while len(xmlesc_string) > 0:
            silero_open_tags = f'<speak>{prosody}'
            silero_close_tags = '</prosody></speak>'
            silero_input = f'{silero_open_tags}{xmlesc_string}{silero_close_tags}'
            if (len(silero_input) > slice):
                print("truncating the following + the last split sentence =", silero_input[slice:])
                sentences = silero_input[:slice].rstrip().split('.')
                silero_input = ''.join([sentence + '.' for sentence in sentences[:-1]])+"</prosody></speak>"
            string_len = len(silero_input) - len(silero_open_tags) - len(silero_close_tags)
            xmlesc_string = xmlesc_string[string_len:]
            model.save_wav(ssml_text=silero_input, speaker=params['speaker'], sample_rate=int(params['sample_rate']), audio_path=str(output_file))
            with wave.open(str(output_file), 'rb') as w:
                wave_data.append( [w.getparams(), w.readframes(w.getnframes())] )
            Path.unlink(output_file)

        with wave.open(str(output_file), 'wb') as output:
            output.setparams(wave_data[0][0])
            for i in range(len(wave_data)):
                output.writeframes(wave_data[i][1])

        autoplay = 'autoplay' if params['autoplay'] else ''
        string = f'<audio src="file/{output_file.as_posix()}" controls {autoplay}></audio>'
        if params['show_text']:
            string += f'\n\n{original_string}'

    shared.processing_message = "*Is typing...*"
    return string

Upcycle-Electronics commented 1 year ago

@AndrewBenavides You will still have potential problems. I did some testing, and the real problem is that the SSML input text mode for Silero is not good to use. The extension should just use the regular voices available and send plain text strings; this would greatly simplify the extension and make it much easier to extend. I was working on this, but I can't get past the update now. The webui won't work: the update fails because I have a dozen modified files, and the Linux update script gives completely useless feedback and just fails. I had a script that made updates work before; now I can't even launch server.py. I'm probably done with this project. Rant over...

There are two primary speeds of speakers with SSML text in Silero: a fast group and a slow group. The difference in the maximum length of text each speaker can process is substantial. However, the maximum size that can be used without errors also depends on three additional factors: the speed, the pitch, and the actual text all change the maximum length that can be generated. I tested several speakers and found that there are two styles, or overall speeds, among the English speakers. You can actually hear the speed difference if you pay attention to the audio; some speakers sound faster at the same speed settings as others. I ended up comparing en_10 and en_43, with en_10 being the faster speaker with the longer context length. As an aside, the faster speakers also produce significantly smaller wave audio files.

It would be possible to adjust the maximum input text size using a dictionary and then break up the audio files accordingly, but every unique text has a very different maximum length, so you would need to compensate for the speaker, the pitch, and the speed, and then leave a large margin. I have attached a CSV file with results from testing across the entire range of speed and pitch. I used one file that just had "oh oh oh oh oh..." and another with the first 1500 words of "The Great Gatsby". I set the maximum length to test at 1315 for most of these. It is possible to go much longer, but there is a pytorch error that must be handled if the sentences are longer than 1000 characters. There also seems to be a 1000-character error that happens somewhat randomly if the input is longer than ~1150 for the faster models. Some configurations of the "oh" text with the faster speaker got past 1700 for the maximum length.

silero-tts-pitch-speed-text.csv
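A minimal sketch of that per-speaker dictionary idea (the numbers below are placeholders for illustration, not values taken from the CSV; a real table would need the speaker/pitch/speed margins described above):

# Placeholder limits only; measure per speaker/pitch/speed and keep a large margin.
SPEAKER_MAX_CHARS = {
    'en_10': 1100,  # faster speaker group, longer usable input (placeholder)
    'en_43': 700,   # slower speaker group, shorter usable input (placeholder)
}
DEFAULT_MAX_CHARS = 600
SAFETY_MARGIN = 0.8

def max_chunk_length(speaker):
    # Stay well under the observed failure point for the given speaker.
    return int(SPEAKER_MAX_CHARS.get(speaker, DEFAULT_MAX_CHARS) * SAFETY_MARGIN)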

I was working on a way to adapt this script based on the changes here: https://github.com/S-trace/silero_tts_standalone/blob/master/tts.py

I think the best way forward is to name the speakers, categorize them by accent, and skip the SSML text and all of the required formatting. This would make it much easier to create multiple speakers and join the audio together. This is what I really wanted to build: a scenario context with multiple additional character contexts that could be called on to speak in series, making chat more like an agent with multiple uses and specializations.
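As a rough illustration of the plain-text path (a sketch based on the upstream silero-models README, not on the model file bundled with the extension; the torch.hub call and apply_tts parameters may not match the packaged multi_acc_v3 model exactly):

import torch

# Sketch only: plain text in, audio tensor out, no SSML markup involved.
# Speaker 'en_10' is one of the "fast" voices mentioned above.
model, _ = torch.hub.load(repo_or_dir='snakers4/silero-models',
                          model='silero_tts',
                          language='en',
                          speaker='v3_en')

audio = model.apply_tts(text='Plain text goes straight to the model, no prosody tags.',
                        speaker='en_10',
                        sample_rate=48000)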

Sascha353 commented 1 year ago

Another solution would be to stream text to the TTS engine in chunks. I made a feature request for that: https://github.com/oobabooga/text-generation-webui/issues/4706

erew123 commented 11 months ago

On the TTS board, there is a suggestion that infinite streaming will be built into the engine. https://github.com/coqui-ai/TTS/discussions/3197#discussioncomment-7586607

Not sure if they mean the next major version or revision.

github-actions[bot] commented 10 months ago

This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.

devzzzero commented 5 months ago

Hi. I am running into this issue repeatedly. This is with ooga version 8f12fb028dff4e133460fe10ef49d3f90167b313 ("Merge pull request #5970 from oobabooga/dev").

@AndrewBenavides @Upcycle-Electronics

Thank you.

Quidam2k commented 5 months ago

I switched to AllTalk, which uses Coqui, and I've found it to be superior in nearly all ways. Not exactly a solution for you, but perhaps a workaround?

https://github.com/erew123/alltalk_tts

devzzzero commented 5 months ago

> I switched to AllTalk, which uses Coqui, and I've found it to be superior in nearly all ways. Not exactly a solution for you, but perhaps a workaround?
>
> https://github.com/erew123/alltalk_tts

Thank you for the reply. I'll give it a shot.

It works much better, indeed! Holy krap! James Earl Jones!