2kbits closed this issue 10 months ago.
I don't think this is necessarily a bug; it happens to me too, and I think it's just an intrinsic property of the extension.
I'm not entirely sure how it works, as I'm still new to working with public projects, but I was hoping we could break the text into sizeable chunks for the extension instead of letting it break randomly. I'm not sure how feasible that is, though.
I'm half tempted to throw the extension's code into GPT and see if it has any ideas on how to modify it to generate the speech in batches every time it sees a newline character, but I'd love it if someone beat me to it.
Here is a hack to use in the interim (just replace the `output_modifier` method in script.py with this one). I arbitrarily check the raw string length; if it is too long, I split the output string into sentences, cut the output off at an arbitrary length, and drop the partial sentence that the cut intersects before rebuilding the string at the smaller length. The truncated portion is printed to the terminal (minus the part of the split sentence that came before the cut). This only affects the Silero audio, so you'll still see the whole text output in the WebUI.

Ideally, this link has a better potential fix (https://github.com/snakers4/silero-models/pull/174), but I still haven't been able to get it fully integrated. I haven't managed to save the output from the Textgen Silero extension as a tensor so it can be passed to the methods in the BigTextToAudio class for torch.cat. A couple of weeks ago I had something working standalone that used a temp file on the system to generate a series of wave files and concatenate them afterwards, but I can't seem to find where I put it, and my sloppy mess wasn't portable or worth sharing. I'm just giving ideas. The hack below could be done better, but it works to stop the freezing issue.
```python
def output_modifier(string, state):
    global model, current_params, streaming_state

    for i in params:
        if params[i] != current_params[i]:
            model = load_model()
            current_params = params.copy()
            break

    if not params['activate']:
        return string

    original_string = string
    string = tts_preprocessor.preprocess(html.unescape(string))

    if string == '':
        string = '*Empty reply, try regenerating*'
    # elif len(string) > 800:
    #     print("Debug Test 1: calling generate_audio(string)")
    #     generate_audio(string)
    else:
        output_file = Path(f'extensions/silero_tts/outputs/{state["character_menu"]}_{int(time.time())}.wav')
        prosody = '<prosody rate="{}" pitch="{}">'.format(params['voice_speed'], params['voice_pitch'])
        silero_input = f'<speak>{prosody}{xmlesc(string)}</prosody></speak>'
        if len(silero_input) > 900:  # added the following few lines to truncate long failing outputs
            print("truncating the following + the last split sentence = ", silero_input[900:])
            sentences = silero_input[:900].rstrip().split('.')
            silero_input = ''.join([sentence + '.' for sentence in sentences[:-1]]) + '</prosody></speak>'
        model.save_wav(ssml_text=silero_input, speaker=params['speaker'], sample_rate=int(params['sample_rate']), audio_path=str(output_file))

        autoplay = 'autoplay' if params['autoplay'] else ''
        string = f'<audio src="file/{output_file.as_posix()}" controls {autoplay}></audio>'

    if params['show_text']:
        string += f'\n\n{original_string}'

    shared.processing_message = "*Is typing...*"
    return string
```
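The sentence-boundary truncation logic above could also be pulled out into a standalone helper so it can be tested on its own; here is a minimal sketch (the function name is my own, not part of the extension):

```python
def truncate_at_sentence(text, max_len=900):
    """Cut text to at most max_len characters, dropping the trailing
    partial sentence so the result ends on a sentence boundary."""
    if len(text) <= max_len:
        return text
    sentences = text[:max_len].rstrip().split('.')
    # Drop the final (partial) sentence and restore the periods.
    return ''.join(s + '.' for s in sentences[:-1])

print(truncate_at_sentence('One. Two. Three and more words here.', 12))  # → 'One. Two.'
```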
@Upcycle-Electronics I hope you don't mind, but I fiddled around with your hack and was able to generate slices of the original message and concatenate the wave files together into one complete output. I shortened the slice length to 800 characters because I was still occasionally hitting 'too long' errors at 900. Also, apologies if my code seems janky -- Python isn't my primary language and I didn't have a proper debug/test cycle set up.
```python
import wave

def output_modifier(string, state):
    global model, current_params, streaming_state

    for i in params:
        if params[i] != current_params[i]:
            model = load_model()
            current_params = params.copy()
            break

    if not params['activate']:
        return string

    original_string = string
    string = tts_preprocessor.preprocess(html.unescape(string))

    if string == '':
        string = '*Empty reply, try regenerating*'
    else:
        xmlesc_string = xmlesc(string)
        prosody = '<prosody rate="{}" pitch="{}">'.format(params['voice_speed'], params['voice_pitch'])
        output_file = Path(f'extensions/silero_tts/outputs/{state["character_menu"]}_{int(time.time())}.wav')

        wave_data = []
        slice_len = 800
        while len(xmlesc_string) > 0:
            silero_open_tags = f'<speak>{prosody}'
            silero_close_tags = '</prosody></speak>'
            silero_input = f'{silero_open_tags}{xmlesc_string}{silero_close_tags}'
            if len(silero_input) > slice_len:
                print("truncating the following + the last split sentence =", silero_input[slice_len:])
                sentences = silero_input[:slice_len].rstrip().split('.')
                silero_input = ''.join([sentence + '.' for sentence in sentences[:-1]]) + '</prosody></speak>'
            string_len = len(silero_input) - len(silero_open_tags) - len(silero_close_tags)
            xmlesc_string = xmlesc_string[string_len:]

            model.save_wav(ssml_text=silero_input, speaker=params['speaker'], sample_rate=int(params['sample_rate']), audio_path=str(output_file))
            with wave.open(str(output_file), 'rb') as w:
                wave_data.append([w.getparams(), w.readframes(w.getnframes())])
            Path.unlink(output_file)

        with wave.open(str(output_file), 'wb') as output:
            output.setparams(wave_data[0][0])
            for i in range(len(wave_data)):
                output.writeframes(wave_data[i][1])

        autoplay = 'autoplay' if params['autoplay'] else ''
        string = f'<audio src="file/{output_file.as_posix()}" controls {autoplay}></audio>'

    if params['show_text']:
        string += f'\n\n{original_string}'

    shared.processing_message = "*Is typing...*"
    return string
```
@AndrewBenavides You will still have potential problems. I did some testing, and the real problem is that Silero's SSML input mode is not good to use. The extension should just use the regular voices and send plain text strings; this would greatly simplify the extension and make it much easier to extend. I was working on this, but I can't get past the update now. The webui won't work: the update fails because I have a dozen modified files, and the Linux update script gives completely useless feedback and just fails. I had a script that made updates work before; now I can't even launch server.py. I'm probably done with this project. Rant over...
With SSML text, Silero's speakers fall into two primary speed groups: a fast group and a slow group. The difference in the maximum length of text each speaker can process is substantial. However, the maximum size that can be used without errors depends on three additional factors: the speed, the pitch, and the actual text all change the maximum length that can be generated. I tested several speakers and found that there are two styles, or overall speeds, among the English speakers. You can actually hear the difference if you pay attention to the audio: some speakers sound faster than others at the same speed settings. I ended up comparing en_10 and en_43, with en_10 being the faster speaker with the longer usable input length. As an aside, the faster speakers also output significantly smaller wave files.

It would be possible to adjust the maximum input text size using a dictionary and then break up the audio files accordingly, but every unique text has a very different maximum length, so you would need to compensate for the speaker, the pitch, and the speed, and then leave a large margin.

I have attached a CSV file with results from testing across the entire range of speed and pitch. I used one file that just contained "oh oh oh oh oh..." and another with the first 1500 words of "The Great Gatsby." I set the maximum length to test at 1315 for most of these. It is possible to go much longer, but there is a PyTorch error that must be handled if a sentence is longer than 1000 characters. There also seems to be a 1000-character error that happens somewhat randomly when the input is longer than ~1150 for the faster models. Some configurations of the "oh" text with the faster speaker got past 1700 for the maximum length.
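If someone wants to try the dictionary approach, a minimal sketch might look like this (the speaker IDs are from my testing, but the length values here are placeholders to calibrate from the CSV, not measured results):

```python
# Illustrative per-speaker safe input lengths. The numbers are placeholders,
# not measured values -- calibrate them against your own test results.
SAFE_LENGTHS = {'en_10': 1100, 'en_43': 700}
DEFAULT_SAFE_LENGTH = 600  # conservative fallback for untested speakers

def chunk_limit(speaker, speed=1.0, pitch=1.0, margin=0.8):
    """Pick a chunk size for a speaker, scaled down by a safety margin.
    Slower speech effectively shortens the usable input, so scale by speed."""
    base = SAFE_LENGTHS.get(speaker, DEFAULT_SAFE_LENGTH)
    return int(base * min(speed, 1.0) * margin)
```

The idea is simply to look up a per-speaker ceiling and always leave the large margin mentioned above, since the actual limit shifts with the text itself.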
silero-tts-pitch-speed-text.csv
I was working on a way to adapt this script but the changes here... https://github.com/S-trace/silero_tts_standalone/blob/master/tts.py
I think the best way forward is to name the speakers and categorize them by accents and skip the SSML text and all of the required formatting. This would make it much easier to create multiple speakers and join the audio together. This is what I really wanted to make. I wanted a scenario context with multiple additional character contexts that could get called and speak in series to make chat more like an agent that can have multiple uses and specializations.
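A sketch of what such a registry might look like (the friendly names, speaker IDs, and accent tags here are hypothetical examples, not an actual mapping):

```python
# Hypothetical speaker registry: friendly character names mapped to Silero
# speaker IDs plus an accent tag, so voices can be picked by role or accent.
SPEAKERS = {
    'narrator': {'id': 'en_10', 'accent': 'american'},
    'butler':   {'id': 'en_43', 'accent': 'british'},
}

def voices_by_accent(accent):
    """Return the friendly names of all registered speakers with an accent."""
    return [name for name, s in SPEAKERS.items() if s['accent'] == accent]
```

Each character context could then carry a friendly voice name, and the extension would resolve it to a plain-text speaker without touching SSML.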
Another solution would be to stream text to the TTS engine in chunks. I made a feature request for that: https://github.com/oobabooga/text-generation-webui/issues/4706
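A chunked-streaming approach could be sketched roughly like this: buffer the streamed fragments and hand a chunk to the TTS engine whenever a sentence boundary appears past some minimum length (the threshold is arbitrary):

```python
def stream_chunks(fragment_iter, min_len=200):
    """Accumulate streamed text fragments and yield a chunk each time a
    sentence boundary appears after at least min_len characters."""
    buf = ''
    for fragment in fragment_iter:
        buf += fragment
        if len(buf) >= min_len and buf.rstrip().endswith(('.', '!', '?')):
            yield buf
            buf = ''
    if buf:  # flush whatever is left when the stream ends
        yield buf
```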
On the TTS board, there is a suggestion that infinite streaming will be built into the engine. https://github.com/coqui-ai/TTS/discussions/3197#discussioncomment-7586607
Not sure if they mean the next major version or revision.
This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.
Hi. I am running into this issue repeatedly; this is with oobabooga at commit 8f12fb028dff4e133460fe10ef49d3f90167b313:
Merge pull request #5970 from oobabooga/dev
@AndrewBenavides @Upcycle-Electronics
Thank you.
I switched to AllTalk, which uses Coqui, and I've found it to be superior in nearly all ways. Not exactly a solution for you, but perhaps a workaround?
Thank you for the reply. I'll give it a shot.
It works much better, indeed! Holy krap! James Earl Jones!
Describe the bug
While using the Silero TTS extension, I encountered an error when providing long text inputs. The model seems to have a limitation on the length of the input text it can handle.
Expected Behavior: The extension should be able to handle longer text (maybe break it up into batches?) and generate corresponding audio without any errors.
Actual Behavior: Received an error indicating that the text string is longer than 1000 symbols and the chat output is stuck at "Is recording a voice message..."
Possible Solution: Maybe not the best approach but one way could be to break the input text into smaller chunks, process each chunk with Silero, and then concatenate the audio outputs.
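That chunk-and-concatenate idea could be sketched like this, with a stand-in `synth` callable in place of the real Silero call (names here are illustrative, not the extension's API):

```python
def synthesize_long(text, synth, max_chunk=800):
    """Split text into pieces of at most max_chunk characters, cutting at
    sentence boundaries where possible, run each piece through synth (a
    stand-in for the real TTS call), and concatenate the audio samples."""
    audio = []
    while text:
        piece = text[:max_chunk]
        if len(text) > max_chunk:
            cut = piece.rfind('.') + 1  # back up to the last sentence end
            if cut > 0:
                piece = piece[:cut]
        audio.extend(synth(piece))
        text = text[len(piece):].lstrip()
    return audio
```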
Additional Information:
Would appreciate any guidance or fixes for this issue. Thank you!
Is there an existing issue for this?
Reproduction
Steps to Reproduce:
Logs
System Info