SSML fails a shocking 22% of the time

kylefoley76 commented 3 years ago

I have already posted this problem on stackexchange but I seriously doubt anyone will solve it there.

Environment details

OS type and version: Mac OS 10.14.4
Python version: 3.8.0
pip version: 20.1.1 (but I downloaded texttospeech a long time ago)
google-cloud-texttospeech version: 2.2.0

I've had this problem since I began using Google's text to speech, and now I'm determined to fix it. 22% of the time, the SSML will not work and a text will be rendered without pauses, for no reason that I'm aware of. I really wish Google would just put the pauses in automatically for me. An audio text without pauses is virtually unlistenable. In short, the program will ignore the SSML break syntax.

But it will only do this for some of the texts. I should also add that I divide the text up into chunks of, I think, 3000 characters. The software will either obey all of the break times for a chunk or, 22% of the time, ignore all of the break times for that chunk.

Also, it is not entirely deterministic that a given text will result in an audio without pauses. However, there is a very strong probability that a text which failed once will fail again.

I redid the first 100 chunks of text. A shocking 22% of them failed the first time. The second time I did them, 4 which previously failed succeeded and 2 which previously succeeded failed. So it cannot be the case that it is the text itself which is causing the failure, since the exact same text will sometimes fail and sometimes succeed.

The exact text I converted into audio is located here:

problematic texts

Each text is preceded by a number surrounded by __ . On both occasions the following chunks failed both times:

16 41 46 58 59 61 65 74 80 85 86 87 90 91 92 94 95 96 97 98

The following chunks failed once out of two tries

40 45 47 81 92 89

I also put the code to sleep for 3 seconds between each transcription. I've also tried 7 seconds which yielded no differences.

Here is the code I'm using:

import os

from google.cloud import texttospeech

# Path to the service-account credentials file
str1 = 'my_credentials.json'
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = str1
client = texttospeech.TextToSpeechClient()
# txt1 holds the SSML for one chunk (defined elsewhere in my program)
input_text = texttospeech.SynthesisInput(ssml=txt1)
voice = texttospeech.VoiceSelectionParams(
    language_code='en-US',
    name='en-US-Wavenet-C',
    ssml_gender=texttospeech.SsmlVoiceGender.FEMALE,
)

audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3,
    speaking_rate=0.85,
)

response = client.synthesize_speech(input=input_text,
                                    voice=voice,
                                    audio_config=audio_config)
# self.folder and idx come from the surrounding class and loop (not shown)
with open(f'{self.folder}{idx}.mp3', 'wb') as out:
    out.write(response.audio_content)

munkhuushmgl commented 3 years ago

I have sent a request to view the text file you have shared.

kylefoley76 commented 3 years ago

I meant to set it to 'anyone with link' can read when I uploaded. That is done now.

munkhuushmgl commented 3 years ago

@kylefoley76 Are you processing the provided text file as one big input, or treating each numbered text as one input?

On my end I used one line as one file input, and it looks like the Google TTS API is skipping <break time="20s">

Is that what you are referring to (meaning not pausing)?

kylefoley76 commented 3 years ago

This question is answered in the OP when I wrote

Each text is preceded by a number surrounded by __ . On both occasions the following chunks failed both times:

16 41 46 58 59 61 65 74 80 85 86 87 90 91 92 94 95 96 97 98

The following chunks failed once out of two tries

40 45 47 81 92 89

The Google TTS API skipping <break time="20s"> is what I am referring to. Do you know how to solve this problem?

munkhuushmgl commented 3 years ago

Looks like it is a known issue; I am investigating. I will update this thread once I get any update.

kylefoley76 commented 3 years ago

Please do this. This problem is so serious that I will have to shop around for a new client if it cannot be solved.

kylefoley76 commented 3 years ago

The software is basically unusable if it cannot be solved, for books at least. You can use it for little things but I want it for converting books. A 22% failure rate is not acceptable and listening to text without pauses is not possible.

kylefoley76 commented 3 years ago

I have a term paper to write and a deadline. I need to convert this book to an acceptable audio file within 3 days. If the bug cannot be fixed within 3 days then I'm moving over to Amazon or Microsoft.

munkhuushmgl commented 3 years ago

@kylefoley76 I was able to get <break time=20s> working for me.

The problem was that using the following param causes the API to skip break tags

    name='en-US-Wavenet-C',

Here is the sample that I took from your input, and the snippet I customized from your code.

code:

from google.cloud import texttospeech
client = texttospeech.TextToSpeechClient()
with open("resources/test.ssml", "r") as f:
    ssml = f.read()
    input_text = texttospeech.SynthesisInput(ssml=ssml)

voice = texttospeech.VoiceSelectionParams(
    language_code='en-US',
    ssml_gender=texttospeech.SsmlVoiceGender.FEMALE,
)

audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3,
    )

response = client.synthesize_speech(input=input_text,
                                    voice=voice,
                                    audio_config=audio_config)
# with open(f'{self.folder}{idx}.mp3', 'wb') as out:
#     out.write(response.audio_content)
#     # The response's audio_content is binary.
with open("rerpo.mp3", "wb") as out:
    out.write(response.audio_content)
    print('Audio content written to file "rerpo.mp3"')

SSML file used: (I have deleted and 0___ to get clear output)

https://docs.google.com/document/d/1NcwlMooH4uDJsLTtPBwFgo5BI5I4jiF8VdfP9BwGae4/edit?usp=sharing

kylefoley76 commented 3 years ago

No, that didn't do it. I think you misread my post. Your doc refers to chunk 0. Chunk 0 is working fine. It's chunks 16, 40, 41, 56, 58 etc. that are not working, and as I already said, some chunks work 50% of the time. You have to test those chunks. I posted all chunks from 0 - 100 to demonstrate that the problem does not appear to be a problem with the text, since I see no difference between the working chunks and the faulty chunks.

kylefoley76 commented 3 years ago

Let me repeat what I wrote before

Each text is preceded by a number surrounded by __ . On both occasions the following chunks failed both times:

16 41 46 58 59 61 65 74 80 85 86 87 90 91 92 94 95 96 97 98

The following chunks failed once out of two tries

40 45 47 81 92 89

munkhuushmgl commented 3 years ago

@kylefoley76 I ran the following text (chunk 40) 3 times

 <speak> on the night in 1814 when he declared his love for her<break time="10s"/> By 1809<break time="0.4s"/> when Shelley was 17<break time="0.4s"/> the friendship with Lind had brought out the first clear marks of dawning intellectual maturity<break time="0.4s"/> and Eton had begun to smooth some of its expensive social polish over the disturbed and volatile personality beneath<break time="0.8s"/> His sisters remembered their ‘silent<break time="0.4s"/> though excessive’ admiration as their elder brother stood in beautifully fitting silk pantaloons warming his coat-tails in front of the massive fire at Field Place<break time="0.8s"/> To an Eton friend he wrote nonchalantly of shooting at thousands of wild ducks and geese ‘in our River and Lake’ all day<break time="0.4s"/> and reading novels and romances all night<break time="0.8s"/> Tom Medwin saw him down three snipe in three successive shots at the end of the pond<break time="0.8s"/> In another letter he issued an invitation in the Eton style of the day<break time="0.4s"/> ‘I hope we shall have the Pleasure of your Company at Field Place at Easter<break time="0.4s"/> & that you will conjointly with Il Padre & myself esclipse the Beau’s & Belles of the Horsham Ball<break time="0.8s"/> O how I wish you were here to enliven our Provincial Stupidity & how I regret the Frost — I am your affectionate friend<break time="0.8s"/>’ It is notable that his father is still referred to in a markedly amiable light<break time="0.8s"/> A new attraction who entered briefly but intensely into his life in 1809 was his beautiful cousin<break time="0.4s"/> Harriet Grove<break time="0.4s"/> who came to stay with the rest of her family at Field Place during the spring<break time="0.8s"/> His new-found appetite for postal debates had led him to write to his cousin as well<break time="0.4s"/> and in her diary between January and April 1809 she recorded the receipt of weekly letters<break time="0.8s"/> The friendship began on paper 
before it began in fact<break time="0.4s"/> and it was to retain a novelettish quality throughout the next eighteen months<break time="0.8s"/> Shelley first properly met Harriet in April<break time="0.4s"/> at a time when he described himself immersed in solitude at Field Place<break time="0.4s"/> having ‘no Employment<break time="0.4s"/> except writing Novels & Letters’<break time="0.8s"/> His youngest sisters were now themselves away at school<break time="0.8s"/> The descent of the Grove family for several days cheered him up and excited him<break time="0.4s"/> an inseparable romantic foursome was formed of Shelley<break time="0.4s"/> Harriet<break time="0.4s"/> her closest brother Charles and Elizabeth Shelley<break time="0.8s"/> There were moonlit walks to Strood — Shelley’s old favourite — and twilit rambles round St Irving’s<break time="0.4s"/> a beautiful Elizabethan manor house near Horsham with gardens and fountains laid out by Capability Brown<break time="0.8s"/> Shelley<break time="0.4s"/> Harriet and Elizabeth planned poems and novels together<break time="0.4s"/> </speak>

munkhuushmgl commented 3 years ago

@kylefoley76

Looks like your SSML contains the & symbol.

If it contains this symbol, all the breaks are skipped. If you remove & from the chunk 40 text, it works. I ran it 6 times to be exact.
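
A bare & is not valid in XML, so one way to find the affected chunks before sending them is to check whether each one parses as well-formed XML. This is only a minimal sketch: the function name find_malformed_chunks and the two sample chunks are made up for illustration, and it is an assumption that the other failing chunks share the same cause as chunk 40.

```python
import xml.etree.ElementTree as ET


def find_malformed_chunks(chunks):
    """Return the indices of chunks whose SSML is not well-formed XML."""
    bad = []
    for idx, ssml in enumerate(chunks):
        try:
            ET.fromstring(ssml)
        except ET.ParseError:
            # A bare '&' (among other things) triggers a parse error here.
            bad.append(idx)
    return bad


# Hypothetical sample data: the second chunk contains a bare '&'.
chunks = [
    '<speak>Harriet Grove<break time="0.4s"/> who came to stay</speak>',
    '<speak>writing Novels & Letters<break time="0.8s"/></speak>',
]
print(find_malformed_chunks(chunks))  # prints [1]
```

A chunk that passes this check can still be rejected or mishandled by the API for other reasons; the check only catches XML syntax problems.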

Reference:

https://github.com/googleapis/python-texttospeech/blob/420eb87aa22756376a14f46d579db1b518f615da/samples/snippets/ssml_addresses.py#L89-L93
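
If the break tags are already in the text, escaping the whole string would also escape the tags themselves. A minimal sketch of an alternative, assuming the only problem character is a bare &: replace each & that is not already the start of an entity with &amp;. The helper name escape_bare_ampersands is hypothetical and not part of the google-cloud-texttospeech library.

```python
import re

# Match '&' unless it already starts an entity such as &amp; or &#8217;
_BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(ssml):
    """Replace bare '&' with '&amp;' and leave tags and entities untouched."""
    return _BARE_AMP.sub("&amp;", ssml)


fixed = escape_bare_ampersands(
    '<speak>writing Novels & Letters<break time="0.8s"/></speak>'
)
print(fixed)
```

When the plain text is still available before any tags are inserted, escaping it first with html.escape from the standard library and adding the break tags afterwards is the simpler route.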

kylefoley76 commented 3 years ago

Thanks that did it.

googleapis / python-texttospeech

SSML fails a shocking 22% of the time #100
