pndurette / gTTS

Python library and CLI tool to interface with Google Translate's text-to-speech API
http://gtts.readthedocs.org/
MIT License

No pauses if 100 characters limit #396

Closed: Rom888 closed this issue 1 month ago

Rom888 commented 1 year ago

Great project! Is it possible for gTTS to avoid audio pauses between tokens when those tokens were created only because of the 100-character limit?

pndurette commented 1 year ago

Hi @Rom888, that's a tough one! The upstream API will introduce a break after 100 characters, and there's no way to control this. That is why gTTS tries to pre-emptively split (tokenize) the text where pauses would naturally occur (e.g. at punctuation), which works pretty well most of the time. But if your input runs for more than 100 characters without any break that gTTS' tokenizer could split on, there will be a break no matter what.

Edit: So, if you control the input, your best bet is to introduce punctuation (commas, etc.).

Edit 2: Actually, I'm wondering if I understood your question correctly. Do you have an example where this occurs?
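To make the tokenizer/minimizer distinction in this thread concrete, here is a deliberately simplified Python sketch. It is not gTTS's implementation (the real tokenizer handles many more cases); the regex, `split_text`, and the `forced` flag are hypothetical. It splits on punctuation first, then cuts any piece still longer than 100 characters at the last space before the limit, flagging those cuts as the places where an unnatural pause may be heard.

```python
import re
from typing import List, Tuple

MAX_CHARS = 100  # the per-request limit discussed in this thread

# Hypothetical, simplified stand-in for a punctuation-based tokenizer:
# split after . , ; : ! ? followed by whitespace.
PUNCT_SPLIT = re.compile(r"(?<=[.,;:!?])\s+")


def split_text(text: str, limit: int = MAX_CHARS) -> List[Tuple[str, bool]]:
    """Return (chunk, forced) pairs.

    forced=True means the chunk ended only because of the length limit
    (the "minimizer" case), so the pause after it is not a natural one.
    """
    chunks: List[Tuple[str, bool]] = []
    for token in PUNCT_SPLIT.split(text.strip()):
        # Minimizer stage: cut overlong tokens at the last space within the limit.
        while len(token) > limit:
            cut = token.rfind(" ", 0, limit + 1)
            if cut <= 0:
                cut = limit  # no usable space: hard cut mid-word
            chunks.append((token[:cut].strip(), True))
            token = token[cut:].strip()
        if token:
            chunks.append((token, False))
    return chunks
```

Any chunk flagged `forced=True` marks a spot where adding punctuation to the input would give the split a natural place to land.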

Rom888 commented 1 year ago

Here is an example:

split-string-1 <split by tokenizer>
split-string-2 <split by tokenizer>
split-string-3 <split by minimizer (because larger than 100 characters)>
split-string-4 <split by tokenizer>

If I understand correctly, gTTS takes all the split strings, generates audio for each, then joins the audio fragments into one file and adds pauses between them.

Is it possible to not add pauses between audio fragments 3 and 4 when joining?

pndurette commented 1 year ago

@Rom888 Sorry for the delay!

What you said is almost correct. gTTS splits the text (where speech would typically pause), generates audio for each part, and puts the audio bits together. It doesn't add any breaks in the audio because it doesn't have to: the only break is the natural one between the end of one audio phrase and the start of the next.
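As a simplified model of that point (not gTTS's actual code; `join_fragments` is a hypothetical name), assembling the final file amounts to writing each chunk's MP3 bytes back to back. Nothing is inserted between fragments, so any pause a listener hears comes from the audio the API returned for each phrase.

```python
import io
from typing import Iterable


def join_fragments(fragments: Iterable[bytes]) -> bytes:
    """Write per-chunk MP3 fragments back to back with no padding or silence."""
    buf = io.BytesIO()
    for fragment in fragments:
        buf.write(fragment)  # plain concatenation, nothing added in between
    return buf.getvalue()
```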

So, to answer your question, it's not something we can easily control other than by changing the text that is sent, e.g. by adding some punctuation, to make it sound at least more natural.

Rom888 commented 1 year ago

Okay, do you think we could add an option to gtts-cli, for example --cut-if-minimizer=500ms, that cuts the end of an audio fragment if that fragment was produced because of the minimizer (the audio from split-string-3 in the example above)?

pndurette commented 1 year ago

Sorry for the delay. Hmm, that would be pretty hard; it's pretty much the same conclusion as what I wrote in https://github.com/pndurette/gTTS/issues/398#issuecomment-1491326989. This library has no knowledge of the data it gets (audio, words, timing); it just saves it to a file.
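To show what the proposed option would involve, here is a hypothetical Python sketch; gTTS has no such option. It assumes each fragment has already been decoded to mono 16-bit PCM samples (gTTS never decodes its MP3 output) and that the caller knows which fragments were cut by the minimizer. Rather than blindly cutting 500 ms, it trims at most that much trailing near-silence so speech at the end of a fragment isn't clipped. The names `join_pcm` and `trim_trailing_silence` and the silence threshold of 500 are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def trim_trailing_silence(samples: Sequence[int], sample_rate: int,
                          max_trim_ms: int, threshold: int = 500) -> List[int]:
    """Drop up to max_trim_ms of near-silent samples from the end of a fragment.

    threshold is an assumed amplitude (16-bit PCM scale) below which a sample
    counts as silence.
    """
    max_trim = sample_rate * max_trim_ms // 1000
    end = len(samples)
    # Walk backwards while the signal stays quiet and the trim budget remains.
    while (end > 0 and len(samples) - end < max_trim
           and abs(samples[end - 1]) < threshold):
        end -= 1
    return list(samples[:end])


def join_pcm(fragments: Sequence[Tuple[Sequence[int], bool]], sample_rate: int,
             cut_if_minimizer_ms: int = 500) -> List[int]:
    """Concatenate (samples, forced) fragments.

    Only fragments flagged forced (cut by the length limit) get their trailing
    silence trimmed; the last fragment is left alone because the pause after
    it is not between two fragments.
    """
    out: List[int] = []
    for i, (samples, forced) in enumerate(fragments):
        is_last = i == len(fragments) - 1
        if forced and not is_last:
            samples = trim_trailing_silence(samples, sample_rate,
                                            cut_if_minimizer_ms)
        out.extend(samples)
    return out
```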

keisanng commented 3 months ago

If there are consistent pauses, you could post-process the generated audio with MoviePy or FFmpeg to trim them off.
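For the FFmpeg route, a minimal Python sketch might shell out to FFmpeg's `silenceremove` audio filter. It assumes the `ffmpeg` binary is on `PATH`; the function names, the 0.3 s minimum silence length, and the -40 dB threshold are illustrative assumptions to tune by ear. With `stop_periods=-1`, the filter removes silent stretches longer than `stop_duration` anywhere in the file, not only at the end, so shorter natural pauses should survive.

```python
import subprocess
from typing import List


def silenceremove_cmd(src: str, dst: str, min_silence_s: float = 0.3,
                      threshold_db: int = -40) -> List[str]:
    """Build an FFmpeg command that removes silences longer than min_silence_s."""
    af = (f"silenceremove=stop_periods=-1"
          f":stop_duration={min_silence_s}"
          f":stop_threshold={threshold_db}dB")
    return ["ffmpeg", "-y", "-i", src, "-af", af, dst]


def trim_pauses(src: str, dst: str, **kwargs) -> None:
    """Run FFmpeg on src, writing dst; raises CalledProcessError on failure."""
    subprocess.run(silenceremove_cmd(src, dst, **kwargs), check=True)
```

For example, `trim_pauses("hello.mp3", "hello_trimmed.mp3", min_silence_s=0.5)` would keep pauses shorter than half a second.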