p0n1 / epub_to_audiobook

EPUB to audiobook converter, optimized for Audiobookshelf
MIT License

No breaks added between paragraphs that don't end with punctuation #43

Open briankendall opened 7 months ago

briankendall commented 7 months ago

When using Edge TTS to read an EPUB-formatted book, paragraphs like the following aren't read correctly:

Chapter One

The Chapter Title

This is the first sentence of the chapter.

because it will be read as "Chapter one the chapter title this is the first sentence of the chapter", as though it's all one sentence with no breaks. This can be especially confusing when a heading appears as its own paragraph in the middle of a chapter, something like:

... This is the final sentence of a paragraph.

The Next Section

Here is another sentence.

since it'll be read as "The next section here is another sentence", making it easy to miss that the first half of that sentence was supposed to be a header.

I looked in the source code and the trouble seems to come from epub_book_parser.py, where the second text cleaning step replaces every run of whitespace (including newlines) with a single space. So this might affect Azure and OpenAI TTS as well, but I haven't tested it.
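As a rough illustration of that effect (a sketch, not the project's actual code): the snippet below assumes the cleaning step behaves like a single `re.sub(r"\s+", " ", ...)` call, and both function names are hypothetical. The second function shows one way to collapse whitespace inside each paragraph while keeping a newline between paragraphs.

```python
import re


def collapse_whitespace(text: str) -> str:
    # Assumed behavior of the cleaning step: every run of whitespace,
    # newlines included, becomes one space, so paragraph breaks vanish.
    return re.sub(r"\s+", " ", text).strip()


def collapse_keep_paragraphs(text: str) -> str:
    # Hypothetical alternative: clean each line on its own, drop empty
    # lines, and keep exactly one newline between the paragraphs left.
    cleaned = (re.sub(r"\s+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in cleaned if line)


sample = "Chapter One\n\nThe Chapter Title\n\nThis is the first sentence of the chapter."
print(collapse_whitespace(sample))
print(collapse_keep_paragraphs(sample))
```

With the first function the heading and the first sentence end up on one line with nothing separating them, which matches the run-on reading described above.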

At least in the case of Edge TTS, though, it's not sufficient to simply keep a newline in there, because it appears that the edge_tts module automatically replaces newlines with spaces as well. So I think the solution for it needs to include inserting periods where needed.
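A minimal sketch of the period-insertion idea, assuming the parser already has the paragraphs as separate strings. The helper names and the set of characters treated as sentence-ending are assumptions for illustration, not anything taken from the project.

```python
# Characters that already end a sentence (an assumed, non-exhaustive set).
TERMINATORS = ".!?:;…"
# Closing quotes and brackets that may follow the real final punctuation.
CLOSERS = "\"'”’)]"


def end_with_punctuation(paragraph: str) -> str:
    # Append a period when the paragraph does not already end a sentence.
    stripped = paragraph.strip()
    if not stripped:
        return stripped
    # Look past trailing quotes/brackets, so 'He said "Stop."' is left alone.
    core = stripped.rstrip(CLOSERS)
    if core and core[-1] in TERMINATORS:
        return stripped
    return stripped + "."


def join_paragraphs(paragraphs):
    # Join with spaces, mirroring what happens once newlines are replaced.
    fixed = (end_with_punctuation(p) for p in paragraphs)
    return " ".join(p for p in fixed if p)


print(join_paragraphs([
    "Chapter One",
    "The Chapter Title",
    "This is the first sentence of the chapter.",
]))
```

Because the break is carried by punctuation rather than by a newline, it should survive even if the TTS layer turns newlines into spaces.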

An even better solution for Edge TTS would be to insert longer pauses between such paragraphs, though since Microsoft prevents using SSML, it would require using something like this.

DavidAccola commented 7 months ago

+1. I came here to report this same issue. I am using OpenAI TTS.