p0n1 / epub_to_audiobook

EPUB to audiobook converter, optimized for Audiobookshelf
MIT License

Parsing issue in edge_tts #79

Open rhsanborn opened 4 months ago

rhsanborn commented 4 months ago

File: tts_providers/edge_tts_provider.py

I was processing a file and ran into a weird error: ValueError: invalid literal for int() with base 10: 'A bunch of book text....'

I tracked it down to line 57 in edge_tts_provider.py. In essence, once it hits the first pause, it asks for every chunk, "is there a close bracket in this chunk?" If there is, it assumes the text preceding the bracket is the pause time. That works for lots of text, but if your book has brackets in it, you end up stumbling on random brackets that are not associated with a pause time. With my change, it now uses a regex that matches the beginning of the string, then a run of digits, followed by a close bracket.
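To make the failure concrete, here is a minimal reproduction sketch. The `[pause:` marker, the function name `naive_parse`, and the sample text are assumptions for illustration and are not copied from edge_tts_provider.py; only the "any close bracket means a pause time" check mirrors the behavior described above.

```python
def naive_parse(text, marker="[pause:"):
    """Split on the (assumed) pause marker; treat any ']' as the end of a pause time."""
    for part in text.split(marker):
        if "]" in part:
            # Everything before the first "]" is assumed to be the pause time.
            pause_time, content = part.split("]", 1)
            yield int(pause_time), content.strip()
        else:
            yield 0, part.strip()


# Book text that contains its own brackets before the first pause marker.
sample = "He said [sic] hello. [pause:500]Next paragraph."

try:
    list(naive_parse(sample))
    error = None
except ValueError as exc:
    # int() was handed book text instead of digits.
    error = str(exc)
```

The first chunk contains the bracket from `[sic]`, so the book text before it is passed to `int()`, which raises the same kind of ValueError as in the report.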

Here's the fix I put in; it requires importing re:

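    for part in parts:
        # Previously: if "]" in part:
        # Only treat the part as a pause when it starts with digits followed by "]".
        # \d+ (rather than \d*) requires at least one digit, so int() never sees "".
        if re.search(r'^\d+\]', part):
            pause_time, content = part.split("]", 1)
            yield int(pause_time), content.strip()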

I'm not sure if this fixes all edge cases, but it got me past this one.
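For anyone who wants to try the anchored-regex idea in isolation, here is a self-contained sketch. It is not the project's code: the `[pause:` marker, the names `PAUSE_PREFIX` and `parse_parts`, and the fallback of yielding a pause of 0 for chunks without a pause prefix are all assumptions. It requires at least one digit (`\d+`) so an empty string can never reach `int()`, and it tolerates optional whitespace after the marker.

```python
import re

# Optional whitespace, one or more digits, then the closing bracket,
# anchored at the start of the part because match() only matches there.
PAUSE_PREFIX = re.compile(r"\s*(\d+)\]")


def parse_parts(text, marker="[pause:"):
    """Yield (pause_time, content) pairs; the marker format is an assumption."""
    for part in text.split(marker):
        match = PAUSE_PREFIX.match(part)
        if match:
            # Use the captured digits as the pause and the rest as content.
            yield int(match.group(1)), part[match.end():].strip()
        else:
            # No pause prefix: keep the text as-is, brackets included.
            yield 0, part.strip()


result = list(parse_parts("He said [sic] hello. [pause:500]Next [1] paragraph."))
```

Brackets that belong to the book text, such as `[sic]` and `[1]` here, are left in the content, and only the digits directly after the marker are parsed as a pause time.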

Thanks for the awesome tool. I'm super psyched to have the Edge TTS and not be paying the equivalent of a produced ebook for the Azure credits!!

p0n1 commented 4 months ago

Thanks for pointing this out and sharing your fix. Yeah, there was a bug, so I tried to fix it in https://github.com/p0n1/epub_to_audiobook/pull/71. Not sure if you're using the latest code. Let me know if the latest version solves your issue.