rany2 / edge-tts

Use Microsoft Edge's online text-to-speech service from Python WITHOUT needing Microsoft Edge or Windows or an API key
https://pypi.org/project/edge-tts/
GNU General Public License v3.0

Long text strings produce incomplete audio files #190

Closed briankendall closed 4 months ago

briankendall commented 7 months ago

I'm trying to use edge-tts to convert a chapter of a book into an audiobook. It's about 39k characters and around 7500 words. When I run it through edge-tts, the resulting audio file is often incomplete. At what point in the text it cuts off seems to be inconsistent and arbitrary, and every now and then it successfully produces audio for the entire text.

Any idea what's going wrong? Is this even a use case that's expected to work? (I wonder if Microsoft is limiting how much audio it'll generate for one request.)

rany2 commented 7 months ago

I think it's related to an issue I started encountering a month ago, where the service randomly stops responding with audio data. It's a problem I've observed in the Edge browser as well.

I'm not sure how best to work around this, but an obvious naive solution would be to retry a few times before accepting that the current split SSML doesn't have any audio data. Right now, if even one of the split texts returns audio data, it doesn't raise an exception and considers it a success.
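A minimal sketch of the naive retry idea described above, not the library's actual code. `synthesize` is a hypothetical callable standing in for whatever sends one split text to the service and returns the audio bytes; `retry_until_audio` and `NoAudioReceived` are illustrative names too.

```python
from typing import Callable


class NoAudioReceived(Exception):
    """Raised when every attempt for a chunk came back without audio."""


def retry_until_audio(
    synthesize: Callable[[str], bytes],  # hypothetical: one chunk in, audio bytes out
    chunk: str,
    max_attempts: int = 3,
) -> bytes:
    """Call synthesize(chunk) until it returns audio or the attempts run out."""
    for _attempt in range(max_attempts):
        audio = synthesize(chunk)
        if audio:  # empty bytes: the service answered without any audio data
            return audio
    raise NoAudioReceived(f"no audio after {max_attempts} attempts")
```

Note that this only catches a chunk that returns nothing at all; a chunk that returns only part of its audio would still pass the check.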

rany2 commented 7 months ago

Do you have any luck with the latest release (6.1.10)?

rany2 commented 7 months ago

Seems slightly better, but no luck. For context, I ran a generation 6 times on the a.txt file; there's about a megabyte or so missing in two of those files...

    ➜  edge-tts git:(master) ✗ wc a.txt
      1738  40238 269540 a.txt
    ➜  edge-tts git:(master) ✗ ls -lh *.mp3
    -rw-r--r-- 1 user user 87M Feb 17 00:02 a.mp3
    -rw-r--r-- 1 user user 87M Feb 16 23:55 b.mp3
    -rw-r--r-- 1 user user 85M Feb 17 00:03 c.mp3
    -rw-r--r-- 1 user user 86M Feb 17 00:03 d.mp3
    -rw-r--r-- 1 user user 87M Feb 17 00:02 e.mp3
    -rw-r--r-- 1 user user 87M Feb 16 23:55 f.mp3
    ➜  edge-tts git:(master) ✗ ls -l *.mp3
    -rw-r--r-- 1 user user 90403632 Feb 17 00:02 a.mp3
    -rw-r--r-- 1 user user 90403632 Feb 16 23:55 b.mp3
    -rw-r--r-- 1 user user 88253712 Feb 17 00:03 c.mp3
    -rw-r--r-- 1 user user 89767152 Feb 17 00:03 d.mp3
    -rw-r--r-- 1 user user 90403632 Feb 17 00:02 e.mp3
    -rw-r--r-- 1 user user 90403632 Feb 16 23:55 f.mp3

expwise commented 7 months ago

6.1.10 stops running halfway through. I tested it with a 100,000-word text file and got an MP3 covering only around 80,000 words. 6.1.9, however, could run through the entire process, yet the subtitles it generated capture very little of the text.

expwise commented 7 months ago

The text file contains 250,000 words, while both the MP3 and the subtitles consist of only around 80,000 words. 6.1.10: image

expwise commented 7 months ago

I remember previously the data for generating the MP3 would incrementally increase until completion, but now it goes from 0 directly to completion. I'm not sure if this is the reason for the issue.

expwise commented 7 months ago

6.1.9: image

rany2 commented 7 months ago

@expwise Thanks for the info, I'll attempt a workaround in a bit. For the time being, I guess you'll need to stick to 6.1.9 as it works better somehow. It's worth mentioning that both have issues, it just seems like in your case 6.1.10 is worse....

My theory is that it has to do with the fact that 6.1.10 switches to the next chunk of ~64KiB text immediately without creating a new connection whereas 6.1.9 emulates the Microsoft Edge behavior of starting from a new connection.

expwise commented 7 months ago

@rany2 Thank you for your efforts. Your project has been of great help to me. Well done!

briankendall commented 6 months ago

What's the status of this issue? When I first reported it I was using 6.1.9.

rany2 commented 6 months ago

@briankendall It's more complicated than I expected. The issue is that sometimes their API returns only part of the audio output on the same connection, so I can't just check whether the current connection returned any audio and retry if it didn't.

briankendall commented 6 months ago

@rany2 Understood! I hope you can figure out a method for working around this.

lefnire commented 5 months ago

Maybe a workaround could be for edge-tts to chunk text files into workable sizes, run them individually, then splice them back together at the end?

rany2 commented 5 months ago

@lefnire we're doing that already, I tried different chunk sizes and I'm having the same issues regardless :(
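For readers following along, the chunk-and-splice idea discussed above might look roughly like this. It is only a sketch under assumptions: `split_text` and `splice` are illustrative names, not edge-tts functions, the 2**16 default just mirrors the ~64KiB chunk size mentioned earlier in the thread, and the real splitting has to leave room for the SSML wrapper as well.

```python
from typing import Iterator, List


def split_text(text: str, max_bytes: int = 2**16) -> Iterator[str]:
    """Yield pieces whose UTF-8 encoding fits in max_bytes, breaking at whitespace.

    Limitation of this sketch: a single word longer than max_bytes is
    yielded as-is rather than being split further.
    """
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
        else:
            if current:
                yield current
            current = word
    if current:
        yield current


def splice(audio_parts: List[bytes]) -> bytes:
    """Join per-chunk audio in order (assumption: the MP3 segments can simply be concatenated)."""
    return b"".join(audio_parts)
```

As the reply above says, chunking alone does not help if the service returns incomplete audio for some of the chunks.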

lefnire commented 5 months ago

@rany2 aw bummer. Thanks for the reply. I just went through the gamut: tortoise-tts, coqui-ai/tts, bark, edge-tts. Edge was victorious, but for this one bug. Tortoise is unusably slow (but great realism). Coqui & Bark can't take large files, nor did I find their voices realistic. edge-tts shocked me in terms of realism and speed. Here's hoping there's a solution somehow! Huge bummer that the Edge browser doesn't support to-file without weird hoops (a recording-audio-out-overnight kind of deal).

rany2 commented 5 months ago

@lefnire not sure what you mean by to-file but you could actually save the mp3: https://github.com/rany2/edge-tts/blob/e58af9da76c7c7ba101c955ee1c2e98ce424f58f/examples/basic_generation.py#L19

lefnire commented 5 months ago

@rany2 right right, I meant it's a shame that Microsoft Edge Browser doesn't do this natively. Hence a big value-add of this project.

tschnibo commented 4 months ago

Hey People,

I am struggling with the same problem, and like @lefnire I'm using this for audiobook generation. Earlier I was able to produce several books without problems; nowadays it's a huge struggle.

But, I might have found some partial solution.

In my case (Python 3.9 on Mac) I received errors with

asyncio.exceptions.TimeoutError

Mostly it either produced some audio (often incomplete) or gave that error after a few seconds. Therefore I upped the 'receive_timeout' in 'communicate.py' from 5 to 9000:

    def __init__(
        self,
        text: str,
        voice: str = "Microsoft Server Speech Text to Speech Voice (en-US, AriaNeural)",
        *,
        rate: str = "+0%",
        volume: str = "+0%",
        pitch: str = "+0Hz",
        proxy: Optional[str] = None,
        receive_timeout: int = 9000,
    ):

This suppressed the above-mentioned error, but I still struggled with incomplete audio files...

I then looked into the 'aiohttp.ClientSession' documentation and found that there is a default total timeout of 300 seconds (5 minutes).

My audio files were around 20 MB each when they stopped being produced, and that often took about 5 minutes. After some iteration, I changed this to 9000 seconds (150 minutes) too:

        # Create a new connection to the service.
        ssl_ctx = ssl.create_default_context(cafile=certifi.where())

        # By default aiohttp uses a total timeout of 300 seconds (5 min),
        # which means the whole operation has to finish within 5 minutes (not long enough),
        # so we extend it quite a lot.
        timeout = aiohttp.ClientTimeout(total=9000)

        async with aiohttp.ClientSession(
            timeout=timeout,
            trust_env=True,
        ) as session, session.ws_connect(
            f"{WSS_URL}&ConnectionId={connect_id()}",

Since then it seems to work much better again – but not perfect!

I still get incomplete files, but fewer. What I have observed for several files already: they were cut off 10 minutes after the file was first created. This could hint at an upper limit of 10 minutes on the connection on the server side (the timeout could still be client-side). @rany2 I don't understand the software well enough. Is there an easy way to close the session after maybe 5 minutes and continue the text with a new session afterwards?

Disclaimer: I tried other things, e.g. reducing the threshold for the "chopping of the texts" from websocket_max_size: int = 2**16 to websocket_max_size: int = 2**12... this could have an effect too, but I don't think so. (as @rany2 already tested this anyways).

I should also say that I don't really understand the technicalities, and that I picked the 9000 seconds fairly arbitrarily.

As for the reason this problem occurs, I want to post a guess for discussion: maybe Microsoft started throttling the response/output, so it takes longer nowadays than it did earlier (which is my impression in any case), and therefore these timeouts matter nowadays, despite not having mattered earlier.

@rany2 thank you a lot for your software – I really enjoy using it for my use case, and have already listened to audiobooks created with this tool for many hours.

rany2 commented 4 months ago

@tschnibo Thanks for researching and for your kind words. I didn't know ClientSession had a timeout, and I never actually faced any timeout errors, so I don't think it's related to this issue specifically. I'll try to look into your points to see if they get me any closer to a resolution.

It seems like the timeout value for ClientSession is a timeout for the entire operation, which seems like something we wouldn't want in this context because a generation might take a very long time. I'll most likely disable it altogether and increase the receive_timeout to a minute.
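To illustrate the difference between a timeout on the entire operation and a timeout on each receive, here is a small standard-library sketch. It is not aiohttp or edge-tts code: `receive_all` and `messages` are hypothetical names, and the 60-second default only mirrors the "a minute" mentioned above.

```python
import asyncio
from typing import AsyncIterator, List


async def receive_all(
    messages: AsyncIterator[bytes],  # hypothetical stream of incoming messages
    receive_timeout: float = 60.0,
) -> List[bytes]:
    """Collect messages with a limit on each receive and no overall deadline."""
    received: List[bytes] = []
    iterator = messages.__aiter__()
    while True:
        try:
            # The clock restarts for every message, so a long generation is
            # fine as long as the gap between two messages stays under the limit.
            message = await asyncio.wait_for(iterator.__anext__(), receive_timeout)
        except StopAsyncIteration:
            break
        received.append(message)
    return received
```

With this shape, a generation that takes hours would not time out as long as data keeps arriving, while a stalled connection raises `asyncio.TimeoutError` after `receive_timeout` seconds.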

> As for the reason this problem occurs, I want to post a guess for discussion: maybe Microsoft started throttling the response/output, so it takes longer nowadays than it did earlier (which is my impression in any case), and therefore these timeouts matter nowadays, despite not having mattered earlier.

Makes sense.

tschnibo commented 4 months ago

@rany2 Thank you for your friendly response!

To illustrate the unfinished files: yesterday, after applying these changes, it looked like this:

Screenshot 2024-05-17 at 11:29:19

This time difference of 10 minutes between "created" and "last changed" seems like a pattern.

Disabling the ClientSession timeout seems like the right way to go, I totally agree. On the other hand, maybe one could define the timeout to be shorter than 10 minutes, catch the timeout and proactively create a new session, or something like that – but this async session handling and OOP is not something I easily see through, so I don't know what the easiest route would be. Maybe there is a way to just wait for the session to be terminated by the server and then reconnect with a new session – but I don't know if the server actively communicates this to the client.

Looking forward to watching the further development of this issue.

rany2 commented 4 months ago

Can someone test if the version in master (not the one released in pypi) still has this issue?

rany2 commented 4 months ago

Nevermind it's still inconsistent when it comes to this, but the first few runs were fine. I got my hopes up when it was working the first couple runs ):

    tests/001-long-text_a.mp3 tests/001-long-text_b.mp3 differ: byte 781, line 3
    tests/001-long-text_a.mp3 tests/001-long-text_g.mp3 differ: byte 27425332, line 110673
    tests/001-long-text_a.srt tests/001-long-text_g.srt differ: byte 177684, line 5505
    tests/001-long-text_a.mp3 tests/001-long-text_h.mp3 differ: byte 27425332, line 110673
    tests/001-long-text_a.mp3 tests/001-long-text_m.mp3 differ: byte 781, line 3
    tests/001-long-text_a.mp3 tests/001-long-text_r.mp3 differ: byte 781, line 3
    tests/001-long-text_a.srt tests/001-long-text_r.srt differ: byte 175768, line 5441
    tests/001-long-text_a.mp3 tests/001-long-text_z.mp3 differ: byte 781, line 3
    tests/001-long-text_a.srt tests/001-long-text_z.srt differ: byte 87159, line 2693

tschnibo commented 4 months ago

@rany2 I really had to raise both of the timeouts much more, so that only this 10-minute cutoff remains now. Have you tried timeouts as excessive as mine?

Maybe the first runs were "not further throttled", and then some sort of abuse prevention is activated on the server side, which slows the process further?

rany2 commented 4 months ago

@tschnibo but there's no way that receive_timeout would be more than a minute? it's for sock recv... are you sure? The receive_timeout now is controlling the receive for the low-level socket not websocket

tschnibo commented 4 months ago

@rany2 to be honest, I have no clue. It's just that with my timeout settings, the mp3 file is produced for 10 minutes and then ends up incomplete. When I chose smaller values for receive_timeout in the beginning, I still got this asyncio.exceptions.TimeoutError. But yes, because I changed both values, I am not 100% sure which one had which effect.

If you were able to reproduce this 10-minute phenomenon, maybe that would point to some mechanism.

tschnibo commented 4 months ago

For one of my examples I looked at the submitted text, the produced .vtt file, and also some of the (I think) websocket messages.

It stopped somewhere in the middle of the submitted text, with these being the last messages returned:

    {'type': 'WordBoundary', 'offset': 35077000000, 'duration': 1500000, 'text': 'that'}
    {'type': 'WordBoundary', 'offset': 35078625000, 'duration': 1500000, 'text': 'have'}
    {'type': 'WordBoundary', 'offset': 35080250000, 'duration': 6500000, 'text': 'significant'}
    {'type': 'WordBoundary', 'offset': 35086875000, 'duration': 7750000, 'text': 'implications'}
    {'type': 'WordBoundary', 'offset': 35094750000, 'duration': 1125000, 'text': 'for'}

... and then it starts with the next text, for the next file.
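The `offset` and `duration` fields in the messages above can be turned into a position in the audio. The sketch below assumes they are expressed in 100-nanosecond units (ten million per second), which the thread itself does not state; `ticks_to_timestamp` is an illustrative helper name.

```python
TICKS_PER_SECOND = 10_000_000  # assumption: offsets are in 100-nanosecond units


def ticks_to_timestamp(ticks: int) -> str:
    """Format a tick count as H:MM:SS.mmm (fractions of a millisecond are dropped)."""
    total_ms = ticks // (TICKS_PER_SECOND // 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours}:{minutes:02d}:{seconds:02d}.{millis:03d}"


# Last message from the log above.
last = {"type": "WordBoundary", "offset": 35094750000, "duration": 1125000, "text": "for"}
end_of_audio = ticks_to_timestamp(last["offset"] + last["duration"])
print(end_of_audio)  # 0:58:29.587
```

Under that assumption, the last word reported for this file ends roughly 58 and a half minutes into the audio.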

I think it would be interesting to monitor the connection and see if there is some sort of termination message.

rany2 commented 4 months ago

> see if there is some sort of termination message

There isn't unfortunately :(

> Just with my timeout setting, the mp3 file is produced for 10 minutes and then it is finished incomplete. when I chose smaller timeout values, in the beginning

Could you test the current version in master and see if you still get timeouts? The parameter now sets a timeout for socket recv, previously it was controlling the time it needs to get a websocket message response.

tschnibo commented 4 months ago

Yes, I'll try to test... I'm doing this alongside a completely different job, so I can't say when I'll get to the testing.

tschnibo commented 4 months ago

@rany2 to make my task easier, I patched my existing installation with your changes; I hope I did this correctly. The first few files went flawlessly, but now that the chapters are maybe getting longer again (or some throttling is kicking in again), it has just shown this 10-minute cutoff again, with the processing unfinished.

The next chapter went all right again (34 MB of audio generated in 4 minutes); the one after that was cut off again after 10 minutes and 18 MB, as were the next few chapters, until a much shorter chapter, which completed fine.

So for me the behavior seems to be the same as with my extended timeouts: either the files are generated correctly (and maybe rather quickly), or the process is slower and quits after 10 minutes for large texts, while it may succeed for shorter texts.

I didn't have any timeout-errors like in the pypi version...

rany2 commented 4 months ago

@tschnibo so you're saying that the defaults right now don't need any adjusting?

tschnibo commented 4 months ago

@rany2 I am not quite sure if I understand your question correctly. With your master-version I don't have these timeout-error-messages, like when I adjusted the timeouts myself – but the "unfinished" audio for long texts still occurs. Does that answer your question?

rany2 commented 4 months ago

@tschnibo yep, thank you. I just wanted to know that the timeout values in master are fine now.

tschnibo commented 4 months ago

@rany2 I am again working, and did no extended testing, but for the one audiobook (a different one than yesterday) it looks like it, yes!

tschnibo commented 4 months ago

@rany2 sorry, I misjudged, at a closer look I again discover these:

    Traceback (most recent call last):
      File "/opt/homebrew/bin/edge-tts", line 8, in <module>
        sys.exit(main())
      File "/opt/homebrew/lib/python3.9/site-packages/edge_tts/util.py", line 139, in main
        loop.run_until_complete(amain())
      File "/opt/homebrew/Cellar/python@3.9/3.9.19/Frameworks/Python.framework/Versions/3.9/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
        return future.result()
      File "/opt/homebrew/lib/python3.9/site-packages/edge_tts/util.py", line 132, in amain
        await _run_tts(args)
      File "/opt/homebrew/lib/python3.9/site-packages/edge_tts/util.py", line 65, in _run_tts
        async for chunk in tts.stream():
      File "/opt/homebrew/lib/python3.9/site-packages/edge_tts/communicate.py", line 445, in stream
        async for received in websocket:
      File "/opt/homebrew/lib/python3.9/site-packages/aiohttp/client_ws.py", line 312, in __anext__
        msg = await self.receive()
      File "/opt/homebrew/lib/python3.9/site-packages/aiohttp/client_ws.py", line 244, in receive
        msg = await self._reader.read()
      File "/opt/homebrew/lib/python3.9/site-packages/aiohttp/streams.py", line 663, in read
        return await super().read()
      File "/opt/homebrew/lib/python3.9/site-packages/aiohttp/streams.py", line 622, in read
        await self._waiter
    aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket

and this didn't occur with my timeout values.

Only one occurrence so far, and it could also be because I disconnected the notebook from my phone for some time, or something like that; I didn't monitor it closely enough.

I'll run it again, and share the experience...

edit: so the next run was without such timeout messages... but with some 10 minutes-maxed-out-outputs.

kovaacs commented 4 months ago

Just wondering, can the subtitles returned by the server be used as a sanity check for data completeness? So, if the subtitles returned do not match the text sent, assume all data (including audio) to be incomplete. I haven't looked into it, but I can if you think it's not a stupid idea.

rany2 commented 4 months ago

@kovaacs It's not a stupid idea and I did consider it but the issue is that some characters are ignored by TTS depending on the voice selection (i.e., if you send Chinese characters to an English voice it will just ignore it; so the issue is that you need to somehow figure out all the values every voice takes)

kovaacs commented 4 months ago

@rany2 I'm thinking maybe fuzzy matching could be an option. You compare the sent data with what was received, and see how similar they are. It could be an opt-in flag, e.g. --min-confidence 0.8, meaning if the similarity is below 80%, consider the chunk a failure and retry it. It'd be up to the user to choose their similarity score and also to ensure that they don't send garbage data that would skew the score.
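A rough sketch of what such a check could look like, using Python's standard `difflib`. Nothing here exists in edge-tts: `similarity`, `chunk_looks_complete` and the `min_confidence` parameter are hypothetical names that follow the proposed flag, and comparing lowercased word lists is just one possible way to score the match.

```python
import difflib
import re
from typing import Iterable, List


def _words(text: str) -> List[str]:
    """Lowercased word tokens, ignoring punctuation."""
    return re.findall(r"\w+", text.lower())


def similarity(sent_text: str, boundary_words: Iterable[str]) -> float:
    """Similarity (0.0 to 1.0) between the sent text and the words reported back."""
    sent = _words(sent_text)
    received = _words(" ".join(boundary_words))
    return difflib.SequenceMatcher(None, sent, received, autojunk=False).ratio()


def chunk_looks_complete(
    sent_text: str,
    boundary_words: Iterable[str],
    min_confidence: float = 0.8,  # hypothetical threshold, as in the proposal above
) -> bool:
    """True if the returned words are similar enough to the text that was sent."""
    return similarity(sent_text, boundary_words) >= min_confidence
```

For example, if only the first three of seven sent words come back, the score is 2 * 3 / (7 + 3) = 0.6, which is below the 0.8 threshold, so the chunk would be treated as a failure.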

rany2 commented 4 months ago

@kovaacs Seems like it's more trouble than it's worth to be honest. I think as a workaround it would probably work but I'm not willing to implement it myself.

tschnibo commented 4 months ago

I use the TTS output to manually check if the output is complete for the current chapter, or not. For me the question would be, what do you do with this information? Do you restart the session?

I don't know yet how to implement this session restart. If I did, I would maybe just chop the texts into pieces that don't take too long to produce, and set a session-restart time of about 5 minutes, in order to have a "safety margin".

So: just check how old the session is before sending the next chunk to be converted, and if it is older than 5 minutes, restart the session first.

Do you see how this could be implemented?
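The age check proposed above could be sketched like this. It only covers the bookkeeping; how a session is actually closed and reopened is left out, and `SessionClock` and its methods are hypothetical names, not part of edge-tts.

```python
import time
from typing import Callable


class SessionClock:
    """Tracks how long the current session has been open."""

    def __init__(
        self,
        max_age_seconds: float = 300.0,  # the 5-minute safety margin suggested above
        now: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_age = max_age_seconds
        self._now = now
        self._opened_at = now()

    def needs_restart(self) -> bool:
        """True once the session is at least max_age_seconds old."""
        return self._now() - self._opened_at >= self._max_age

    def mark_restarted(self) -> None:
        """Call right after a new session has been opened."""
        self._opened_at = self._now()
```

Before sending each chunk, the caller would check `needs_restart()`, reopen the session if it returns True, and then call `mark_restarted()`.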

rany2 commented 4 months ago

I don't think it's that simple, it already divides the text into chunks and starts new sessions (new session or reusing old connection makes no diff); the issue is that I receive incomplete audio data from the service so I cannot be sure if the data is complete or not by just looking at whether there is audio or not.

Also the 5 minutes thing is not really reliable, I've had it happen in my test within 2 mins; the trouble is that it is inconsistent and there doesn't seem to be a pattern I could find.

rany2 commented 4 months ago

I think I've found a solution, it seems like an off-by-one error on my end and the fix I initially tried would have worked; I'll keep you guys posted :')

rany2 commented 4 months ago

Please test latest master (make sure to include https://github.com/rany2/edge-tts/commit/580f880bdaf34aced24f4badae8e025563eb2844)

kovaacs commented 4 months ago

@rany2 n = 1, but I've just exported a 14+ hour long file without problems. Thanks for fixing it so quickly, you saved me quite a lot of manual effort.

mobad commented 3 months ago

@rany2 I'm working on a different Edge TTS client for Android and I'm running into a similar issue where we intermittently receive no data or time out. I was taking a look at your change https://github.com/rany2/edge-tts/commit/580f880bdaf34aced24f4badae8e025563eb2844 but I'm having a hard time seeing what the actual fix was. It looks like it can better detect the incomplete audio and throw an exception, but I don't see any retry mechanism. (Unless it's some Python magic I'm missing.) Would it be possible for you to point out what the underlying issue was and what the fix was? Thanks!

briankendall commented 2 months ago

Just popping in to say that this is indeed fixed. Thanks!

wslyyy commented 2 months ago

> @rany2 I'm working on a different Edge TTS client for Android and I'm running into a similar issue where we intermittently receive no data or time out. I was taking a look at your change 580f880 but I'm having a hard time seeing what the actual fix was. It looks like it can better detect the incomplete audio and throw an exception, but I don't see any retry mechanism. (Unless it's some Python magic I'm missing.) Would it be possible for you to point out what the underlying issue was and what the fix was? Thanks!

@rany2 The same question

rany2 commented 2 months ago

@mobad @wslyyy without source code it's hard to tell, but there are many factors that might cause this to happen. The annoying bit is that Microsoft does not return any errors but instead just silently fails, either completely or for some chunks.