rany2 / edge-tts

Use Microsoft Edge's online text-to-speech service from Python WITHOUT needing Microsoft Edge or Windows or an API key
https://pypi.org/project/edge-tts/
GNU General Public License v3.0
4.21k stars 444 forks

Problem for playing stream audio instantly using pydub #187

Closed omicroh closed 1 month ago

omicroh commented 4 months ago

Hello there,

I've been trying to implement fast audio streaming for 6 days now, but I just can't get it to work. Without the stream method, the save function of edge-tts takes about 1-2 s to generate the audio, depending on the text, which is too long.

In my code below, the audio does start playing instantly, regardless of text size, but there are artifacts between chunks, like tiny gaps.

Do you know how to do audio streaming correctly, please? Thank you!

import asyncio
import edge_tts
import pyaudio
from io import BytesIO
from pydub import AudioSegment

TEXT = "Hello World!"
VOICE = "en-GB-SoniaNeural"

p = pyaudio.PyAudio()

async def stream_tts(text: str, voice: str) -> None:
    # We're assuming a certain format, channels, and rate
    # This will need to be dynamic based on the actual audio data from TTS
    stream = p.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=26000,
        output=True
    )

    communicate = edge_tts.Communicate(text, voice)

    # Process and play audio chunks as they arrive
    async for chunk in communicate.stream():
        if chunk["type"] == "audio":
            try:
                audio_segment = AudioSegment.from_file(BytesIO(chunk["data"]), format="mp3")

                # Write data to the stream directly without extra buffering
                stream.write(audio_segment.raw_data)

                # If this is the last chunk, break after playing
                if chunk.get('end', False):
                    break
            except Exception as e:
                print("Error processing audio chunk:", e)

    # Cleanup
    stream.stop_stream()
    stream.close()
    p.terminate()

if __name__ == "__main__":
    # Run the asyncio event loop
    asyncio.run(stream_tts(TEXT, VOICE))

@rany2

rany2 commented 4 months ago

Maybe it's supposed to be like this? Sorry, I don't use pydub, but the p.open line seems wrong.

stream = p.open(
    format=pyaudio.paInt16,
    channels=1,
    rate=48000,
    output=True
)

Phouter0499 commented 2 months ago

So I got the audio streaming working. Here is the code:

import asyncio
import edge_tts
from io import BytesIO
from pydub import AudioSegment
from pydub.playback import play

TEXT = "To Sherlock Holmes she is always the woman. I have seldom heard him mention her under any other name."
VOICE = "en-US-AndrewMultilingualNeural"

async def amain() -> None:
    communicate = edge_tts.Communicate(TEXT, VOICE)
    async for chunk in communicate.stream():
        if chunk["type"] == "audio":
            try:
                buffer = BytesIO()
                buffer.write(chunk["data"])
                buffer.seek(0)
                audio_segment = AudioSegment.from_mp3(buffer)
                play(audio_segment)
                # If this is the last chunk, break after writing to buffer
                if chunk.get('end', False):
                    break
            except Exception as e:
                print("Error processing audio chunk:", e)   

if __name__ == "__main__":
    asyncio.run(amain())

Phouter0499 commented 2 months ago

Just tested with an even bigger text (chapter 1 of Sherlock Holmes), but I notice that the stream is not instantaneous. Maybe the online service processes all of the text before it starts streaming?

rany2 commented 1 month ago

@Phouter0499 maybe, but I'm sending you the chunks as soon as I receive them; so whatever it is it's an issue on Microsoft's end

FerLuisxd commented 1 month ago

The audio breaks in @Phouter0499's example, is it expected? :/

rany2 commented 1 month ago

@FerLuisxd If you're using large text please try the version in master (not pypi), recently #190 was fixed but I didn't make a release yet.

FerLuisxd commented 1 month ago

It is a short text. I managed to improve the speed a bit, but it still does not feel right. Here is the updated code:

import asyncio
import edge_tts
import pyaudio
from io import BytesIO
from pydub import AudioSegment

TEXT = "Hello World!"
VOICE = "en-US-AndrewMultilingualNeural"

async def amain() -> None:
    communicate = edge_tts.Communicate(TEXT, VOICE)
    p = pyaudio.PyAudio()
    stream = p.open(format=pyaudio.paInt16,
                    channels=1,
                    rate=24000,
                    output=True)
    async for chunk in communicate.stream():
        if chunk["type"] == "audio":
            try:
                stream.write(AudioSegment.from_file(BytesIO(chunk["data"]), format="mp3").raw_data)
            except Exception as e:
                print("Error processing audio chunk:", e)

    stream.stop_stream()
    stream.close()
    p.terminate()

if __name__ == "__main__":
    asyncio.run(amain())

Here is @Phouter0499's version:

import asyncio
import edge_tts
from io import BytesIO
from pydub import AudioSegment
from pydub.playback import play

TEXT = "Hello World!"
VOICE = "en-US-AndrewMultilingualNeural"

async def amain() -> None:
    communicate = edge_tts.Communicate(TEXT, VOICE)
    async for chunk in communicate.stream():
        if chunk["type"] == "audio":
            try:
                buffer = BytesIO()
                buffer.write(chunk["data"])
                buffer.seek(0)
                audio_segment = AudioSegment.from_mp3(buffer)
                play(audio_segment)
            except Exception as e:
                print("Error processing audio chunk:", e)   

if __name__ == "__main__":
    asyncio.run(amain())

If you test both versions, it still sounds like it is reading the text letter by letter. I'm wondering if there is a way to make it read faster? 🤔

FerLuisxd commented 1 month ago

I wonder if @omicroh managed to find a solution as well

rany2 commented 1 month ago

FYI there is no such thing as:

                # If this is the last chunk, break after writing to buffer
                if chunk.get('end', False):
                    break

I'm not sure where they got that from, I don't think you need it.
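
In other words, the loop just ends when the stream is exhausted. A minimal sketch of the filtering, assuming each chunk is a dict with a "type" key and, for audio chunks, a "data" bytes payload as in the snippets above. iter_audio_bytes is a made-up helper name, and the fake chunks stand in for what communicate.stream_sync() would yield:

```python
from typing import Iterable, Iterator


def iter_audio_bytes(chunks: Iterable[dict]) -> Iterator[bytes]:
    """Yield only non-empty audio payloads; ignore every other chunk type."""
    for chunk in chunks:
        if chunk.get("type") == "audio" and chunk.get("data"):
            yield chunk["data"]
    # No explicit "end" check: the loop stops when the iterable is exhausted.


if __name__ == "__main__":
    # Hand-written stand-ins for real stream chunks (illustrative only).
    fake_chunks = [
        {"type": "audio", "data": b"\x01\x02"},
        {"type": "WordBoundary", "text": "Hello"},  # metadata, not audio
        {"type": "audio", "data": b""},             # empty payload, skipped
        {"type": "audio", "data": b"\x03"},
    ]
    print(b"".join(iter_audio_bytes(fake_chunks)))
```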

Phouter0499 commented 1 month ago

The audio breaks in @Phouter0499's example, is it expected? :/

What OS are you using? I tried this code again on my Linux machine and it doesn't seem to work: the audio sounds like it was sped up or something. When I tried it on my Windows laptop, it worked just fine.

FerLuisxd commented 1 month ago

I'm using Windows with the latest 6.1.11 version; Python is 3.11.3.

Thanks for your test! What do you mean by "it worked just fine"? I feel like what @Phouter0499 wanted is for it to sound like the output .mp3 file, but without having to wait for the whole file. On my computer it says the text letter by letter, with noticeable gaps in between, I'd say at around 0.3% of the speed of the .mp3 file. What about when you tried it?

Phouter0499 commented 1 month ago

I'm using Windows with the latest 6.1.11 version; Python is 3.11.3.

Thanks for your test! What do you mean by "it worked just fine"? I feel like what @Phouter0499 wanted is for it to sound like the output .mp3 file, but without having to wait for the whole file. On my computer it says the text letter by letter, with noticeable gaps in between, I'd say at around 0.3% of the speed of the .mp3 file. What about when you tried it?

Sorry for the late response. I meant that when I made my example, it played the audio returned from the communicate.stream method very smoothly, without breaks or anything weird. However, when I tested this example on my Windows machine just now, it didn't work anymore. Frankly, I have no idea what is happening. I am going to continue working on this...

FerLuisxd commented 1 month ago

@Phouter0499, I saw an error in your code that explains why it may be producing sound faster: the rate you originally put should be 24000, not 26000. If you increase the number even further, it will sound faster but also higher pitched.
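
To put rough numbers on that: raw PCM written to an output stream is played at whatever rate the stream was opened with, so a mismatch scales both speed and pitch by the ratio of the two rates. A small sketch of the arithmetic, assuming the decoded audio really is 24000 Hz (the function names are only for illustration):

```python
import math


def speed_factor(actual_rate: int, playback_rate: int) -> float:
    """How many times faster than intended the audio plays."""
    return playback_rate / actual_rate


def pitch_shift_semitones(actual_rate: int, playback_rate: int) -> float:
    """Pitch change, in semitones, caused by the rate mismatch."""
    return 12 * math.log2(playback_rate / actual_rate)


if __name__ == "__main__":
    for rate in (24000, 26000, 48000):
        print(
            f"{rate} Hz: {speed_factor(24000, rate):.3f}x speed, "
            f"{pitch_shift_semitones(24000, rate):+.2f} semitones"
        )
```

Rather than hard-coding the rate, it should also be possible to read frame_rate, channels, and sample_width from the first decoded pydub AudioSegment and open the PyAudio stream with those values.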

Phouter0499 commented 1 month ago

So this time I think I made it work, but there is a limitation. I tested the code below with the first book of the Sherlock Holmes series. There does seem to be a slight break after every 100 chunks, BUT the streaming was instant. Please try this one and tell me if it works. You also have to install just_playback and download a book or some other long text file.

import edge_tts
import just_playback
import os
import time

with open("pg1661.txt", "r", encoding="utf-8") as f:
    TEXT = f.read()
VOICE = "en-US-AndrewMultilingualNeural"

def main() -> None:
    communicate = edge_tts.Communicate(TEXT, VOICE)
    group_chunk_size = 100
    n_chunk_written = 0
    playback = just_playback.Playback()
    for chunk in communicate.stream_sync():
        if chunk["type"] == "audio":
            try:
                # this is last chunk (empty)
                if chunk['data'] == b'':
                    continue
                with open("temp.mp3", "ab") as f:
                    f.write(chunk['data'])
                    n_chunk_written += 1
                if n_chunk_written == group_chunk_size:
                    playback.load_file('temp.mp3')
                    playback.play()
                    while playback.active:
                        time.sleep(0.001)
                    n_chunk_written = 0
                    os.remove("temp.mp3")                    
            except Exception as e:
                print("Error processing audio chunk:", e)

if __name__ == "__main__":
    if os.path.exists("temp.mp3"):
        os.remove("temp.mp3")
    main()

FerLuisxd commented 1 month ago

Wow this is much better! Tried it with a random gpt response I had (around 300 words). It played correctly for a few seconds (around 5) but then I got:

Error processing audio chunk: [WinError 32] The process cannot access the file because it is being used by another process: 'temp.mp3'
Error processing audio chunk: [Errno 13] Permission denied: 'temp.mp3'

It also seems to fail at the same point every time I run it 🤔. I wonder if it's really necessary to save the audio to a file rather than play it on the go...

rany2 commented 1 month ago

@FerLuisxd I took a look at the just_playback code, I think you should do this (sorry untested and on mobile atm):

import edge_tts
import just_playback
import os
import time

with open("pg1661.txt", "r", encoding="utf-8") as f:
    TEXT = f.read()
VOICE = "en-US-AndrewMultilingualNeural"

def main() -> None:
    communicate = edge_tts.Communicate(TEXT, VOICE)
    group_chunk_size = 100
    n_chunk_written = 0
    for chunk in communicate.stream_sync():
        if chunk["type"] == "audio":
            try:
                # this is last chunk (empty)
                if chunk['data'] == b'':
                    continue
                with open("temp.mp3", "ab") as f:
                    f.write(chunk['data'])
                    n_chunk_written += 1
                if n_chunk_written == group_chunk_size:
                    playback = just_playback.Playback('temp.mp3')
                    playback.play()
                    while playback.active:
                        time.sleep(0.001)
                    playback.__del__()
                    n_chunk_written = 0
                    os.remove("temp.mp3")                    
            except Exception as e:
                print("Error processing audio chunk:", e)

if __name__ == "__main__":
    if os.path.exists("temp.mp3"):
        os.remove("temp.mp3")
    main()

Phouter0499 commented 1 month ago

@FerLuisxd I took a look at the just_playback code, I think you should do this (sorry untested and on mobile atm):

playback.__del__() causes a "Segmentation fault (core dumped)" error.

Once I removed this line everything worked, but I'm not sure whether it will work on Windows. Could someone try it?

rany2 commented 1 month ago

It's probably a bug with that library, it's not closing the file once it's done with it.

FerLuisxd commented 1 month ago

Sadly that did not fix the issue, it still fails at the exact same time :/

$ python github.py
Segmentation fault

FerLuisxd commented 1 month ago

Hey @Phouter0499 ! I think Claude Sonnet helped me here! Please try this with your own text and tell me how it goes! Also tried it with longer TEXT from before and it worked! It didn't crash this time!

import edge_tts
import pyaudio
from io import BytesIO
from pydub import AudioSegment

TEXT = 'Hello World! How are you guys doing? I hope great, cause I am having fun and honestly it has been a blast'
VOICE = "en-US-AndrewMultilingualNeural"
CHUNK_SIZE = 20

def main() -> None:
    communicator = edge_tts.Communicate(TEXT, VOICE)
    audio_chunks = []

    pyaudio_instance = pyaudio.PyAudio()
    audio_stream = pyaudio_instance.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)

    for chunk in communicator.stream_sync():
        if chunk["type"] == "audio" and chunk["data"]:
            audio_chunks.append(chunk["data"])
            if len(audio_chunks) >= CHUNK_SIZE:
                play_audio_chunks(audio_chunks, audio_stream)
                audio_chunks.clear()

    # Play the rest of the audio
    play_audio_chunks(audio_chunks, audio_stream)

    audio_stream.stop_stream()
    audio_stream.close()
    pyaudio_instance.terminate()

def play_audio_chunks(chunks: list[bytes], stream: pyaudio.Stream) -> None:
    stream.write(AudioSegment.from_mp3(BytesIO(b''.join(chunks))).raw_data)

if __name__ == "__main__":
    main()

Phouter0499 commented 1 month ago

Hey @Phouter0499 ! I think Claude Sonnet helped me here! Please try this with your own text and tell me how it goes! Also tried it with longer TEXT from before and it worked! It didn't crash this time!

works on my linux machine. I notice the "ALSA lib pcm.c:8568:(snd_pcm_recover) underrun occurred" errors but this isn't too bad. worked with my sherlock holmes book.

FerLuisxd commented 1 month ago

Maybe try to play around with the group_chunk_size?

Phouter0499 commented 1 month ago

Maybe try to play around with the group_chunk_size?

100 is good for me but maybe tomorrow I will try some concurrency to get rid of the usage of group_chunk_size. Can you ask Claude to do that perhaps?
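
For reference, one common shape for that kind of concurrency is a producer/consumer pair: one thread decodes batches while another does the blocking writes to the audio device, with a bounded queue in between. A minimal sketch using only the standard library; play_concurrently, decode, write, and max_buffered are hypothetical names, and the sketch only overlaps decoding with playback. Each batch is still decoded separately, so it does not remove the batching by itself:

```python
import queue
import threading
from typing import Callable, Iterable

_DONE = object()  # sentinel marking the end of the stream


def play_concurrently(
    batches: Iterable[bytes],
    decode: Callable[[bytes], bytes],
    write: Callable[[bytes], None],
    max_buffered: int = 8,
) -> None:
    """Decode on the calling thread while a worker thread writes the PCM out."""
    pcm_queue: queue.Queue = queue.Queue(maxsize=max_buffered)
    errors: list[BaseException] = []

    def consumer() -> None:
        while True:
            item = pcm_queue.get()
            if item is _DONE:
                return
            if errors:
                continue  # keep draining so the producer never blocks forever
            try:
                write(item)  # blocking write, e.g. a PyAudio stream's write()
            except Exception as exc:
                errors.append(exc)

    worker = threading.Thread(target=consumer, daemon=True)
    worker.start()
    try:
        for batch in batches:
            pcm_queue.put(decode(batch))  # blocks while the buffer is full
    finally:
        pcm_queue.put(_DONE)
        worker.join()
    if errors:
        raise errors[0]


if __name__ == "__main__":
    # Stand-ins: upper-casing plays the role of decoding, a list the role of the device.
    played: list[bytes] = []
    play_concurrently([b"ab", b"cd"], decode=bytes.upper, write=played.append)
    print(played)
```

With the snippets in this thread, decode could be something like lambda b: AudioSegment.from_mp3(BytesIO(b)).raw_data and write could be audio_stream.write.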

FerLuisxd commented 1 month ago

I just asked and it seems to be CPU-bottleneck related? I don't see that error at all on my PC. @rany2 @omicroh do you think you could test my code and let us know as well? 🙏 Edit: Sorry, I just read the part about concurrency. I don't understand: why would we need concurrency here? We could use communicate.stream() (no _sync) with asyncio, or is that what you want? I'm using the free tier of Claude btw.

Edit2: I've been playing around with a similar solution that doesn't use arrays (hopefully a bit more performant?). In my tests both perform about the same, and on my PC this one cuts out a bit.

import edge_tts
import pyaudio
from io import BytesIO
from pydub import AudioSegment
import time

TEXT = 'Hello World! How are you guys doing? I hope great, cause I am having fun and honestly it has been a blast'
VOICE = "en-US-AndrewMultilingualNeural"
CHUNK_SIZE = 20 * 240  # Assuming 240 bytes per chunk (adjust based on format)

def main() -> None:
  start_time = time.time()
  communicator = edge_tts.Communicate(TEXT, VOICE)

  pyaudio_instance = pyaudio.PyAudio()
  audio_stream = pyaudio_instance.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)

  total_data = b''  # Store audio data instead of chunks

  for chunk in communicator.stream_sync():
    if chunk["type"] == "audio" and chunk["data"]:
      total_data += chunk["data"]
      if len(total_data) >= CHUNK_SIZE:
        print(f"Time elapsed: {time.time() - start_time:.2f} seconds")  # Print time
        play_audio(total_data[:CHUNK_SIZE], audio_stream)  # Play first CHUNK_SIZE bytes
        total_data = total_data[CHUNK_SIZE:]  # Remove played data

  # Play remaining audio
  play_audio(total_data, audio_stream)

  audio_stream.stop_stream()
  audio_stream.close()
  pyaudio_instance.terminate()

def play_audio(data: bytes, stream: pyaudio.Stream) -> None:
  stream.write(AudioSegment.from_mp3(BytesIO(data)).raw_data)

if __name__ == "__main__":
  main()
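
One possible reason for the cuts in this version: slicing at a fixed byte count splits the MP3 data at arbitrary positions rather than at frame boundaries, and every slice is then decoded on its own. An alternative that avoids per-batch decoding altogether is to feed all chunks into a single long-running decoder and play the PCM it emits. A rough, untested sketch, assuming the ffmpeg binary is on PATH and that the audio is mono 24000 Hz 16-bit as in the snippets above; decode_cmd, pump_pcm, and READ_SIZE are hypothetical names:

```python
import subprocess
import threading

READ_SIZE = 4096  # bytes of PCM pulled from the decoder per read (arbitrary choice)


def decode_cmd(rate: int = 24000, channels: int = 1) -> list[str]:
    """ffmpeg command line: MP3 on stdin -> raw signed 16-bit little-endian PCM on stdout."""
    return [
        "ffmpeg", "-loglevel", "error",
        "-f", "mp3", "-i", "pipe:0",
        "-f", "s16le", "-ac", str(channels), "-ar", str(rate),
        "pipe:1",
    ]


def pump_pcm(stdout, write, read_size: int = READ_SIZE) -> None:
    """Copy decoded PCM from the decoder's stdout to the audio device until EOF."""
    while True:
        pcm = stdout.read(read_size)
        if not pcm:
            break
        write(pcm)


def main() -> None:
    # Imported here so the helpers above can be used without audio hardware.
    import edge_tts
    import pyaudio

    text = "Hello World! How are you guys doing?"
    communicator = edge_tts.Communicate(text, "en-US-AndrewMultilingualNeural")

    pyaudio_instance = pyaudio.PyAudio()
    audio_stream = pyaudio_instance.open(
        format=pyaudio.paInt16, channels=1, rate=24000, output=True
    )
    decoder = subprocess.Popen(
        decode_cmd(), stdin=subprocess.PIPE, stdout=subprocess.PIPE
    )
    player = threading.Thread(
        target=pump_pcm, args=(decoder.stdout, audio_stream.write)
    )
    player.start()
    try:
        # Forward the MP3 bytes untouched; ffmpeg sees one continuous stream.
        for chunk in communicator.stream_sync():
            if chunk["type"] == "audio" and chunk["data"]:
                decoder.stdin.write(chunk["data"])
    finally:
        decoder.stdin.close()  # EOF lets ffmpeg flush its output and exit
    player.join()
    decoder.wait()

    audio_stream.stop_stream()
    audio_stream.close()
    pyaudio_instance.terminate()
```

Calling main() should start playback as soon as the decoder has produced its first block of PCM. How much latency ffmpeg adds before that first block was not measured here.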