Closed omicroh closed 1 month ago
Maybe it is supposed to be like this? Sorry, I don't use pydub, but the p.open line seems wrong.
stream = p.open(
    format=pyaudio.paInt16,
    channels=1,
    rate=48000,
    output=True
)
So I got the audio streaming working. Here is the code:
import asyncio
import edge_tts
from io import BytesIO
from pydub import AudioSegment
from pydub.playback import play
TEXT = "To Sherlock Holmes she is always the woman. I have seldom heard him mention her under any other name."
VOICE = "en-US-AndrewMultilingualNeural"
async def amain() -> None:
    communicate = edge_tts.Communicate(TEXT, VOICE)
    async for chunk in communicate.stream():
        if chunk["type"] == "audio":
            try:
                buffer = BytesIO()
                buffer.write(chunk["data"])
                buffer.seek(0)
                audio_segment = AudioSegment.from_mp3(buffer)
                play(audio_segment)
                # If this is the last chunk, break after writing to buffer
                if chunk.get('end', False):
                    break
            except Exception as e:
                print("Error processing audio chunk:", e)
if __name__ == "__main__":
    asyncio.run(amain())
Just tested with even bigger text (chapter 1 of Sherlock Holmes), but I notice that the stream is not instantaneous. Maybe the online service processes all of the text first before it starts streaming?
@Phouter0499 maybe, but I'm sending you the chunks as soon as I receive them; so whatever it is it's an issue on Microsoft's end
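One way to check where the delay comes from is to time how long the first audio chunk takes to arrive. The sketch below is only an illustration: `fake_stream` is a hypothetical stand-in for the real stream (it is not part of edge-tts), and the chunk dictionaries just mirror the `type`/`data` keys used in the code above.

```python
import time
from typing import Iterable, Iterator, Optional


def time_to_first_audio(chunks: Iterable[dict]) -> Optional[float]:
    """Return seconds elapsed until the first non-empty audio chunk, or None."""
    start = time.perf_counter()
    for chunk in chunks:
        if chunk["type"] == "audio" and chunk["data"]:
            return time.perf_counter() - start
    return None


def fake_stream(delay: float) -> Iterator[dict]:
    """Hypothetical stand-in for the TTS stream: waits, then yields audio."""
    time.sleep(delay)  # simulates the service preparing its first chunk
    yield {"type": "audio", "data": b"\x00" * 16}
    yield {"type": "audio", "data": b"\x00" * 16}


latency = time_to_first_audio(fake_stream(0.05))
print(f"first audio chunk after {latency:.3f} s")
```

If the number grows with the length of the input text, the wait is happening before the first chunk is sent, which would point at the service rather than the client loop.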
The audio breaks in @Phouter0499's example, is it expected? :/
@FerLuisxd If you're using large text please try the version in master (not pypi), recently #190 was fixed but I didn't make a release yet.
It is a short text. I managed to improve the speed a bit, but it still does not feel right. Here is the updated code:
import asyncio
import edge_tts
import pyaudio
from io import BytesIO
from pydub import AudioSegment
TEXT = "Hello World!"
VOICE = "en-US-AndrewMultilingualNeural"
async def amain() -> None:
    communicate = edge_tts.Communicate(TEXT, VOICE)
    p = pyaudio.PyAudio()
    stream = p.open(format=pyaudio.paInt16,
                    channels=1,
                    rate=24000,
                    output=True)
    async for chunk in communicate.stream():
        if chunk["type"] == "audio":
            try:
                stream.write(AudioSegment.from_file(BytesIO(chunk["data"]), format="mp3").raw_data)
            except Exception as e:
                print("Error processing audio chunk:", e)
    stream.stop_stream()
    stream.close()
    p.terminate()
if __name__ == "__main__":
    asyncio.run(amain())
Here is @Phouter0499's version:
import asyncio
import edge_tts
from io import BytesIO
from pydub import AudioSegment
from pydub.playback import play
TEXT = "Hello World!"
VOICE = "en-US-AndrewMultilingualNeural"
async def amain() -> None:
    communicate = edge_tts.Communicate(TEXT, VOICE)
    async for chunk in communicate.stream():
        if chunk["type"] == "audio":
            try:
                buffer = BytesIO()
                buffer.write(chunk["data"])
                buffer.seek(0)
                audio_segment = AudioSegment.from_mp3(buffer)
                play(audio_segment)
            except Exception as e:
                print("Error processing audio chunk:", e)
if __name__ == "__main__":
    asyncio.run(amain())
If you test both versions, it still feels like it is reading the text letter by letter. I'm wondering if there is a way to make it read faster?
I wonder if @omicroh managed to find a solution as well
FYI there is no such thing as:
# If this is the last chunk, break after writing to buffer
if chunk.get('end', False):
    break
I'm not sure where they got that from, I don't think you need it.
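Going by the code in this thread, the loop simply ends when the stream is exhausted, so no end marker is needed. A minimal sketch of that, with a hypothetical `fake_stream` generator standing in for the real stream (the `"metadata"` chunk type is invented here purely for illustration):

```python
from typing import Iterator


def fake_stream() -> Iterator[dict]:
    """Hypothetical stand-in for the TTS stream; not the real edge-tts API."""
    yield {"type": "audio", "data": b"abc"}
    yield {"type": "metadata", "data": b""}  # invented non-audio chunk
    yield {"type": "audio", "data": b"def"}
    # Falling off the end here finishes the generator, which ends the loop below.


def collect_audio(chunks: Iterator[dict]) -> bytes:
    """Concatenate the payload of every audio chunk; no 'end' flag involved."""
    audio = bytearray()
    for chunk in chunks:
        if chunk["type"] == "audio":
            audio.extend(chunk["data"])
    return bytes(audio)


print(collect_audio(fake_stream()))
```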
What OS are you using? I tried this code again on my Linux machine, and it looks like the code doesn't work there: the audio sounds like it was sped up or something. When I tried it on my Windows laptop, it worked just fine.
I'm using Windows with the latest version, 6.1.11. Python is 3.11.3.
Thanks for your test! What do you mean by "it worked just fine"? I feel like what @Phouter0499 wanted is something similar to the output .mp3 file, but without having to wait for the whole file. On my computer it says the text letter by letter, with noticeable gaps between them, I'd say around 0.3% speed of the .mp3 file. What about when you tried it?
Sorry for the late response. I meant that when I made my example, it was able to play the sounds returned from the communicate.stream method very smoothly, without any breakage or other weirdness. However, when I tested this example on my Windows machine just now, it didn't work anymore. Frankly, I have no idea what is happening. I am going to continue working on this...
@Phouter0499, I spotted an error in your code that may explain why it plays the audio faster: the rate you originally set. It should be 24000, not 26000. If you increase the number even further, it may sound faster still, but also higher pitched.
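To put numbers on the rate mismatch: raw PCM carries no sample rate of its own, so the output device plays the samples at whatever rate the stream was opened with. A small sketch, assuming the decoded audio really is 24000 Hz as stated above (the function names are made up for illustration):

```python
def speed_factor(source_rate: int, output_rate: int) -> float:
    """How much faster than intended the audio plays (1.0 means correct)."""
    return output_rate / source_rate


def played_duration(n_samples: int, output_rate: int) -> float:
    """Seconds needed to play n_samples of mono PCM at output_rate."""
    return n_samples / output_rate


SOURCE_RATE = 24000       # rate of the decoded audio, per the comment above
one_second = SOURCE_RATE  # number of samples in one second of speech

for rate in (24000, 26000, 48000):
    print(rate, speed_factor(SOURCE_RATE, rate), played_duration(one_second, rate))
```

Opening the stream at 48000 would play one second of speech in half a second, an octave higher; at 26000 the effect is small but still audible.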
So this time I think I made it work, but there is a limitation. I tested the code below with the first of the Sherlock Holmes series. There does seem to be a slight break after every 100 chunks, BUT the streaming was instant. Please try this one and tell me if it works. You also have to install just_playback and download some book or long text file.
import edge_tts
import just_playback
import os
import time
with open("pg1661.txt", "r", encoding="utf-8") as f:
    TEXT = f.read()
VOICE = "en-US-AndrewMultilingualNeural"
def main() -> None:
    communicate = edge_tts.Communicate(TEXT, VOICE)
    group_chunk_size = 100
    n_chunk_written = 0
    playback = just_playback.Playback()
    for chunk in communicate.stream_sync():
        if chunk["type"] == "audio":
            try:
                # this is last chunk (empty)
                if chunk['data'] == b'':
                    continue
                with open("temp.mp3", "ab") as f:
                    f.write(chunk['data'])
                n_chunk_written += 1
                if n_chunk_written == group_chunk_size:
                    playback.load_file('temp.mp3')
                    playback.play()
                    while playback.active:
                        time.sleep(0.001)
                    n_chunk_written = 0
                    os.remove("temp.mp3")
            except Exception as e:
                print("Error processing audio chunk:", e)
if __name__ == "__main__":
    if os.path.exists("temp.mp3"):
        os.remove("temp.mp3")
    main()
Wow this is much better! Tried it with a random gpt response I had (around 300 words). It played correctly for a few seconds (around 5) but then I got:
Error processing audio chunk: [WinError 32] The process cannot access the file because it is being used by another process: 'temp.mp3'
Error processing audio chunk: [Errno 13] Permission denied: 'temp.mp3'
It also seems to always fail at the same point every time I run it. I wonder if it is really necessary to save the audio to a file instead of playing it on the go...
@FerLuisxd I took a look at the just_playback code, I think you should do this (sorry untested and on mobile atm):
import edge_tts
import just_playback
import os
import time
with open("pg1661.txt", "r", encoding="utf-8") as f:
    TEXT = f.read()
VOICE = "en-US-AndrewMultilingualNeural"
def main() -> None:
    communicate = edge_tts.Communicate(TEXT, VOICE)
    group_chunk_size = 100
    n_chunk_written = 0
    for chunk in communicate.stream_sync():
        if chunk["type"] == "audio":
            try:
                # this is last chunk (empty)
                if chunk['data'] == b'':
                    continue
                with open("temp.mp3", "ab") as f:
                    f.write(chunk['data'])
                n_chunk_written += 1
                if n_chunk_written == group_chunk_size:
                    playback = just_playback.Playback('temp.mp3')
                    playback.play()
                    while playback.active:
                        time.sleep(0.001)
                    playback.__del__()
                    n_chunk_written = 0
                    os.remove("temp.mp3")
            except Exception as e:
                print("Error processing audio chunk:", e)
if __name__ == "__main__":
    if os.path.exists("temp.mp3"):
        os.remove("temp.mp3")
    main()
playback.__del__() causes a "Segmentation fault (core dumped)" error.
Once I removed this line everything worked, but I'm not sure whether it will work on Windows. Could someone try it?
It's probably a bug with that library, it's not closing the file once it's done with it.
Sadly that did not fix the issue, it still fails at the exact same time :/
$ python github.py
Segmentation fault
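If the player library really does keep `temp.mp3` open, as suspected above, one possible workaround is to never reuse the file name: write each group of chunks to its own uniquely named file and hand that path to the player. This is an untested idea as far as just_playback goes; the sketch below only shows the file handling with the standard library, and `write_group` is a made-up helper name.

```python
import os
import tempfile
from typing import List


def write_group(chunks: List[bytes], directory: str) -> str:
    """Write one group of MP3 chunks to its own uniquely named file."""
    fd, path = tempfile.mkstemp(suffix=".mp3", dir=directory)
    with os.fdopen(fd, "wb") as f:
        f.write(b"".join(chunks))
    return path  # the file is closed at this point, ready for a player


with tempfile.TemporaryDirectory() as tmp:
    first = write_group([b"aa", b"bb"], tmp)
    second = write_group([b"cc"], tmp)
    print(first != second, os.path.getsize(first), os.path.getsize(second))
```

Cleanup is left to the temporary directory here. If a player still holds a file open when the directory is removed, deletion can fail on Windows in the same way as before, so this only sidesteps the reuse problem.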
Hey @Phouter0499 ! I think Claude Sonnet helped me here! Please try this with your own text and tell me how it goes! Also tried it with longer TEXT from before and it worked! It didn't crash this time!
import edge_tts
import pyaudio
from io import BytesIO
from pydub import AudioSegment
TEXT = 'Hello World! How are you guys doing? I hope great, cause I am having fun and honestly it has been a blast'
VOICE = "en-US-AndrewMultilingualNeural"
CHUNK_SIZE = 20
def main() -> None:
    communicator = edge_tts.Communicate(TEXT, VOICE)
    audio_chunks = []
    pyaudio_instance = pyaudio.PyAudio()
    audio_stream = pyaudio_instance.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)
    for chunk in communicator.stream_sync():
        if chunk["type"] == "audio" and chunk["data"]:
            audio_chunks.append(chunk["data"])
            if len(audio_chunks) >= CHUNK_SIZE:
                play_audio_chunks(audio_chunks, audio_stream)
                audio_chunks.clear()
    # Play the rest of the audio
    play_audio_chunks(audio_chunks, audio_stream)
    audio_stream.stop_stream()
    audio_stream.close()
    pyaudio_instance.terminate()
def play_audio_chunks(chunks: list[bytes], stream: pyaudio.Stream) -> None:
    stream.write(AudioSegment.from_mp3(BytesIO(b''.join(chunks))).raw_data)
if __name__ == "__main__":
    main()
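The grouping logic in the code above can be pulled out into a small generator, which makes it easy to test without any audio hardware. This is just a sketch; `batched_chunks` is a made-up helper name, not part of edge-tts or pydub.

```python
from typing import Iterable, Iterator


def batched_chunks(chunks: Iterable[bytes], size: int) -> Iterator[bytes]:
    """Join every `size` non-empty chunks into one blob; flush the remainder."""
    pending = []
    for data in chunks:
        if not data:
            continue  # skip empty chunks, like the chunk["data"] check above
        pending.append(data)
        if len(pending) >= size:
            yield b"".join(pending)
            pending.clear()
    if pending:
        yield b"".join(pending)  # whatever is left once the stream ends


blobs = list(batched_chunks([b"a", b"b", b"", b"c", b"d", b"e"], size=2))
print(blobs)
```

Each blob would then go through the same decode-and-write step as in `play_audio_chunks`. A larger `size` means fewer seams between decoded pieces but a longer wait before the first sound.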
Works on my Linux machine. I notice the "ALSA lib pcm.c:8568:(snd_pcm_recover) underrun occurred" errors, but this isn't too bad. It worked with my Sherlock Holmes book.
Maybe try to play around with the group_chunk_size?
100 is good for me but maybe tomorrow I will try some concurrency to get rid of the usage of group_chunk_size. Can you ask Claude to do that perhaps?
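The concurrency idea could look roughly like this: one thread keeps receiving chunks while another plays them, so downloading and playback overlap instead of taking turns. A minimal sketch using only the standard library; `fake_stream` is a hypothetical stand-in for the real chunk stream, and the `played.append` line stands in for the actual decode and playback call.

```python
import queue
import threading
from typing import Iterator, List

_DONE = object()  # sentinel telling the consumer that the stream has ended


def fake_stream() -> Iterator[bytes]:
    """Hypothetical stand-in for the audio chunks coming from the service."""
    for i in range(5):
        yield f"chunk-{i}".encode()


def producer(chunks: Iterator[bytes], q: queue.Queue) -> None:
    """Receive chunks as fast as they arrive and hand them to the player."""
    for data in chunks:
        q.put(data)
    q.put(_DONE)


def consumer(q: queue.Queue, played: List[bytes]) -> None:
    """Take chunks off the queue in order and 'play' them."""
    while True:
        item = q.get()
        if item is _DONE:
            break
        played.append(item)  # a real player would decode and write audio here


def run() -> List[bytes]:
    q = queue.Queue(maxsize=32)  # bounded, so memory use stays small
    played = []
    worker = threading.Thread(target=consumer, args=(q, played))
    worker.start()
    producer(fake_stream(), q)
    worker.join()
    return played


print(run())
```

The queue preserves order, so the audio would still play in sequence. Whether this removes the gaps depends on the decoding step, which this sketch does not cover.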
I just asked, and it seems to be related to a CPU bottleneck? I don't see that error at all on my PC. @rany2 @omicroh, do you think you could test my code and let us know as well? Edit: Sorry, I just read the part about concurrency and I don't understand: why would we need concurrency here? We could use communicate.stream() (no _sync) with asyncio, or is that what you want? I'm using the free tier of Claude, btw.
Edit2: I was playing around and came up with a similar solution that doesn't use arrays (hopefully a bit more performant?). In my tests both perform about the same, though, and on my PC this one cuts out a bit:
import edge_tts
import pyaudio
from io import BytesIO
from pydub import AudioSegment
import time
TEXT = 'Hello World! How are you guys doing? I hope great, cause I am having fun and honestly it has been a blast'
VOICE = "en-US-AndrewMultilingualNeural"
CHUNK_SIZE = 20 * 240  # Assuming 240 bytes per chunk (adjust based on format)
def main() -> None:
    start_time = time.time()
    communicator = edge_tts.Communicate(TEXT, VOICE)
    pyaudio_instance = pyaudio.PyAudio()
    audio_stream = pyaudio_instance.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)
    total_data = b''  # Store audio data instead of chunks
    for chunk in communicator.stream_sync():
        if chunk["type"] == "audio" and chunk["data"]:
            total_data += chunk["data"]
            if len(total_data) >= CHUNK_SIZE:
                print(f"Time elapsed: {time.time() - start_time:.2f} seconds")  # Print time
                play_audio(total_data[:CHUNK_SIZE], audio_stream)  # Play first CHUNK_SIZE bytes
                total_data = total_data[CHUNK_SIZE:]  # Remove played data
    # Play remaining audio
    play_audio(total_data, audio_stream)
    audio_stream.stop_stream()
    audio_stream.close()
    pyaudio_instance.terminate()
def play_audio(data: bytes, stream: pyaudio.Stream) -> None:
    stream.write(AudioSegment.from_mp3(BytesIO(data)).raw_data)
if __name__ == "__main__":
    main()
Hello there,
I've been trying to implement fast audio streaming for 6 days now, but I just can't do it. Without the stream method, the save function of edge-tts takes about 1-2 s to generate the audio, depending on the text, which is too long.
In my code below, the audio is indeed played instantly, regardless of text size, but there are artefacts between chunks, like tiny gaps.
Do you know how to do audio streaming correctly, please? Thank you!
import asyncio
import edge_tts
import pyaudio
from io import BytesIO
from pydub import AudioSegment
TEXT = "Hello World!"
VOICE = "en-GB-SoniaNeural"
p = pyaudio.PyAudio()
async def stream_tts(text: str, voice: str) -> None:
if __name__ == "__main__":
    # Run the asyncio event loop
@rany2