jianchang512 / pyvideotrans

Translate the video from one language to another and add dubbing. API calls are also supported.
https://pyvideotrans.com
GNU General Public License v3.0
10.21k stars 1.13k forks

Hoping the TTS feature can be improved #418

Closed extremk closed 2 months ago

extremk commented 4 months ago

Earlier I saw a similar video-dubbing tool, and it gave me an idea for tackling the problem of poor TTS voice quality.

Submitting one sentence per TTS request means that newer voices such as Azure's Xiaoxiao Multilingual cannot use context to improve the result.

The way around this is to submit 50 sentences at a time instead of one, with an 8-second pause after each sentence, then detect the pauses and split the audio wherever a pause reaches 7 seconds.

Reference code for detecting the pauses with ffmpeg:

```python
import subprocess

def detect_silence(input_file, silence_threshold="-42dB", silence_duration="5"):
    command = [
        "ffmpeg",
        "-i", input_file,
        "-af", f"silencedetect=noise={silence_threshold}:d={silence_duration}",
        "-f", "null", "-"
    ]
    # silencedetect writes its report to stderr
    result = subprocess.run(command, text=True, capture_output=True, encoding='utf-8')
    return result.stderr
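
The stderr report returned by `detect_silence` still has to be turned into the `(start, end)` pairs that the splitting step expects. A minimal sketch of that glue, assuming the usual `silence_start:` / `silence_end:` lines printed by ffmpeg's silencedetect filter (the `parse_silence_intervals` name and the 7-second default are illustrative, not part of the original code):

```python
import re

def parse_silence_intervals(ffmpeg_stderr, min_pause=7.0):
    """Parse silencedetect stderr into (start, end) pairs at least min_pause seconds long."""
    starts = [float(m) for m in re.findall(r"silence_start: ([\d.]+)", ffmpeg_stderr)]
    ends = [float(m) for m in re.findall(r"silence_end: ([\d.]+)", ffmpeg_stderr)]
    # keep only pauses long enough to be the deliberate 8-second breaks
    return [(s, e) for s, e in zip(starts, ends) if e - s >= min_pause]
```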

Reference code for splitting the audio:

```python
import os
import subprocess

def split_audio(input_file, silence_intervals, output_folder):
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    audio_segments = []
    previous_end = 0

    base_name = os.path.basename(input_file)
    name_without_ext = os.path.splitext(base_name)[0]
    sub_folder = os.path.join(output_folder, name_without_ext)
    if not os.path.exists(sub_folder):
        os.makedirs(sub_folder)

    for i, (silence_start, silence_end) in enumerate(silence_intervals):
        segment_filename = os.path.join(sub_folder, f"{name_without_ext}_{i + 1:04d}.wav")
        audio_segments.append(segment_filename)
        command = [
            "ffmpeg",
            "-n",  # do not overwrite existing files
            "-i", input_file,
            "-ss", str(previous_end),
            "-to", str(silence_start),
            # "-af", "silenceremove=start_periods=1:start_silence=0.1:start_threshold=-42dB:detection=rms,"
            #        "silenceremove=stop_periods=1:stop_silence=0.1:stop_threshold=-42dB:detection=rms",
            segment_filename
        ]
        subprocess.run(command, encoding='utf-8')
        previous_end = silence_end

    # the final segment: from the last silence to the end of the file
    segment_filename = os.path.join(sub_folder, f"{name_without_ext}_{len(silence_intervals) + 1:04d}.wav")
    audio_segments.append(segment_filename)
    command = [
        "ffmpeg",
        "-n",  # do not overwrite existing files
        "-i", input_file,
        "-ss", str(previous_end),
        "-af", "silenceremove=start_periods=1:start_silence=0.1:start_threshold=-42dB:detection=rms,"
               "silenceremove=stop_periods=1:stop_silence=0.1:stop_threshold=-42dB:detection=rms",
        segment_filename
    ]
    subprocess.run(command, encoding='utf-8')

    return audio_segments
```

Reference code for synthesis with pauses:

```python
import azure.cognitiveservices.speech as speechsdk

def synthesize_speech_with_pause(srt, filename):
    # First build the SSML for the text to be synthesized
    ssml = ("<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' "
            "xmlns:mstts='http://www.w3.org/2001/mstts' xml:lang='zh-CN'>"
            "<voice name='zh-CN-XiaoxiaoMultilingualNeural'>")  # replace with the voice of your choice

    for event in srt.events:
        if event.text.strip():  # skip empty sentences
            ssml += f"<p>{event.text.strip()}</p><break time='8s'/>"
    ssml += "</voice></speak>"

    # Subscription key and region for the speech service
    subscription_key = ""
    region = ""

    # Create the speech configuration object
    speech_config = speechsdk.SpeechConfig(subscription=subscription_key, region=region)
    speech_config.set_speech_synthesis_output_format(speechsdk.SpeechSynthesisOutputFormat.Riff48Khz16BitMonoPcm)

    # Audio output configuration that saves the audio to a file
    audio_output = speechsdk.audio.AudioOutputConfig(filename=filename)

    # Create the synthesizer with the audio output configuration
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_output)

    # Synthesize the speech and save it to the file
    result = synthesizer.speak_ssml_async(ssml).get()

    # Check the result
    if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        print(f"Synthesis succeeded, file saved as {filename}")
    elif result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = result.cancellation_details
        print(f"Synthesis canceled: {cancellation_details.reason}")
        if cancellation_details.reason == speechsdk.CancellationReason.Error:
            print(f"Error details: {cancellation_details.error_details}")
```

The remaining steps are the same as before: keep processing the split audio files the way they were handled previously. This effectively fixes the problem of inconsistent TTS style between sentences, and in particular the unpleasant sound of the Xiaoxiao Multilingual voice.

extremk commented 4 months ago

Submitting many sentences at once lets the new neural-network TTS make full use of context to refine the output, so the dubbing sounds much more natural. The traditional approach submits one sentence at a time: 500 lines means 500 requests, and no context can be used because every sentence is cut off. Azure can synthesize at most 10 minutes of speech per request, pauses included. I suggest using PCM (wav) as the intermediate file format, and setting `speech_config.set_speech_synthesis_output_format(speechsdk.SpeechSynthesisOutputFormat.Riff48Khz16BitMonoPcm)` during synthesis; the resulting audio quality is much better.

For the maximum number of lines per request, I suggest computing it dynamically: count roughly 1 second per Chinese character, add an 8-10 second pause per line, and take as many lines as fit within 600 seconds per request. This lets the neural TTS exploit as much context as possible. For the second batch, overlap the last sentence of the previous batch and discard the overlapping part of the synthesized audio; this preserves cross-batch context and keeps the dubbing natural.
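
The batching rule above can be sketched roughly as follows, assuming each subtitle line is a plain string, ~1 second of speech per character, and an 8-second pause per line (the `make_batches` name and exact accounting are illustrative, not from the original):

```python
def make_batches(lines, pause=8, max_seconds=600):
    """Group subtitle lines into batches that fit Azure's ~10-minute limit."""
    batches, batch, total = [], [], 0
    for line in lines:
        cost = len(line) + pause  # ~1 second per character, plus the pause
        if batch and total + cost > max_seconds:
            batches.append(batch)
            # overlap: carry the last line of the previous batch into the next
            # batch for context; its synthesized audio is discarded afterwards
            batch, total = [batch[-1]], len(batch[-1]) + pause
        batch.append(line)
        total += cost
    if batch:
        batches.append(batch)
    return batches
```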