open-mmlab / Amphion

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
https://openhlt.github.io/amphion/
MIT License
7.81k stars 590 forks

[Help]: How to use MaskGCT to generate translated audio from an English video? #335

Open KylinMountain opened 3 weeks ago

KylinMountain commented 3 weeks ago

Problem Overview

I have a video in which the speaker talks in English, and I want it dubbed in Chinese at the same speaking speed, so that the audio stays synchronized with the video.

How can I do that? Are there any instructions? Thank you.

synthere commented 3 weeks ago

The general steps might be: 1) extract the text from the audio of the English video, using a tool such as Whisper or FunASR; 2) translate the text into Chinese; 3) generate Chinese audio from the translated text from step 2, using a TTS tool such as MaskGCT; 4) resync the generated audio with the original video.
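The steps above could be wired together roughly like this. It is only a sketch: `Segment`, `DubbedSegment`, `dub_segments`, `fake_translate` and `fake_tts` are hypothetical names made up for illustration, not part of Whisper, FunASR or MaskGCT. The real ASR, translation and TTS calls would be plugged in as the two callables, and the point is just that each segment carries its original start time and duration through to the TTS step.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    """One timed chunk of recognized speech (step 1 output)."""
    start: float  # seconds from the start of the video
    end: float
    text: str

    @property
    def duration(self) -> float:
        return self.end - self.start


@dataclass
class DubbedSegment:
    """Translated audio plus the slot it has to fit into (input to step 4)."""
    start: float
    target_duration: float
    translated_text: str
    audio: List[float]


def dub_segments(
    segments: List[Segment],
    translate: Callable[[str], str],
    synthesize: Callable[[str, float], List[float]],
) -> List[DubbedSegment]:
    """Run steps 2 and 3 per segment, keeping the original timing."""
    dubbed = []
    for seg in segments:
        zh_text = translate(seg.text)                # step 2
        audio = synthesize(zh_text, seg.duration)    # step 3, with a duration hint
        dubbed.append(DubbedSegment(seg.start, seg.duration, zh_text, audio))
    return dubbed


# Stand-ins so the sketch runs without any model installed (assumptions only).
def fake_translate(text: str) -> str:
    return "[zh] " + text


def fake_tts(text: str, duration: float) -> List[float]:
    return [0.0] * round(duration * 10)  # silence at a toy 10 Hz sample rate


if __name__ == "__main__":
    segs = [Segment(0.0, 2.0, "hello"), Segment(2.5, 4.0, "world")]
    for d in dub_segments(segs, fake_translate, fake_tts):
        print(d.start, d.target_duration, d.translated_text, len(d.audio))
```

Working per segment rather than on the whole transcript keeps the timing information that step 4 needs.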

KylinMountain commented 3 weeks ago

@synthere If we do it this way, we can't copy the accent of the original audio or keep the TTS speed the same as in the original.

synthere commented 3 weeks ago

The accent can be cloned with the voice cloning function, and the TTS speed can also be adjusted. Actually, I built a video dubbing tool the other day, which you can try here: syntheredub
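One way the speed could be adjusted after synthesis is ffmpeg's `atempo` audio filter. The helper below is a hypothetical sketch (the name `atempo_chain` is made up) that works out the tempo factor needed to squeeze a generated clip into the original segment's duration. It assumes each `atempo` instance should stay within 0.5 to 2.0, which is the conservative range for older ffmpeg builds, so larger changes are chained.

```python
def atempo_chain(generated_sec: float, target_sec: float) -> str:
    """Build an ffmpeg audio filter string that stretches a clip of
    `generated_sec` seconds to last `target_sec` seconds."""
    if generated_sec <= 0 or target_sec <= 0:
        raise ValueError("durations must be positive")

    factor = generated_sec / target_sec  # > 1 speeds up, < 1 slows down
    parts = []
    # Split the factor so each stage stays within the assumed 0.5..2.0 range.
    while factor > 2.0:
        parts.append(2.0)
        factor /= 2.0
    while factor < 0.5:
        parts.append(0.5)
        factor /= 0.5
    parts.append(factor)
    return ",".join(f"atempo={p:.4f}" for p in parts)


if __name__ == "__main__":
    # A 6 s generated clip that has to fit a 4 s slot.
    print(atempo_chain(6.0, 4.0))
```

The resulting string would be passed to ffmpeg as an audio filter, e.g. `ffmpeg -i gen.wav -filter:a "atempo=1.5000" out.wav` (file names are placeholders). Large factors will sound unnatural, so this is better suited to small corrections.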

synthere commented 3 weeks ago

I also tried MaskGCT, which can control the target duration. But the resulting audio is not exactly aligned with the original, as shown below (top: the original audio; bottom: the generated audio). [image]

So precise alignment and resynchronization are sometimes necessary.
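A crude form of that resynchronization could look like the sketch below: every generated clip is forced into the exact time slot of its original segment (padded with silence if too short, cut if too long) and written into a silent track at the original start time. `fit_to_slot` and `assemble_track` are hypothetical helpers, audio is plain lists of float samples, and the hard cut is an assumption made to keep the example short; a real tool would more likely time-stretch than trim.

```python
from typing import List, Tuple

Clip = Tuple[float, float, List[float]]  # (start_sec, end_sec, samples)


def fit_to_slot(samples: List[float], slot_len: int) -> List[float]:
    """Trim or zero-pad `samples` to exactly `slot_len` samples."""
    if len(samples) >= slot_len:
        return samples[:slot_len]
    return samples + [0.0] * (slot_len - len(samples))


def assemble_track(clips: List[Clip], sample_rate: int, total_sec: float) -> List[float]:
    """Place each clip at its original start time on a silent track."""
    track = [0.0] * round(total_sec * sample_rate)
    for start, end, samples in clips:
        begin = round(start * sample_rate)
        if begin >= len(track):
            continue  # the slot starts after the end of the video
        slot_len = max(round(end * sample_rate) - begin, 0)
        fitted = fit_to_slot(samples, slot_len)
        stop = min(begin + slot_len, len(track))
        track[begin:stop] = fitted[: stop - begin]
    return track


if __name__ == "__main__":
    # Toy 10 Hz sample rate: one clip too short for its slot, one too long.
    clips = [(0.0, 0.5, [1.0] * 3), (1.0, 1.4, [2.0] * 6)]
    print(assemble_track(clips, sample_rate=10, total_sec=2.0))
```

Anchoring each clip to its original start time keeps drift from accumulating across segments, even when individual clips are slightly off in length.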