[Help]: How to use MaskGCT to generate translated audio from an English video?

KylinMountain commented 3 weeks ago

Problem Overview

I have a video in which the speaker talks in English, and I want to dub it in Chinese at the same speaking speed, keeping the audio synchronized with the video.

How can I do that? Are there any instructions? Thank you.

synthere commented 3 weeks ago

The general steps would be:

1) Extract the text from the English audio in the video, using an ASR tool such as Whisper or FunASR.
2) Translate the text into Chinese.
3) Generate Chinese audio from the translated text in 2), using a TTS tool such as MaskGCT.
4) Resync the generated audio with the original video.
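The steps above can be sketched roughly as follows. This is only a minimal outline, not a working dubbing tool: `Segment`, `dub_segments`, `translate`, and `synthesize` are hypothetical names, and the ASR, translation, and TTS backends are passed in as plain callables so that any tool (Whisper, FunASR, MaskGCT, and so on) can be plugged in. It assumes the ASR step yields segments with start and end timestamps, so each translated segment can be synthesized with the original segment's duration as its target.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Segment:
    """One timestamped utterance from the ASR step (step 1)."""
    start: float  # seconds from the beginning of the video
    end: float
    text: str

    @property
    def duration(self) -> float:
        return self.end - self.start


def dub_segments(
    segments: List[Segment],
    translate: Callable[[str], str],          # step 2: English text -> Chinese text
    synthesize: Callable[[str, float], Any],  # step 3: (text, target seconds) -> audio
) -> List[Tuple[float, str, Any]]:
    """Translate and synthesize each segment, keeping its original start time.

    Returns (start_time, translated_text, audio) tuples; step 4 (resync)
    would place each audio clip at start_time on the video timeline.
    """
    dubbed = []
    for seg in segments:
        zh_text = translate(seg.text)
        # Ask the TTS backend for audio roughly as long as the original segment.
        audio = synthesize(zh_text, seg.duration)
        dubbed.append((seg.start, zh_text, audio))
    return dubbed
```

Working per segment rather than on the whole transcript keeps timing drift local: an error in one segment's duration does not shift every segment after it.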

KylinMountain commented 3 weeks ago

@synthere If we do it this way, we can't copy the accent of the original audio, and we can't make the TTS speed match the original.

synthere commented 3 weeks ago

The accent can be cloned using the voice cloning function, and the TTS speed can also be adjusted. Actually, I built a video dubbing tool the other day, which you can try here: syntheredub

synthere commented 3 weeks ago

I also tried MaskGCT, which can control the target duration. However, the resulting audio is not exactly aligned with the original, as shown below (top: original audio; bottom: generated audio).

So precise alignment and resynchronization are sometimes necessary.
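One simple way to do this resynchronization is to time-stretch each generated clip so that it fits the original segment. The sketch below is an illustration under stated assumptions, not the method used in the tool mentioned above: `tempo_factor` and `ffmpeg_stretch_cmd` are hypothetical helpers, the 0.8 to 1.25 clamp is an arbitrary choice to keep speech sounding natural, and the stretch itself is delegated to ffmpeg's `atempo` audio filter.

```python
from typing import List


def tempo_factor(generated_sec: float, target_sec: float,
                 lo: float = 0.8, hi: float = 1.25) -> float:
    """Playback-speed factor that makes the generated clip last target_sec.

    A factor above 1 speeds the clip up, below 1 slows it down.
    The result is clamped to [lo, hi] (an assumed comfort range), so very
    large mismatches are only partially corrected.
    """
    if generated_sec <= 0 or target_sec <= 0:
        raise ValueError("durations must be positive")
    return max(lo, min(hi, generated_sec / target_sec))


def ffmpeg_stretch_cmd(src: str, dst: str, factor: float) -> List[str]:
    """Build an ffmpeg command that changes tempo without changing pitch."""
    return ["ffmpeg", "-y", "-i", src,
            "-filter:a", f"atempo={factor:.4f}", dst]


# Example: a 3.0 s generated clip must fit a 2.5 s original segment.
factor = tempo_factor(3.0, 2.5)
cmd = ffmpeg_stretch_cmd("seg_001_zh.wav", "seg_001_fit.wav", factor)
# The command can then be run with subprocess.run(cmd, check=True).
```

When the clamp is hit, the remaining mismatch has to be handled some other way, for example by regenerating the segment with a different target duration or by shortening the translation.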
