jasonppy / VoiceCraft

Zero-Shot Speech Editing and Text-to-Speech in the Wild

Add Replicate demo and API #73

Closed · chenxwh closed this 7 months ago

chenxwh commented 7 months ago

Hi @jasonppy,

Great work on VoiceCraft!

This pull request makes it possible to run VoiceCraft on Replicate (https://replicate.com/cjwbw/voicecraft) and via the API (https://replicate.com/cjwbw/voicecraft/api). Currently, both speech editing and TTS are in the same demo. We'd also like to transfer the demo page to you so you can make modifications easily, and we're happy to help maintain/integrate the upcoming changes and improve the demo :)
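For reference, calling the model through the Replicate Python client looks roughly like the sketch below. The version hash is a placeholder and the input field names are assumptions for illustration, not the demo's actual schema (check the API page above for that):

```python
# pip install replicate; needs REPLICATE_API_TOKEN set in the environment.
import replicate

output = replicate.run(
    "cjwbw/voicecraft:<version-hash>",  # placeholder version hash
    input={
        # Hypothetical field names -- see the API page for the real schema.
        "task": "zero-shot text-to-speech",
        "orig_audio": open("demo.wav", "rb"),
        "orig_transcript": "the original transcript of demo.wav",
        "target_transcript": "the text you want the speaker to say",
    },
)
print(output)  # typically a URL to the generated audio
```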

jasonppy commented 7 months ago

Thanks for the efforts, Chenxi! This looks like an excellent demo.

Right now it seems to be missing a few things: speech recognition; word alignment (without word alignment it's hard to decide the cut-off second, as it needs to be the end timestamp of a word); model selection (for now, 330M or 830M); and sample_batch_size is missing from the adjustable parameters (it controls the speaking rate of the output).

I don't mean to bombard you with a huge number of feature requests. There is a WIP PR for the Gradio app (https://github.com/jasonppy/VoiceCraft/pull/54); it has all of these features and mostly works, but sometimes shows weird behavior. I wonder if you could consider that PR (either help test it, or migrate its features to Replicate).

chenxwh commented 7 months ago

Hi @jasonppy,

Thanks for the suggestions, I have updated the demo accordingly: https://replicate.com/cjwbw/voicecraft. For alignment I am following your Colab examples using MFA (though I notice the Gradio demo does not seem to use MFA?). Let me know what you think, and I'm happy to modify further!
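For context, the MFA step in those examples boils down to a command like the one below (the corpus and output paths are placeholders; english_us_arpa is the stock dictionary/acoustic-model pair):

```bash
# Align a folder of wav+transcript pairs; writes TextGrids with word timestamps.
mfa align ./corpus english_us_arpa english_us_arpa ./aligned
```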

jasonppy commented 7 months ago

Thanks! The demo looks great.

I have a few comments:

  1. MFA is very slow: it can take 80% of the runtime of the whole system, whereas with WhisperX that drops to less than 20%, so I think WhisperX is worth trying (see the sketch after this list).
  2. Is it possible to not add audiocraft as a submodule? This is just to avoid changing all my other setup instructions (as I used pip install -e git+https://github.com/facebookresearch/audiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft).
  3. Can .cog/tmp be gitignored?
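On point 1, a minimal sketch of WhisperX's transcribe-then-align flow, assuming a CUDA device and a placeholder file name:

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("demo.wav")  # placeholder path

# Pass 1: transcription (text plus rough segment boundaries).
model = whisperx.load_model("base.en", device)
result = model.transcribe(audio, batch_size=8)

# Pass 2: forced alignment (per-word start/end timestamps).
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)

for word in aligned["segments"][0]["words"]:
    # Some tokens (e.g. numbers) may come back without timestamps.
    print(word["word"], word.get("start"), word.get("end"))
```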

Thanks!

chenxwh commented 7 months ago

Hi @jasonppy,

Sure thing, happy to make the modifications as suggested! Although I have a quick question about the difference between Whisper and WhisperX, and between the transcribe (https://huggingface.co/spaces/pyp1/VoiceCraft_gradio/blob/main/app.py#L126) and align (https://huggingface.co/spaces/pyp1/VoiceCraft_gradio/blob/main/app.py#L166) functions in the Gradio demo, as both seem to return the same thing. Is Whisper alone enough for the alignment, or does WhisperX provide better results?

Thank you!

chenxwh commented 7 months ago

Hi @jasonppy,

I just updated the demo to incorporate your suggestions. As per my previous message, I think there are some redundant functions in the Gradio demo, but the current Replicate demo replaces MFA with WhisperX and gives the option to choose a Whisper or WhisperX backbone. Let me know if anything else needs updating :)

jasonppy commented 7 months ago

Thanks! Really nice work!

I have tested both TTS and speech editing (substitution), and they worked well.

Four more requests:

  1. Is it possible to remove the audiocraft submodule? I can see that you do pip install -e git+https://github.com/facebookresearch/audiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft in cog.yaml, which should install audiocraft in ./src, so the submodule should be redundant?

  2. Is it possible to make it a two-step process for both editing and TTS, for a better user experience (see the sketch after this list)?

    Step 1: run WhisperX to transcribe and align, and show the transcription and word-alignment timestamps to the user, so that for TTS the user can choose the cut-off word/timestamp, and for speech editing the user can write down the target transcript more easily. The user can also modify the transcript in case of a mistake by WhisperX, and re-align based on the modified transcript.

    Step 2: run the VoiceCraft model based on the user input.

This is basically what the Gradio demo does, but the Gradio demo is not as intuitive for users. I think your Replicate demo could keep this two-stage workflow, but with a more intuitive and user-friendly interface.

  3. I have updated the instructions for the hyperparameters in Gradio. In your case, the parameters users might want to adjust are stop_repetition (default 3 for TTS, -1 for editing), sample_batch_size (controls speech rate; default 4 for TTS, 1 for editing), maybe also the left/right margins (default 0.08), topp (default 0.9 for TTS and 0.8 for speech editing), and kvcache (default 1; if the model is always running on an A40, VRAM shouldn't be an issue). It would be nice if you could bring these to the front of the hyperparameter list (the sketch after this list collects these defaults).

  4. You could delete Whisper and keep only WhisperX, as the latter is strictly better than the former (faster, more accurate alignment, less VRAM consumption).
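To make requests 2 and 3 concrete, here is a rough sketch of the two-step flow. The helper functions and their signatures are hypothetical; only the parameter names and default values come from the list above:

```python
# Defaults as listed above (request 3).
TTS_DEFAULTS = dict(stop_repetition=3, sample_batch_size=4, topp=0.9, kvcache=1)
EDIT_DEFAULTS = dict(stop_repetition=-1, sample_batch_size=1,
                     left_margin=0.08, right_margin=0.08, topp=0.8, kvcache=1)

def step1(audio_path):
    """Transcribe and align (e.g. with WhisperX, as sketched earlier) and
    return the transcript plus per-word timestamps for the user to inspect
    and correct."""
    words = transcribe_and_align(audio_path)  # hypothetical helper
    return " ".join(w["word"] for w in words), words

def step2(audio_path, words, cutoff_word_idx, target_transcript, mode="tts"):
    """Run VoiceCraft on the user-confirmed inputs. The cut-off second must
    be the END timestamp of a word, so it is read off the alignment."""
    cut_off_sec = words[cutoff_word_idx]["end"]
    params = TTS_DEFAULTS if mode == "tts" else EDIT_DEFAULTS
    return run_voicecraft(  # hypothetical wrapper around the model
        audio_path, target_transcript, cut_off_sec=cut_off_sec, **params)
```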

By the way, I also have some strong fine-tuned TTS models forthcoming, and I will announce them along with the community efforts on Replicate, Gradio, command-line support, multi-span editing, VRAM reduction, etc.

Thanks

chenxwh commented 7 months ago

Hi @jasonppy, thanks for the comments!

I actually realised that in cog.yaml, pip install -e git+https://github.com/facebookresearch/audiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft does not install the library properly and leads to an audiocraft import error, hence I am using git clone https://github.com/facebookresearch/audiocraft && pip install -e ./audiocraft as a workaround.
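In cog.yaml terms, the workaround sits in the build.run list, roughly like the snippet below (the surrounding fields are assumed for illustration, not copied from the actual file):

```yaml
build:
  gpu: true
  run:
    - git clone https://github.com/facebookresearch/audiocraft && pip install -e ./audiocraft
predict: "predict.py:Predictor"
```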

Regarding the multi-step process: it would indeed be a great feature and a better user experience, but unfortunately Replicate demos only support end-to-end inference.

I have now updated the parameter descriptions and also removed the Whisper models :)