Closed: chenxwh closed this 7 months ago
Thanks for the efforts Chenxi! This looks like an excellent demo.
Right now it seems to be missing a few things: speech recognition; word alignment (without word alignment it's hard to decide the cutoff second, since it needs to be the end timestamp of a word); model selection (for now, 330M or 830M); and sample_batch_size, which is missing from the adjustable parameters (it controls the speaking rate of the output).
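For reference, the cutoff logic described above (the cutoff second must be the end timestamp of a word, not an arbitrary time) can be sketched with plain word alignments; the word list and timestamps here are made-up illustration, not output from the actual demo:

```python
# Given word-level alignments as (word, start_sec, end_sec) tuples, the
# cutoff second for TTS must be the end timestamp of a chosen word.
def cutoff_second(alignments, cutoff_word_index):
    """Return the end timestamp of the word at cutoff_word_index."""
    word, start, end = alignments[cutoff_word_index]
    return end

# Hypothetical alignment output for a short voice prompt.
alignments = [
    ("but", 0.00, 0.16),
    ("when", 0.16, 0.32),
    ("i", 0.32, 0.40),
    ("had", 0.40, 0.58),
    ("approached", 0.58, 1.12),
]
print(cutoff_second(alignments, 2))  # prints 0.4 (end of "i")
```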
I didn't mean to bombard you with a huge amount of feature requests. There is a WIP PR for the Gradio app, https://github.com/jasonppy/VoiceCraft/pull/54; it has all the features and is mostly working, but sometimes shows weird behaviors. I wonder if you would consider that PR (either by helping test it, or by migrating its features to Replicate)?
Hi @jasonppy,
Thanks for the suggestions, I have uploaded the demo accordingly: https://replicate.com/cjwbw/voicecraft. For alignment I am following your Colab examples using `mfa` (though I notice the Gradio demo does not seem to use `mfa`?). Let me know what you think, and I'm happy to modify further!
Thanks! The demo looks great
I have a few comments:
`pip install -e git+https://github.com/facebookresearch/audiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft`
Thanks!
hi @jasonppy,
Sure thing, happy to make the modifications as suggested! Although I have a quick question about the differences between using Whisper and WhisperX, as well as between the `transcribe` (https://huggingface.co/spaces/pyp1/VoiceCraft_gradio/blob/main/app.py#L126) and `align` (https://huggingface.co/spaces/pyp1/VoiceCraft_gradio/blob/main/app.py#L166) functions in the Gradio demo: both seem to return the same thing? Is Whisper alone enough for the alignment, or does WhisperX provide better results?
Thank you!
Hi @jasonppy,
I just updated the demo incorporating your suggestions. As per my previous message, I think there are some redundant functions in the Gradio demo, but the current Replicate demo replaces `mfa` with `whisperx` and gives the option to choose a Whisper or WhisperX backbone. Let me know if anything else needs updating :)
Thanks! Really nice work!
I have tested both TTS and speech editing (substitution), and they worked out well.
Three more requests:
Is it possible to remove the audiocraft submodule? I can see that you did `pip install -e git+https://github.com/facebookresearch/audiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraf` in `cog.yaml`, which should install audiocraft in `./src`, making the submodule redundant?
Is it possible to make it a two-step process for both editing and TTS, for a better user experience?
step 1: run whisperx to transcribe and align, and show the transcription and word-alignment timestamps to the user, so that for TTS the user can choose the cutoff word/timestamp, and for speech editing the user can write down the target transcript more easily. The user can also modify the transcript in case of mistakes by whisperx, and re-align based on the modified transcript.
step 2: run the voicecraft model based on the user input.
This is basically what the Gradio demo is doing, but the Gradio demo is not as intuitive for users. I think your Replicate demo could keep this 2-stage workflow, but with a more intuitive and user-friendly interface.
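The two-step flow described above could be sketched roughly as follows; the function names, return shapes, and data are hypothetical stand-ins (the real step 1 would call whisperx, and step 2 the VoiceCraft model), not the actual Replicate/Gradio code:

```python
# Step 1 (hypothetical): transcribe + align, returning an editable
# transcript with word timestamps so the user can pick a cutoff
# word (for TTS) or edit the target text (for speech editing).
def step1_transcribe_align(audio_path):
    # In the real demo this would run whisperx; here we return dummy data.
    return {
        "transcript": "but when i had approached",
        "words": [("but", 0.00, 0.16), ("when", 0.16, 0.32),
                  ("i", 0.32, 0.40), ("had", 0.40, 0.58),
                  ("approached", 0.58, 1.12)],
    }

# Step 2 (hypothetical): run the model with the user-confirmed inputs.
def step2_run_model(audio_path, transcript, cutoff_sec, target_text):
    # Placeholder for the actual VoiceCraft inference call.
    return f"generated audio for '{target_text}' cut at {cutoff_sec}s"

aligned = step1_transcribe_align("prompt.wav")
cutoff = aligned["words"][2][2]   # user picks the end of "i" -> 0.4
out = step2_run_model("prompt.wav", aligned["transcript"], cutoff,
                      "i had a wonderful day")
print(out)
```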
I have updated the instructions for hyperparameters in Gradio. In your case, the parameters that users might want to adjust are: stop_repetition (default 3 for TTS, -1 for editing), sample_batch_size (controls speech rate; default 4 for TTS, 1 for editing), maybe also the left and right margins (default 0.08), top_p (default 0.9 for TTS and 0.8 for speech editing), and kvcache (default 1; since the model always runs on an A40, VRAM shouldn't be an issue). It would be nice if you could bring these to the front of the hyperparameter list.
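For quick reference, the suggested defaults above can be collected per mode; the values are copied from this comment, and the exact parameter names are an assumption (they follow the names used in the VoiceCraft demos):

```python
# Suggested hyperparameter defaults from the comment above, per mode.
DEFAULTS = {
    "tts": {
        "stop_repetition": 3,
        "sample_batch_size": 4,   # controls speaking rate
        "left_margin": 0.08,
        "right_margin": 0.08,
        "top_p": 0.9,
        "kvcache": 1,             # fine when running on an A40
    },
    "editing": {
        "stop_repetition": -1,
        "sample_batch_size": 1,
        "left_margin": 0.08,
        "right_margin": 0.08,
        "top_p": 0.8,
        "kvcache": 1,
    },
}
```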
You could delete whisper and keep only whisperx, as the latter is strictly better than the former (faster, more accurate alignment, less VRAM consumption).
Btw, I also have some strong TTS fine-tuned models forthcoming, and will announce those along with the community efforts on Replicate, Gradio, command-line support, multi-span editing, VRAM reduction, etc.
Thanks
Hi @jasonppy thanks for the comments!
I actually realised that for `cog.yaml`, `pip install -e git+https://github.com/facebookresearch/audiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraf` does not install the lib properly and shows an audiocraft import error, hence I am using `git clone https://github.com/facebookresearch/audiocraft && pip install -e ./audiocraft` as a workaround.
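For illustration, the workaround might look like this inside `cog.yaml` (a sketch only; the surrounding build configuration is elided, and this clone is not pinned to a specific commit):

```yaml
build:
  run:
    # pip install -e with the git+https URL failed to import properly,
    # so clone the repo and install it in editable mode directly:
    - git clone https://github.com/facebookresearch/audiocraft
    - pip install -e ./audiocraft
```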
Regarding the multi-step process: indeed it would be a great feature and a better user experience; unfortunately, however, the Replicate demo only supports end-to-end inference.
I have now updated the parameter descriptions and also removed the Whisper models :)
Hi @jasonppy,
Great work on VoiceCraft! This pull request makes it possible to run VoiceCraft on Replicate (https://replicate.com/cjwbw/voicecraft) and via the API (https://replicate.com/cjwbw/voicecraft/api). Currently, both speech editing and TTS are in the same demo. We'd also like to transfer the demo page to you so you can make modifications easily, and we're happy to help maintain/integrate the upcoming changes and improve the demo :)