This is the repository for the paper "VoiceLDM: Text-to-Speech with Environmental Context" (ICASSP 2024).
VoiceLDM extends text-to-audio models so that they are also capable of generating linguistically intelligible speech.
[2024/05 Update] I have now added the code for training VoiceLDM! Refer to the Training section below for more details.
pip install git+https://github.com/glory20h/VoiceLDM.git
OR
git clone https://github.com/glory20h/VoiceLDM.git
cd VoiceLDM
pip install -e .
Generate audio with a description prompt and a content prompt:
python generate.py --desc_prompt "She is talking in a park." --cont_prompt "Good morning! How are you feeling today?"
Generate audio with an audio prompt and a content prompt:
python generate.py --audio_prompt "whispering.wav" --cont_prompt "Good morning! How are you feeling today?"
Text-to-Speech Example:
python generate.py --desc_prompt "clean speech" --cont_prompt "Good morning! How are you feeling today?" --desc_guidance_scale 1 --cont_guidance_scale 9
Text-to-Audio Example:
python generate.py --desc_prompt "trumpet" --cont_prompt "_" --desc_guidance_scale 9 --cont_guidance_scale 1
Generated audio files will be saved to the default output folder, ./outputs.
It is crucial to adjust the weights for dual classifier-free guidance appropriately; we find that this adjustment greatly influences the likelihood of obtaining satisfactory results. Here are some key tips (a sketch of how the two scales interact follows the list):
- Different prompts respond best to different weight settings. Experiment with the weights to find the combination that suits your specific use case.
- Setting both desc_guidance_scale and cont_guidance_scale to 7 is a good starting point.
- If the generated audio doesn't align well with the provided content prompt, try decreasing desc_guidance_scale and increasing cont_guidance_scale.
- If the generated audio doesn't align well with the provided description prompt, try decreasing cont_guidance_scale and increasing desc_guidance_scale.
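For intuition on why the two scales trade off like this, below is a minimal sketch of one common way dual classifier-free guidance combines a diffusion model's noise predictions. It is illustrative only: the tensor names are placeholders, and the exact formulation VoiceLDM uses is the one defined in the paper.

```python
import torch

def dual_cfg(eps_both, eps_desc_only, eps_cont_only, eps_uncond,
             desc_scale, cont_scale):
    """Combine diffusion noise predictions under two guidance scales.

    eps_both:      prediction conditioned on both prompts
    eps_desc_only: conditioned on the description prompt only
    eps_cont_only: conditioned on the content prompt only
    eps_uncond:    unconditional prediction (both prompts dropped)
    """
    # Each scale amplifies the direction contributed by its own prompt,
    # which is why raising one while lowering the other shifts the output
    # toward that prompt.
    return (eps_uncond
            + desc_scale * (eps_both - eps_cont_only)   # description direction
            + cont_scale * (eps_both - eps_desc_only))  # content direction

# Toy demo: random tensors stand in for real noise predictions.
preds = [torch.randn(1, 8, 16, 16) for _ in range(4)]
guided = dual_cfg(*preds, desc_scale=7.0, cont_scale=7.0)
print(guided.shape)
```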
View the full list of options with the following command:
python generate.py -h
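Because the best scales are prompt-dependent, a small sweep can save time when tuning. Here is a sketch of one way to automate it, using only the generate.py flags shown above; how generate.py names its output files is not handled here, so inspect ./outputs between runs if files would otherwise be overwritten.

```python
# Sweep a few guidance-scale combinations for the same pair of prompts.
import subprocess

for desc in (1, 4, 7):
    for cont in (4, 7, 9):
        subprocess.run(
            ["python", "generate.py",
             "--desc_prompt", "She is talking in a park.",
             "--cont_prompt", "Good morning! How are you feeling today?",
             "--desc_guidance_scale", str(desc),
             "--cont_guidance_scale", str(cont)],
            check=True,
        )
```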
The CSV files for the processed dataset used to train VoiceLDM can be found here. These files include transcriptions generated with the Whisper model.
- as_speech_en.csv (English speech segments from AudioSet)
- cv1.csv, cv2.csv (English speech segments from CommonVoice 13.0 en; split into two files to meet GitHub's file size limit)
- voxceleb.csv (English speech segments from VoxCeleb1)
- as_noise.csv (non-speech segments from AudioSet)
- noise_demand.csv (non-speech segments from DEMAND)
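Before training, it can help to check that the file_path entries in these CSVs (see the configuration step below) resolve to real files. A minimal sketch, assuming pandas is installed and that the file_path values are relative to your dataset root:

```python
# Verify that every file_path listed in a CSV exists under the dataset root.
from pathlib import Path
import pandas as pd

DATA_ROOT = Path("/path/to/your/dataset")  # adjust to your setup

df = pd.read_csv("as_speech_en.csv")
missing = [p for p in df["file_path"] if not (DATA_ROOT / p).is_file()]
print(f"{len(missing)} of {len(df)} files missing")
```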
If you wish to train the model yourself, follow these steps:

1. Configuration Setup (the trickiest part):
   - Look inside the configs folder for the necessary configuration files. For example, VoiceLDM-M.yaml is used for training the VoiceLDM-M model in the paper.
   - Set "paths" and "noise_paths" to the root path of your dataset. Also, take a look at the CSV files and make sure the file_path entries match the actual file paths in your dataset.
   - Set cv_csv_path1, cv_csv_path2, as_speech_en_csv_path, voxceleb_csv_path, as_noise_csv_path, and noise_demand_csv_path in the YAML file. You may leave any of them blank if you do not wish to use the corresponding CSV file and training data. (A rough sketch of these entries follows below.)
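As a rough illustration, the relevant YAML entries might look like the following. This is a hand-written sketch, not a copy of the shipped VoiceLDM-M.yaml; the actual key nesting, values, and any additional fields should be taken from the file in the configs folder.

```yaml
# Illustrative sketch only -- see configs/VoiceLDM-M.yaml for the real layout.
paths: /data/voiceldm              # root path of the speech data
noise_paths: /data/voiceldm_noise  # root path of the noise data

cv_csv_path1: csv/cv1.csv
cv_csv_path2: csv/cv2.csv
as_speech_en_csv_path: csv/as_speech_en.csv
voxceleb_csv_path: csv/voxceleb.csv
as_noise_csv_path: csv/as_noise.csv
noise_demand_csv_path: ""          # leave blank to skip the DEMAND data
```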
2. Configure Hugging Face Accelerate:
accelerate config
This enables CPU, single-GPU, and multi-GPU training. Follow the on-screen instructions to configure your hardware settings.
3. Start Training:
accelerate launch train.py --config configs/VoiceLDM-M.yaml
Checkpoints will be saved in the results folder.

4. Running Inference:
python generate.py --ckpt_path results/VoiceLDM-M/checkpoints/checkpoint_49/pytorch_model.bin --desc_prompt "She is talking in a park." --cont_prompt "Good morning! How are you feeling today?"
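To quickly check a result, the generated file can be loaded with any standard audio library. A small sketch; the exact filename under ./outputs is an assumption here, so substitute whatever generate.py actually produced:

```python
# Inspect a generated file's duration and sample rate.
import soundfile as sf

audio, sr = sf.read("outputs/output.wav")  # substitute the actual filename
print(f"{len(audio) / sr:.2f} s at {sr} Hz")
```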
This work would not have been possible without the following repositories: