vivian556123 / NeurIPS2024-CoVoMix

Official repo for CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations

How to combine the monologue and dialogue dataset for training CoMix ? #2

Closed tiamojames closed 3 weeks ago

tiamojames commented 1 month ago

[screenshots from the paper attached]

Hi, I noticed in your paper that CoMix is trained using a combination of monologue and dialogue data. However, the monologue dataset only contains WAV files and lacks accompanying transcript text files. I would like to know how to effectively combine the monologue and dialogue datasets for training CoMix.

Thank you for your assistance!

vivian556123 commented 1 month ago

Hi,

The provided code prepares the waveforms and transcripts as paired data. If it does not cover your case, please follow the original Fisher transcripts to prepare the pairs yourself. To use the monologue and dialogue data together, you can simply convert all of it into the same format and then train the CoMix model. If you want to use the monologue data to train a 2-channel CoMix, you can fill the second channel with silence tokens.
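As an illustration, a minimal sketch of filling the second channel with silence tokens so that monologue data matches the 2-channel format (the silence token ID and the stacking layout below are assumptions, not taken from this repo):

```python
import numpy as np

# Assumed silence token ID -- check which HuBERT unit your k-means model
# assigns to silence; this value is NOT taken from the CoVoMix code.
SILENCE_ID = 0

def monologue_to_two_channels(hubert_codes: np.ndarray) -> np.ndarray:
    """Pair a monologue token sequence with an all-silence second channel,
    so monologue data can share the 2-channel CoMix input format."""
    silence_channel = np.full_like(hubert_codes, SILENCE_ID)
    return np.stack([hubert_codes, silence_channel], axis=0)

# Example: a monologue of 5 tokens becomes a (2, 5) sample.
codes = np.array([17, 42, 42, 93, 5])
print(monologue_to_two_channels(codes))
# [[17 42 42 93  5]
#  [ 0  0  0  0  0]]
```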

tiamojames commented 1 month ago

Hi,

> The provided code prepares the waveforms and transcripts as paired data. If it does not cover your case, please follow the original Fisher transcripts to prepare the pairs yourself. To use the monologue and dialogue data together, you can simply convert all of it into the same format and then train the CoMix model. If you want to use the monologue data to train a 2-channel CoMix, you can fill the second channel with silence tokens.

Hi, thank you for your help with the T2S model. However, I am running into some issues with training the VoMix acoustic model. In covomix/data_module.py, particularly between lines 230 and 243, the code appears to expect data files with '-A.mel.npy' and '-A-16k.hubert_code.npy' suffixes. However, the data preparation script in this repo, process_fisher_data_conversation.py, only generates 8 kHz, 2-channel audio files and does not split them into '-A.wav' and '-B.wav' per channel.

This discrepancy leaves me unsure how the output of the data preparation script maps to what covomix/data_module.py expects. For example, are there additional steps needed to format the data correctly, or separate scripts I should use to create the required '-A' and '-B' files for each channel, particularly at a 16 kHz sampling rate?

Here are the screenshots showing the relevant parts of the covomix/data_module.py file and the expected file formats:

[screenshots of covomix/data_module.py and the expected file names]

Could you clarify the correct data preparation steps for the VoMix acoustic model, and how I should format the audio files to avoid errors in covomix/data_module.py? Thanks for your help.

tiamojames commented 1 month ago

This audio file was produced by running:

python data_preparation/process_fisher_data_conversation.py --audio_root=/home/node49_tmpdata2/hkxie/FisherEnglish --transcript_root=/home/node49_tmpdata2/hkxie/FisherEnglish --dest_root=/home/node49_tmpdata2/hkxie/FisherEnglish/dataset/Acous --remove_noises

vivian556123 commented 1 month ago

The -A.wav and -B.wav files can be extracted from the 2-channel waveform. You can separate the channels manually with the sox command.
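For example, one way to do the split is to call sox's remix effect once per channel; a small Python wrapper might look like this (file names are illustrative):

```python
import subprocess
from pathlib import Path

def split_channels(stereo_wav: str) -> None:
    """Split a 2-channel Fisher wav into per-speaker -A.wav / -B.wav files
    using sox's remix effect (channel 1 -> A, channel 2 -> B)."""
    stem = Path(stereo_wav).with_suffix("")
    subprocess.run(["sox", stereo_wav, f"{stem}-A.wav", "remix", "1"], check=True)
    subprocess.run(["sox", stereo_wav, f"{stem}-B.wav", "remix", "2"], check=True)

# e.g. split_channels("fe_03_00001.wav") -> fe_03_00001-A.wav, fe_03_00001-B.wav
```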

Thanks

tiamojames commented 1 month ago

> The -A.wav and -B.wav files can be extracted from the 2-channel waveform. You can separate the channels manually with the sox command.
>
> Thanks

Thank you for the clarification on creating the -A.wav and -B.wav files.

Regarding the *-A-16k.hubert_code.npy files, does this mean I need to upsample the audio from 8kHz to 16kHz specifically for tokenization? Additionally, should the Mel spectrograms still be extracted from the original 8kHz audio, or do both tokenization and Mel extraction require 16kHz samples?

vivian556123 commented 1 month ago

Sorry for the misleading file naming in my code. Since HuBERT is trained on 16 kHz data, the semantic tokens (hubert_code.npy) are extracted from 16 kHz utterances. For other features such as the mel-spectrogram, we still use 8 kHz.
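As a sketch, the upsampling step for HuBERT extraction could look like this with torchaudio (whether the repo itself uses torchaudio, sox, or something else for resampling is an assumption):

```python
import torchaudio
import torchaudio.functional as F

def load_for_hubert(path_8k: str):
    """Load an 8 kHz channel wav and upsample it to 16 kHz for HuBERT;
    the original 8 kHz waveform is still used for mel-spectrogram extraction."""
    wav, sr = torchaudio.load(path_8k)                      # (channels, samples)
    return F.resample(wav, orig_freq=sr, new_freq=16000)

# e.g. wav_16k = load_for_hubert("fe_03_00001-A.wav")  # feed to the HuBERT tokenizer
```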

tiamojames commented 1 month ago

> Sorry for the misleading file naming in my code. Since HuBERT is trained on 16 kHz data, the semantic tokens (hubert_code.npy) are extracted from 16 kHz utterances. For other features such as the mel-spectrogram, we still use 8 kHz.

That's alright. So I'll upsample the audio to 16 kHz for the HuBERT tokens and still use 8 kHz for the mel-spectrograms. Thanks for your answer!