huawei-noah / Speech-Backbones

This is the main repository of open-sourced speech technology by Huawei Noah's Ark Lab.

Model training question #26

Open Cpgrach opened 1 year ago

Cpgrach commented 1 year ago

Hi, thanks for sharing the code. I have a folder with wav files from different speakers, but I don't understand what to do next to get a trained model. What type of files should be in the "mels" and "embeds" folders, and how exactly do I fill them? Are there more detailed instructions somewhere?

zwan074 commented 1 year ago

I have the same question about the input data format. Please add more detailed instructions.

li1jkdaw commented 1 year ago

Hi! Sorry for such a brief description of the training process in the readme and for such a late response (I hope it will still be useful to put it here).

The whole folder structure in your data directory data_dir (this is the directory that you set in train_enc.py and train_dec.py before training starts) should look like this:

data_dir/wavs/spk1/spk1_000001.wav, data_dir/wavs/spk1/spk1_000002.wav and all other wav files for speaker spk1; then data_dir/wavs/spk2/spk2_abc.wav, data_dir/wavs/spk2/spk2_xyz.wav and all other wav files for speaker spk2; and so on for all of your speakers. The important thing is that the filename of each wav file should start with its speaker name followed by "_"; the remaining part can be any string uniquely identifying the corresponding wav file.

As for the mels and embeds subfolders, they should have the same structure: data_dir/mels/spk1/spk1_000001_mel.npy, data_dir/mels/spk1/spk1_000002_mel.npy, ..., data_dir/mels/spk2/spk2_abc_mel.npy, data_dir/mels/spk2/spk2_xyz_mel.npy, ..., and likewise data_dir/embeds/spk1/spk1_000001_embed.npy, data_dir/embeds/spk1/spk1_000002_embed.npy, ..., data_dir/embeds/spk2/spk2_abc_embed.npy, data_dir/embeds/spk2/spk2_xyz_embed.npy, ... The important thing here is that the npy file containing the mel-spectrogram of a wav file should have the same name as that wav file with "_mel" appended. The same holds for the npy files containing speaker embeddings: their names should have "_embed" appended.
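For illustration, using the example filenames above, the resulting layout for decoder training would look like this:

```
data_dir/
├── wavs/
│   ├── spk1/
│   │   ├── spk1_000001.wav
│   │   └── spk1_000002.wav
│   └── spk2/
│       ├── spk2_abc.wav
│       └── spk2_xyz.wav
├── mels/
│   ├── spk1/
│   │   ├── spk1_000001_mel.npy
│   │   └── spk1_000002_mel.npy
│   └── spk2/
│       ├── spk2_abc_mel.npy
│       └── spk2_xyz_mel.npy
└── embeds/
    ├── spk1/
    │   ├── spk1_000001_embed.npy
    │   └── spk1_000002_embed.npy
    └── spk2/
        ├── spk2_abc_embed.npy
        └── spk2_xyz_embed.npy
```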

Mel-spectrograms and speaker embeddings can be computed from the wav files to fill the mels and embeds subfolders with the functions get_mel and get_embed, respectively, both defined in the jupyter notebook inference.ipynb. These functions return numpy arrays that should be saved with np.save.
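As a rough sketch of that preprocessing step (assuming get_mel and get_embed from inference.ipynb are available in the current session and each takes a path to a wav file; the data_dir path is a placeholder):

```python
import glob
import os
import numpy as np

# Assumption: get_mel and get_embed (defined in inference.ipynb) are available
# in this session and each takes a path to a wav file.

data_dir = 'data_dir'  # set this to your actual data directory

for wav_path in sorted(glob.glob(os.path.join(data_dir, 'wavs', '*', '*.wav'))):
    spk = os.path.basename(os.path.dirname(wav_path))        # e.g. "spk1"
    name = os.path.splitext(os.path.basename(wav_path))[0]   # e.g. "spk1_000001"

    os.makedirs(os.path.join(data_dir, 'mels', spk), exist_ok=True)
    os.makedirs(os.path.join(data_dir, 'embeds', spk), exist_ok=True)

    mel = get_mel(wav_path)      # numpy array with the mel-spectrogram
    embed = get_embed(wav_path)  # numpy array with the speaker embedding

    np.save(os.path.join(data_dir, 'mels', spk, name + '_mel.npy'), mel)
    np.save(os.path.join(data_dir, 'embeds', spk, name + '_embed.npy'), embed)
```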

After you do that, you can write some wav filenames (without ".wav") to filelists/valid.txt to use them for validation. Also, if for some reason you don't want specific wavs to be used during training, you can add them in the same format to filelists/exceptions.txt; otherwise you can leave this file empty. The paths to valid.txt and exceptions.txt should be set in train_dec.py (variables val_file and exc_file, respectively), along with the path to the data directory data_dir. Below these paths, train_dec.py also lists the training parameters (such as epochs, batch_size and learning_rate). Some other important model hyperparameters can be set in params.py.
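For example, with the illustrative filenames used above, filelists/valid.txt would simply contain one name per line:

```
spk1_000002
spk2_xyz
```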

Then you can finally launch train_dec.py with the pre-trained encoder in the logs_enc directory. If you also want to train the encoder yourself (e.g. your language is different from English, or you want to use a dataset richer than LibriTTS), you have to do some additional data preparation.

For training the encoder you'll need two additional subfolders, mels_mode and textgrids, with the following structure: data_dir/mels_mode/spk1/spk1_000001_avgmel.npy, data_dir/mels_mode/spk1/spk1_000002_avgmel.npy, ..., data_dir/mels_mode/spk2/spk2_abc_avgmel.npy, data_dir/mels_mode/spk2/spk2_xyz_avgmel.npy, ..., and data_dir/textgrids/spk1/spk1_000001.TextGrid, data_dir/textgrids/spk1/spk1_000002.TextGrid, ..., data_dir/textgrids/spk2/spk2_abc.TextGrid, data_dir/textgrids/spk2/spk2_xyz.TextGrid, ...
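Again for illustration, with the same example filenames, the extra subfolders for encoder training would look like this:

```
data_dir/
├── mels_mode/
│   ├── spk1/
│   │   ├── spk1_000001_avgmel.npy
│   │   └── spk1_000002_avgmel.npy
│   └── spk2/
│       ├── spk2_abc_avgmel.npy
│       └── spk2_xyz_avgmel.npy
└── textgrids/
    ├── spk1/
    │   ├── spk1_000001.TextGrid
    │   └── spk1_000002.TextGrid
    └── spk2/
        ├── spk2_abc.TextGrid
        └── spk2_xyz.TextGrid
```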

As for the alignment TextGrid files in the textgrids subfolder, please refer to the Montreal Forced Aligner documentation for instructions on how to obtain such alignment files from wavs. To get the average voice mel-spectrograms in the mels_mode subfolder, please run the get_avg_mels.ipynb jupyter notebook.

After this has been done, you can launch train_enc.py to start training your encoder.

Cpgrach commented 1 year ago

Thank you very much for the answer. Can you tell me if there are any encoders for Russian, or datasets on which such an encoder could be trained?

Biyani404198 commented 7 months ago

> (quotes the training instructions from li1jkdaw's reply above)

Hi, I have followed these steps and created the TextGrid files. Now I want to create the mels_mode subdirectory. I am using the get_avg_mels.ipynb jupyter notebook, but I'm only getting the mels_mode and lens dictionaries. There are no further steps or instructions for creating the _avgmel.npy files from these two dictionaries. Can you please help?

li1jkdaw commented 1 month ago

Basically, for each .wav audio file you know which frame corresponds to which phoneme (you can extract this information from the TextGrid file by calculating start_frame and end_frame as in get_avg_mels.ipynb), and then for each frame you replace the mel feature in the _mel.npy file with the average feature of the corresponding phoneme; the mels_mode dictionary contains the mapping {phoneme: its average mel feature}.
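As a minimal sketch of that replacement step (assuming mel-spectrograms are stored with frames along the last axis, that mels_mode maps each phoneme to an average feature of shape (n_mels,), and that intervals is the list of (phoneme, start_frame, end_frame) triples you extract from the TextGrid as in get_avg_mels.ipynb):

```python
import numpy as np

def make_avgmel(mel, intervals, mels_mode):
    """Replace every mel frame with the average mel feature of its phoneme.

    mel:        numpy array, assumed shape (n_mels, n_frames)
    intervals:  list of (phoneme, start_frame, end_frame) tuples obtained from
                the TextGrid (frame indices computed as in get_avg_mels.ipynb)
    mels_mode:  dict {phoneme: average mel feature of shape (n_mels,)}
    """
    avgmel = mel.copy()
    for phoneme, start_frame, end_frame in intervals:
        # Broadcast the phoneme's average feature across its frames.
        avgmel[:, start_frame:end_frame] = np.asarray(mels_mode[phoneme])[:, None]
    return avgmel

# Illustrative usage with the file layout from the instructions above:
# mel = np.load('data_dir/mels/spk1/spk1_000001_mel.npy')
# avgmel = make_avgmel(mel, intervals, mels_mode)
# np.save('data_dir/mels_mode/spk1/spk1_000001_avgmel.npy', avgmel)
```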