Hi, I am trying to reproduce the results presented in the paper "Controllable and Interpretable Singing Voice Decomposition via Assem-VC", using the CSD and NUS-48E datasets as well as custom datasets. The paper states that "all singing voices are split between 1-12 seconds and used for training with corresponding lyrics". I understand that the original .wav files of the datasets need to be split into shorter .wav files before building the metadata files with the format "path_to_wav|transcription|speaker_id". However, I can't find any code in the repository for doing this. How is this splitting process done? Is it done manually for all the datasets?
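For context, here is a rough sketch of what I assumed the splitting might look like: cutting the long recordings on silence and keeping only segments in the 1-12 second range, then writing one metadata line per segment. The use of librosa/soundfile, the silence threshold, and the file layout are my own guesses, not anything I found in the repository:

```python
import os
import librosa
import soundfile as sf

def split_and_write(wav_path, out_dir, speaker_id, transcription,
                    sr=22050, top_db=30, min_len=1.0, max_len=12.0):
    """Split one long recording on silence and keep 1-12 s segments.

    NOTE: this is my assumption of the preprocessing, not code from the
    Assem-VC repo. Segment-level lyrics alignment is not handled here.
    """
    y, _ = librosa.load(wav_path, sr=sr)
    os.makedirs(out_dir, exist_ok=True)
    metadata_lines = []
    for i, (start, end) in enumerate(librosa.effects.split(y, top_db=top_db)):
        segment = y[start:end]
        duration = len(segment) / sr
        if not (min_len <= duration <= max_len):
            continue  # drop segments outside the 1-12 s range from the paper
        seg_path = os.path.join(out_dir, f"{speaker_id}_{i:04d}.wav")
        sf.write(seg_path, segment, sr)
        # The per-segment transcription would still have to be matched to the
        # lyrics somehow; a placeholder (the full transcription) is used here.
        metadata_lines.append(f"{seg_path}|{transcription}|{speaker_id}")
    return metadata_lines
```

Even with something like this, I don't see how the lyrics for each short segment would be obtained automatically, which is why I'm asking whether there is a script I'm missing or whether the splitting and lyric assignment were done by hand.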
Thanks!