lwang114 / UnsupTTS

MIT License
35 stars 4 forks

applying on large single language dataset #1

Open albluc24 opened 1 year ago

albluc24 commented 1 year ago

Hello, I am trying to apply this procedure to a very large corpus (900+ hours of speech). The text corpus is equally large; the speech is theoretically transcribed, but since the transcriptions come from newspaper scraping they are highly impure. My question is: can you give me a rough idea of what I need to edit in the slurm file to achieve this? I am trying to hack something together by looking at the pre-existing project structure, but any help would be great, since I have to pay for the GPU instance and throwing away a model over a stupid mistake would be quite harsh. I am already pretraining a wav2vec 2.0 base model on the raw audio, but I think I can figure out how to plug that in. Thanks!
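For context, the pretraining I'm running is more or less the standard fairseq wav2vec 2.0 recipe; the paths below are placeholders for my own setup, not anything from this repo:

```bash
# Build the tsv manifest from the raw audio (placeholder paths).
python $FAIRSEQ_ROOT/examples/wav2vec/wav2vec_manifest.py /path/to/raw_audio \
    --dest /path/to/manifest --ext wav --valid-percent 0.01

# Pretrain a wav2vec 2.0 base model using the stock LibriSpeech config.
fairseq-hydra-train \
    task.data=/path/to/manifest \
    --config-dir $FAIRSEQ_ROOT/examples/wav2vec/config/pretraining \
    --config-name wav2vec2_base_librispeech
```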

lwang114 commented 1 year ago

First, you need to change the path variables for the packages (KALDI_ROOT, FAIRSEQ_ROOT, KENLM_ROOT, RVAD_ROOT, etc.), and then you may need to modify the dataset paths by setting $css_root. Since our dataset is multilingual and yours is monolingual, you can simply remove the $lang or $lg arguments from all the commands in run_slurm_cpy2.sh, prepare_text_css10.sh and prepare_css10.sh. If you don't want to use G2P, change $trans_type to "char" instead of "phn". Let me know if this is not enough to make it work; I am happy to answer follow-up questions.
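Roughly, the edits at the top of run_slurm_cpy2.sh would look something like the sketch below (the paths are placeholders and the exact variable layout in your copy of the script may differ slightly):

```bash
# Point the toolkit roots at your own installations (placeholder paths).
KALDI_ROOT=/path/to/kaldi
FAIRSEQ_ROOT=/path/to/fairseq
KENLM_ROOT=/path/to/kenlm/build/bin
RVAD_ROOT=/path/to/rVADfast

# Point the dataset root at your monolingual corpus instead of CSS10.
css_root=/path/to/your_corpus

# Use characters instead of phones if you want to skip G2P.
trans_type=char

# Then remove the $lang / $lg arguments from the calls to
# prepare_css10.sh and prepare_text_css10.sh further down in the script,
# since there is only a single language to iterate over.
```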

albluc24 commented 1 year ago

Thanks for the help! I'll let you know as soon as I finish training the w2v model.