Hi, Your understanding of our code and paper is correct. We find that first doing multi-task training on the labeled data "L" and then pre-training on the unlabeled data "M" gives performance similar to joint multi-task learning on both "L" and "M". We chose two-stage pre-training in our implementation: this way, we only need to pre-train a single model on "L" and can use it as the starting point for the second-stage multilingual pre-training.
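For concreteness, the joint objective from formula 6 in the paper can be sketched roughly as follows. This is a generic reading rather than the paper's exact notation, with \alpha standing in for whatever loss weighting the paper uses:

\mathcal{L}_{\text{joint}} = \sum_{(x,y) \in L} \big( \mathcal{L}_{\text{CTC}}(x, y) + \alpha \, \mathcal{L}_{\text{contrastive}}(x) \big) + \sum_{x \in M} \mathcal{L}_{\text{contrastive}}(x)

Two-stage pre-training instead minimizes the first sum on "L" and then continues from that checkpoint, minimizing the second sum on "M".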
Hi @cywang97 ,
Thank you for your response! If it is available, could you please share the multi-task learning code?
I am really intrigued to find out how the multiple batches are coordinated during training. I am a bit confused when it comes to choosing batch sizes for a multi-task learning setup like yours. Suppose we have two datasets, one large-scale and the other much smaller. With similar batch sizes, won't the smaller dataset be iterated over more often than the larger one in each epoch?
Any hints would also be highly appreciated!
Thank You!
Hi @Sreyan88, You can upsample the smaller corpus and use different batch sizes for the two datasets. Since multi-task training requires more GPU memory than contrastive-only training, the batch size for M can be larger than the batch size for L. You can refer to https://drive.google.com/file/d/1gCXeKiaeWfTASPF0VMRLufoloSkPhpHj/view?usp=sharing for using multiple corpora.
You can set up the dataset in this way:

unsup_dataset = FileAudioDataset(...)
sup_dataset = AddTargetDataset(...)
dataset = MultitaskDataset([unsup_dataset, sup_dataset], sample_ratios=[1.0, 1.0])
You can adjust the batch sizes and the sample ratios for each corpus. I hope this can help you.
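In case the link is not accessible, here is a minimal self-contained sketch of how a sample_ratios-style dataset typically works. It is an assumed reimplementation in the spirit of fairseq's ConcatDataset, not the actual shared file:

import bisect

import numpy as np
from torch.utils.data import Dataset

class MultitaskDataset(Dataset):
    """Concatenates datasets, upsampling each one by repeating its
    indices sample_ratios[i] times (same idea as fairseq's ConcatDataset)."""

    def __init__(self, datasets, sample_ratios):
        self.datasets = datasets
        # virtual length of each dataset after upsampling
        sizes = [int(np.ceil(ratio * len(d))) for d, ratio in zip(datasets, sample_ratios)]
        # cumulative sizes map a flat index back to a (dataset, sample) pair
        self.cumulative_sizes = np.cumsum(sizes).tolist()

    def __len__(self):
        return self.cumulative_sizes[-1]

    def _get_dataset_and_sample_index(self, idx):
        dataset_idx = bisect.bisect_right(self.cumulative_sizes, idx)
        if dataset_idx > 0:
            idx -= self.cumulative_sizes[dataset_idx - 1]
        # the modulo wraps repeated (upsampled) indices onto real samples
        return dataset_idx, idx % len(self.datasets[dataset_idx])

    def __getitem__(self, idx):
        dataset_idx, sample_idx = self._get_dataset_and_sample_index(idx)
        sample = self.datasets[dataset_idx][sample_idx]
        sample["dataset_idx"] = dataset_idx
        return sample

With sample_ratios=[1.0, 2.0], for example, every sample of the second corpus is seen roughly twice per epoch, which is how the smaller corpus gets upsampled.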
Hi @cywang97 ,
Thank you for the inputs, that will definitely be helpful. Also, any hints on how to change the criterion file? Right now it expects a single input and computes CTC + contrastive from there.
How do we let the criterion know which batches to compute only CTC on and which to compute CTC + contrastive on?
You can add a flag in the function "collater" of each dataset to indicate which task should be conducted.
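For example, here is a rough sketch of that flag idea; the wrapper class, key names, and criterion structure are illustrative assumptions, not the repo's actual code:

class TaskFlagDataset:
    """Wraps a dataset so its collater tags every batch with a task flag."""

    def __init__(self, dataset, task):
        self.dataset = dataset
        self.task = task  # e.g. "ctc_contrastive" for L, "contrastive" for M

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        return self.dataset[idx]

    def collater(self, samples):
        batch = self.dataset.collater(samples)
        batch["task"] = self.task  # read by the criterion below
        return batch

# Inside the criterion's forward(), branch on the flag:
#     loss = contrastive_loss(net_output, sample)
#     if sample["task"] == "ctc_contrastive":
#         loss = loss + ctc_weight * ctc_loss(net_output, sample)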
Hi @cywang97 ,
The MultitaskDataset shared above still seems to return just one sample from a single dataset at a time. Can you please point me to the piece of code that returns samples from the two different datasets together? Thank You!
def __getitem__(self, idx):
    # Map the flat index to a (dataset, sample) pair
    dataset_idx, sample_idx = self._get_dataset_and_sample_index(idx)
    sample = self.datasets[dataset_idx][sample_idx]
    # Tag the sample with its source dataset so downstream code can tell
    # which corpus (and hence which loss) it belongs to
    sample["dataset_idx"] = dataset_idx
    return sample
Did you intend to give me the multicorpus dataset by any chance?
In my experiments, one batch always contains samples from a single dataset.
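That is the expected behaviour with this kind of concatenated dataset: the flat index space is chunked per sub-dataset before batching, so no batch crosses a dataset boundary. A minimal sketch of that sampler logic, again an assumption rather than the repo's exact code:

import numpy as np

def batches_by_dataset(cumulative_sizes, batch_sizes):
    """Chunk the flat index space so every batch stays inside one
    sub-dataset; batch_sizes[i] is the batch size for dataset i."""
    batches = []
    start = 0
    for i, end in enumerate(cumulative_sizes):
        for j in range(start, end, batch_sizes[i]):
            batches.append(np.arange(j, min(j + batch_sizes[i], end)))
        start = end
    return batches

# batches_by_dataset([10, 14], [4, 2])
# -> [0..3], [4..7], [8, 9], [10, 11], [12, 13]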
Hi there!
Great repo and paper! I have a question that may come from a mistake in my understanding of the paper/code. After reading through both, my understanding is:
You are first doing CTC + contrastive training on the labeled data "L" and then, optionally, pre-training on "M". However, from your paper I understood that these should be solved as a single task with joint multi-task training (formula 6 in the paper). This does not seem to be reflected in the code.
I would be glad if you could help. Thank you!