microsoft / UniSpeech

UniSpeech - Large Scale Self-Supervised Learning for Speech

Formula 6 in paper #16

Closed Sreyan88 closed 2 years ago

Sreyan88 commented 2 years ago

Hi there!

Great repo and paper. I have a question that may come down to a mistake in my understanding of the paper/code. After reading through both, my understanding is:

You first do CTC + Contrastive training on the labeled data "L" and then, optionally, pre-train on "M". However, from your paper I understand that these should be solved as a single task with joint multi-task training (Formula 6 in the paper). This is not reflected in the code.

I would be glad if you could help. Thank you!

cywang97 commented 2 years ago

Hi, your understanding of our code and paper is correct. We find that first doing multi-task training on the labeled data "L" and then pre-training on the unlabeled data "M" gives performance similar to joint multi-task learning on both "L" and "M". We chose two-stage pre-training in our implementation: this way we only need to pre-train a single model on "L" and can reuse it for the second-stage multilingual pre-training.

Sreyan88 commented 2 years ago

Hi @cywang97 ,

Thank you for your response! If it is available, could you please share the multi-task learning code?

I am really intrigued to find out how the batches from the two datasets are coordinated during training. I am a bit confused about choosing batch sizes for a multi-task setup like yours. Suppose we have two datasets, one large-scale and the other much smaller. With similar batch sizes, wouldn't the smaller dataset be iterated over more often than the larger one in each epoch?

Any hints would also be highly appreciated!

Thank You!

cywang97 commented 2 years ago

Hi @Sreyan88, you can upsample the smaller corpus and use different batch sizes for the two datasets. Since multi-task training requires more GPU memory than contrastive-only training, the batch size for M can be larger than the batch size for L. You can refer to https://drive.google.com/file/d/1gCXeKiaeWfTASPF0VMRLufoloSkPhpHj/view?usp=sharing for using multiple corpora.

You can set up the dataset in this way:

    unsup_dataset = FileAudioDataset(...)
    sup_dataset = AddTargetDataset(...)
    dataset = MultitaskDataset([unsup_dataset, sup_dataset], sample_ratios=[1.0, 1.0])

You can adjust the batch sizes and the sample ratios for each corpus. I hope this helps.
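For reference, here is a minimal sketch of how a multi-corpus dataset with `sample_ratios` typically works (it roughly follows the pattern of fairseq's `ConcatDataset`; the class name and internals below are assumptions, not the linked `MultitaskDataset` itself):

```python
import bisect

class UpsampledConcatSketch:
    """Concatenates several datasets and upsamples each one by (roughly)
    repeating its index range `ratio` times. Illustrative only."""

    def __init__(self, datasets, sample_ratios):
        self.datasets = datasets
        self.sample_ratios = [max(1, int(r)) for r in sample_ratios]
        # Cumulative upsampled sizes, e.g. [len(M), len(M) + 10 * len(L)].
        self.cumulative_sizes, total = [], 0
        for d, r in zip(datasets, self.sample_ratios):
            total += len(d) * r
            self.cumulative_sizes.append(total)

    def __len__(self):
        return self.cumulative_sizes[-1]

    def _get_dataset_and_sample_index(self, idx):
        dataset_idx = bisect.bisect_right(self.cumulative_sizes, idx)
        offset = self.cumulative_sizes[dataset_idx - 1] if dataset_idx > 0 else 0
        # The modulo maps an upsampled index back onto a real sample.
        return dataset_idx, (idx - offset) % len(self.datasets[dataset_idx])
```

Upsampling the smaller labeled corpus (a ratio greater than 1) makes both corpora be visited at comparable rates within one pass over the concatenated dataset, which addresses the batch-balance question above.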

Sreyan88 commented 2 years ago

Hi @cywang97 ,

Thank you for the inputs, they will definitely be helpful. Also, any hints on how to change the criterion file? Right now it expects a single input and computes CTC + Contrastive from it.

How do we let the criterion know which batches to compute only the contrastive loss on and which to compute CTC + Contrastive on?

cywang97 commented 2 years ago

You can add a flag in the collater function of each dataset to indicate which loss should be computed.
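A minimal sketch of that idea (the wrapper class, flag key, and helper below are illustrative assumptions, not the actual UniSpeech code):

```python
class TaskFlaggedDataset:
    """Delegates to an underlying fairseq-style dataset and tags every
    collated batch with the task it should run. Illustrative only."""

    def __init__(self, dataset, task):
        self.dataset = dataset
        self.task = task  # e.g. "contrastive" for M, "ctc_contrastive" for L

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        return self.dataset[idx]

    def collater(self, samples):
        batch = self.dataset.collater(samples)
        batch["task"] = self.task
        return batch


def combine_losses(contrastive_loss, ctc_loss, task):
    """Inside the criterion: pick the losses according to the batch's flag."""
    if task == "ctc_contrastive":          # labeled corpus L
        return contrastive_loss + ctc_loss
    return contrastive_loss                # unlabeled corpus M
```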

Sreyan88 commented 2 years ago

Hi @cywang97 ,

The MultitaskDataset shared above still seems to return just one sample from one dataset at a time. Could you please point me to the piece of code that draws samples from the two different datasets? Thank you!

    def __getitem__(self, idx):
        dataset_idx, sample_idx = self._get_dataset_and_sample_index(idx)
        sample = self.datasets[dataset_idx][sample_idx]
        sample["dataset_idx"] = dataset_idx
        return sample

Did you intend to give me the multicorpus dataset by any chance?

cywang97 commented 2 years ago

In my experiments, one batch always contains samples from a single dataset.
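One way to achieve that is to form batches within each corpus separately and then shuffle the resulting list of batches, so every batch stays homogeneous while the corpora are interleaved at the batch level. A minimal sketch (the function and the batching strategy are assumptions, not necessarily how the linked code does it):

```python
import random
import numpy as np

def homogeneous_batches(dataset_lengths, offsets, batch_sizes, seed=1):
    """dataset_lengths: samples per corpus; offsets: start index of each corpus
    inside the concatenated dataset; batch_sizes: per-corpus batch size."""
    rng = np.random.default_rng(seed)
    batches = []
    for length, offset, bsz in zip(dataset_lengths, offsets, batch_sizes):
        indices = offset + rng.permutation(length)
        # Chunk this corpus's indices into batches of its own batch size.
        batches.extend(indices[i:i + bsz] for i in range(0, length, bsz))
    random.Random(seed).shuffle(batches)  # interleave corpora batch-by-batch
    return batches

# Example: a large unlabeled corpus M (batch size 16) and a smaller labeled
# corpus L (batch size 8) living at indices [0, 1000) and [1000, 1200).
batches = homogeneous_batches([1000, 200], offsets=[0, 1000], batch_sizes=[16, 8])
```

Because every batch then comes from a single corpus, the `dataset_idx` (or a task flag set in the collater) applies uniformly to the whole batch, so the criterion can decide once per batch which losses to compute.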