microsoft / SpeechT5

Unified-Modal Speech-Text Pre-Training for Spoken Language Processing
MIT License
1.16k stars 113 forks

Combining speech and text in the encoder #13

Closed: jacqle closed this 2 years ago

jacqle commented 2 years ago

Hi,

Do you think it would be possible to combine both speech and text as input to the encoder? I'm looking to then decode text based on this multimodal input. Should I be looking at the MultitaskDataset? Would the s2t task work for this?

Thanks.

Ajyy commented 2 years ago

Hi,

If you mean combining unpaired text and speech, you can combine the speech and text batches in the MultitaskDataset. More specifically, sample the same number of batches from the speech data and the text data, then merge them pairwise. Each combined batch will then contain both speech and text data, and it is passed to the model as a whole. You can also add the corresponding target to the speech batch (as in s2t).
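A minimal sketch of the batch pairing described above, written in plain Python rather than against the repository's actual MultitaskDataset code. The function name `combine_batches`, the dictionary keys, and the sample data are all hypothetical and only illustrate the idea of drawing the same number of batches from each modality and merging them one-to-one.

```python
from typing import Any, Dict, Iterator, List

Batch = Dict[str, Any]


def combine_batches(
    speech_batches: List[Batch], text_batches: List[Batch]
) -> Iterator[Batch]:
    """Yield one combined batch per (speech, text) pair.

    Hypothetical helper: not part of SpeechT5 or fairseq.
    """
    # zip() stops at the shorter input, so both modalities
    # contribute the same number of batches.
    for speech, text in zip(speech_batches, text_batches):
        yield {"speech": speech, "text": text}


# Toy data standing in for real collated batches (assumed layout).
speech_batches = [
    {"source": [[0.1, 0.2]], "target": ["hello"]},  # speech with an s2t-style target
    {"source": [[0.3]], "target": ["hi"]},
]
text_batches = [
    {"source": ["some unpaired text"]},
    {"source": ["more text"]},
    {"source": ["left over, dropped by zip"]},
]

combined = list(combine_batches(speech_batches, text_batches))
```

In this sketch the surplus text batch is simply dropped. Whether to drop, cycle, or resample the longer side is a design choice the comment above does not specify.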

If you mean parallel text and speech as input to the encoder, you can add an extra text input to the s2t task (by adding one more column to the tsv file).
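As a rough illustration of the extra-column idea, the sketch below reads a tab-separated manifest where each row carries an additional text field. The three-column layout (`audio_path`, `n_frames`, `extra_text`), the `Example` type, and `read_manifest` are assumptions made for this example; the actual tsv format used by the s2t task may differ, and the dataset class would still need to be changed to consume the new field.

```python
import csv
import io
from typing import List, NamedTuple, TextIO


class Example(NamedTuple):
    audio_path: str
    n_frames: int
    extra_text: str  # hypothetical additional column paired with the audio


def read_manifest(handle: TextIO) -> List[Example]:
    """Parse a tab-separated manifest with an assumed three-column layout."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    examples = []
    for row in reader:
        if len(row) != 3:
            raise ValueError(f"expected 3 tab-separated columns, got {row!r}")
        path, n_frames, text = row
        examples.append(Example(path, int(n_frames), text))
    return examples


# In-memory stand-in for a tsv file on disk (made-up rows).
manifest = io.StringIO(
    "utt1.wav\t16000\tcontext for the first clip\n"
    "utt2.wav\t32000\tcontext for the second clip\n"
)
examples = read_manifest(manifest)
```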

Thanks.