Combining speech and text in the encoder

microsoft / SpeechT5

Unified-Modal Speech-Text Pre-Training for Spoken Language Processing

MIT License

Hi,

If you mean combining unpaired text and speech, you can merge the speech and text batches in MultitaskDataset. More specifically, sample the same number of batches from the speech data and from the text data, then combine them pairwise. Each combined batch then contains both speech and text data, and is passed to the model as a whole. You can also add the corresponding target to the speech batch (as in s2t).
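
As a rough illustration of the pairing step, here is a minimal sketch in plain Python. It does not use the actual MultitaskDataset code; the function name `combine_batches`, the `"speech"`/`"text"` keys, and the toy batches are all assumptions made up for this example.

```python
import random


def combine_batches(speech_batches, text_batches, seed=0):
    """Pair up an equal number of speech and text batches (hypothetical helper)."""
    rng = random.Random(seed)
    # Use the same number of batches from each modality.
    n = min(len(speech_batches), len(text_batches))
    speech_sample = rng.sample(speech_batches, n)
    text_sample = rng.sample(text_batches, n)
    # Each combined batch carries both modalities side by side.
    return [{"speech": s, "text": t} for s, t in zip(speech_sample, text_sample)]


# Toy stand-ins for real batches; the real ones would hold tensors.
speech_batches = [["utt1", "utt2"], ["utt3", "utt4"], ["utt5", "utt6"]]
text_batches = [["sent1", "sent2"], ["sent3", "sent4"]]

combined = combine_batches(speech_batches, text_batches)
for batch in combined:
    print(batch["speech"], batch["text"])
```

In a real setup the model's forward pass would need to read both entries of each combined batch; how that is wired up depends on the task code.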

For parallel text and speech as input to the encoder, you can add an extra text input to the s2t task (by adding one more column to the tsv file).
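
To make the extra-column idea concrete, here is a small sketch that reads a tsv with one additional text column. The column names (`id`, `audio`, `tgt_text`, `src_text`) and the manifest layout are assumptions for illustration only, not the format the repository's s2t task actually expects, so the real loader would need to be adapted accordingly.

```python
import csv
import io

# Hypothetical manifest: the last column is the extra text input for the encoder.
MANIFEST = (
    "id\taudio\ttgt_text\tsrc_text\n"
    "utt1\tutt1.wav\thello world\thello world\n"
    "utt2\tutt2.wav\tgood morning\tgood morning\n"
)


def read_manifest(handle):
    """Parse the tsv and keep the extra text column next to the audio path."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        {
            "id": row["id"],
            "audio": row["audio"],
            "target": row["tgt_text"],
            "extra_text": row["src_text"],  # the added column
        }
        for row in reader
    ]


samples = read_manifest(io.StringIO(MANIFEST))
for sample in samples:
    print(sample["id"], sample["audio"], sample["extra_text"])
```

The dataset class would then return `extra_text` along with the audio so the encoder can receive both inputs.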

Thanks.

Combining speech and text in the encoder #13