Closed: jacqle closed this 2 years ago
Hi,
If you mean combining the unpaired text and speech, you can combine the speech and text batches in the `MultitaskDataset`. More specifically, sample the same number of batches from the speech data and from the text data, then merge each pair of batches. Each combined batch will then contain both speech and text data, which is passed to the model. You can also add the corresponding target to the speech batch (as in `s2t`).
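A minimal sketch of that pairing step (all names here are illustrative, not the actual `MultitaskDataset` API): sample the same number of batches from each modality, then merge them pairwise so every combined batch carries both speech and text fields.

```python
import random

def combine_batches(speech_batches, text_batches, seed=0):
    """Hypothetical sketch: draw the same number of batches from the
    speech and text data, then merge them pairwise so each combined
    batch contains both modalities."""
    rng = random.Random(seed)
    n = min(len(speech_batches), len(text_batches))
    speech = rng.sample(speech_batches, n)
    text = rng.sample(text_batches, n)
    combined = []
    for s, t in zip(speech, text):
        batch = dict(s)        # speech fields (audio features, s2t target, ...)
        batch["text"] = t      # attach the unpaired text batch
        combined.append(batch)
    return combined
```

The model then sees both modalities in every step, rather than alternating pure-speech and pure-text batches.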
For parallel text and speech as input to the encoder, you can add an extra text input to the `s2t` task (by adding one more column to the tsv file).
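For illustration, the tsv manifest might then look like this (column names and paths are made up for the example; use whatever field names your `s2t` config expects):

```
id	audio	n_frames	tgt_text	src_text
utt1	/data/utt1.wav	480	hello world	hello world
utt2	/data/utt2.wav	512	good morning	good morning
```

Here `src_text` is the extra column carrying the parallel text input, while the other columns are the usual speech-to-text fields.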
Thanks.
Hi,
Do you think it would be possible to combine both speech and text as input to the encoder? I'm looking to then decode text based on this multimodal input. Should I be looking at the `MultitaskDataset`? Would the `s2t` task work for this? Thanks.