use align hard as target if we use aligner

lucidrains / voicebox-pytorch

Implementation of Voicebox, new SOTA Text-to-speech network from MetaAI, in Pytorch

MIT License


manmay-nakhashi closed this issue 1 year ago

manmay-nakhashi commented 1 year ago

@lucidrains we can call the duration predictor now and expand the phoneme ids, as we are doing in naturalspeech2

lucidrains commented 1 year ago

yes indeed, will port over the logic from NS2 ~Monday~ Tuesday (Labor day weekend), and will also add the Spear-TTS text-to-semantic way of alignment

manmay-nakhashi commented 1 year ago

`oversampled_ids = ids.repeat_interleave(durations)` we just have to repeat each token by its predicted duration.
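
For a single unbatched sequence, the call above can be illustrated with a minimal sketch. The toy values for `ids` and `durations` are assumptions made up for this example, not data from the repository:

```python
import torch

# Hypothetical toy inputs: four phoneme ids and a predicted frame count for each.
ids = torch.tensor([7, 3, 9, 2])
durations = torch.tensor([2, 1, 3, 0])

# Each id is repeated as many times as its duration; a duration of 0 drops the token.
oversampled_ids = ids.repeat_interleave(durations)

print(oversampled_ids.tolist())  # [7, 7, 3, 9, 9, 9]
```

The output length equals `durations.sum()`, which is why sequences in a batch generally end up with different lengths.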

lucidrains commented 1 year ago

@manmay-nakhashi nice! i didn't know about this function 😃 🙏

do you know if it can work for batches of durations and ids?

manmay-nakhashi commented 1 year ago

@lucidrains i think we have to loop over the batch or vectorize this.
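
The "loop over" option could look roughly like the sketch below. The helper name `expand_by_duration_loop`, the `pad_id` parameter, and the toy tensors are hypothetical and are not taken from voicebox-pytorch or naturalspeech2; only `repeat_interleave` and `pad_sequence` are real PyTorch calls.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def expand_by_duration_loop(ids, durations, pad_id=0):
    """Hypothetical helper: expand each row separately, then pad to a common length."""
    # ids, durations: (batch, n) integer tensors
    expanded = [row.repeat_interleave(dur) for row, dur in zip(ids, durations)]
    lengths = torch.tensor([e.numel() for e in expanded], device=ids.device)
    padded = pad_sequence(expanded, batch_first=True, padding_value=pad_id)
    # True where a position holds a real frame, False where it is padding
    mask = torch.arange(padded.shape[1], device=ids.device)[None, :] < lengths[:, None]
    return padded, mask

# Assumed toy batch: two sequences whose durations sum to 5 and 4 frames.
ids = torch.tensor([[5, 6, 7], [8, 9, 0]])
durations = torch.tensor([[2, 1, 2], [1, 3, 0]])

padded, mask = expand_by_duration_loop(ids, durations, pad_id=-1)
print(padded.tolist())  # [[5, 5, 6, 7, 7], [8, 9, 9, 9, -1]]
```

This is simple to reason about, at the cost of a Python loop over the batch.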

lucidrains commented 1 year ago

no worries, i have something makeshift here; still need to mask out for variable lengths (durations do not all sum to the same value)
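
One possible loop-free variant with the masking described above is sketched here. It is not the makeshift code referred to in this comment; the function name `expand_by_duration`, the `pad_id` parameter, and the toy inputs are assumptions for illustration. It trades memory for simplicity by building a `(batch, frames, tokens)` comparison tensor.

```python
import torch

def expand_by_duration(ids, durations, pad_id=0):
    """Hypothetical vectorized expansion of (batch, n) ids by (batch, n) durations."""
    ends = durations.cumsum(dim=-1)   # (b, n): first frame index after each token
    total = ends[:, -1]               # (b,): frames per sequence
    max_len = int(total.max())
    frames = torch.arange(max_len, device=ids.device)[None, :]  # (1, max_len)

    # A frame belongs to token i when exactly i tokens have already ended before it.
    token_idx = (frames[:, :, None] >= ends[:, None, :]).sum(dim=-1)  # (b, max_len)
    mask = frames < total[:, None]    # False past each sequence's true length

    # Padded positions would index one past the last token, so clamp before gather.
    token_idx = token_idx.clamp(max=ids.shape[1] - 1)
    expanded = ids.gather(1, token_idx).masked_fill(~mask, pad_id)
    return expanded, mask

# Assumed toy batch: durations sum to 5 and 4, so the second row needs masking.
ids = torch.tensor([[5, 6, 7], [8, 9, 0]])
durations = torch.tensor([[2, 1, 2], [1, 3, 0]])

expanded, mask = expand_by_duration(ids, durations, pad_id=-1)
print(expanded.tolist())  # [[5, 5, 6, 7, 7], [8, 9, 9, 9, -1]]
```

The returned `mask` can then be used to exclude padded frames from any downstream loss or attention.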

lucidrains commented 1 year ago

will work more on it next week