Closed manmay-nakhashi closed 1 year ago
yes indeed, will port over the logic from NS2 ~Monday~ Tuesday (Labor Day weekend), and will also add the Spear-TTS text-to-semantic way of alignment
`oversampled_ids = ids.repeat_interleave(durations)` we just have to repeat each token according to the duration output.
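for reference, a minimal sketch of what that one-liner does on a single unbatched sequence. the token ids and durations below are made-up toy values, not outputs from the actual model:

```python
import torch

# toy values for illustration only (not from the repo)
ids = torch.tensor([5, 9, 2])        # one phoneme/token id per position
durations = torch.tensor([2, 1, 3])  # integer frame count per token

# repeat each id by its own duration: 5 twice, 9 once, 2 three times
oversampled_ids = ids.repeat_interleave(durations)

print(oversampled_ids)  # tensor([5, 5, 9, 2, 2, 2])
```

the output length equals `durations.sum()`, which is why sequences in a batch end up with different lengths.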
@manmay-nakhashi nice! i didn't know about this function 😃 🙏
do you know if it can work for batches of durations and ids?
@lucidrains i think we have to either loop over the batch or vectorize this.
no worries, i have something makeshift here, still need to mask out for variable lengths (durations do not all sum to the same value)
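one possible way to handle the batched case, sketched under assumptions: loop over the batch, expand each row with `repeat_interleave`, pad to the longest result, and build a boolean mask for the valid positions. `expand_batch` and `pad_id` are hypothetical names for illustration, and this is not the makeshift version mentioned above:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def expand_batch(ids, durations, pad_id=0):
    """Hypothetical helper: ids and durations are (batch, seq) integer tensors."""
    # expand each row on its own, since rows expand to different lengths
    expanded = [row.repeat_interleave(dur) for row, dur in zip(ids, durations)]
    # pad all rows to the longest expanded length
    padded = pad_sequence(expanded, batch_first=True, padding_value=pad_id)
    # expanded length of each row is the sum of its durations
    lengths = durations.sum(dim=-1)
    positions = torch.arange(padded.shape[1], device=ids.device)
    mask = positions[None, :] < lengths[:, None]  # True where the frame is real
    return padded, mask

# toy batch for illustration only
ids = torch.tensor([[1, 2, 3], [4, 5, 6]])
durations = torch.tensor([[1, 2, 1], [1, 1, 1]])
padded, mask = expand_batch(ids, durations)
print(padded)  # tensor([[1, 2, 2, 3], [4, 5, 6, 0]])
print(mask)    # tensor([[True, True, True, True], [True, True, True, False]])
```

the python loop is the simple option. a fully vectorized version is possible but not shown here.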
will work more on it next week
@lucidrains we can call the duration predictor now and expand phoneme ids as we are doing in naturalspeech2