Closed · manmay-nakhashi closed this 1 year ago
@lucidrains please review this PR. i have added a phoneme aligner, a compute-pitch function in PyTorch, losses for the aligner, and expand-and-combine functions for durations and pitch.
i'm combing the paper for mentions of the AlignerNet you brought in, but i can't seem to find it
how did you conclude it is being used?
@lucidrains all non-autoregressive models need an alignment to guide speech generation from phonemes. since this model is non-autoregressive, i concluded that we need an aligner. there are multiple ways to get the alignments: 1) MFA (Montreal Forced Aligner), 2) monotonic alignment search, 3) the aligner network from NVIDIA, which is under the MIT license (they claim, and it holds up in practice, that it gives better alignment when the audio gets long). the aligner network is trained jointly with the TTS model, so it learns durations along the way.
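For readers following along, the "expand" step for durations mentioned above can be sketched with `torch.repeat_interleave` (a minimal illustration under assumed shapes, not the PR's actual code; the function name is hypothetical):

```python
import torch

def expand_by_durations(phoneme_feats, durations):
    """Repeat each phoneme feature vector by its predicted duration.

    phoneme_feats: (num_phonemes, dim) tensor
    durations:     (num_phonemes,) integer tensor of frame counts
    Returns a (sum(durations), dim) frame-level tensor.
    """
    return torch.repeat_interleave(phoneme_feats, durations, dim=0)

feats = torch.randn(3, 8)       # 3 phonemes, 8-dim features
durs = torch.tensor([2, 1, 3])  # frames per phoneme
frames = expand_by_durations(feats, durs)
print(frames.shape)             # torch.Size([6, 8])
```

Pitch values predicted per phoneme can be expanded to frame level the same way before being combined with the phoneme features.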
@manmay-nakhashi thank you for your explanation! i believe you are correct that NAR solutions will still require alignment. an aligned conditioning signal was also needed for another NAR-based TTS solution
let's try to get this merged by next week's end very latest! (i'll read that nvidia paper tomorrow morning)
thank you for the PR! :pray:
@lucidrains can we merge this now?
@manmay-nakhashi oh hey, yea, but i left a few comments that were not addressed
@lucidrains i have resolved all the comments
@manmay-nakhashi thanks! there's one more comment though, regarding the lack of a sqrt for the L2 distance. maybe i misunderstood something
could you also add a small test main script within the files that runs without error?

ex.

```python
if __name__ == '__main__':
    align = AlignNet(...)
    align(mock_tensor)
```
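For context on the sqrt comment above, this is the distinction being discussed (a minimal sketch with made-up tensors, not the PR's code):

```python
import torch

a = torch.randn(4, 16)
b = torch.randn(4, 16)

squared_l2 = ((a - b) ** 2).sum(dim=-1)  # squared L2: no sqrt
true_l2 = squared_l2.sqrt()              # true L2 (Euclidean) distance

# torch.cdist with p=2 returns the true (square-rooted) pairwise L2
# distance, so its diagonal matches true_l2, not squared_l2
pairwise = torch.cdist(a, b, p=2)
print(torch.allclose(pairwise.diagonal(), true_l2, atol=1e-4))
```

Whether the aligner's soft distance needs the sqrt depends on how the loss is defined; the point of the comment is just that the two quantities are not interchangeable.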
@lucidrains PR is ready, tell me if i need to change anything
@manmay-nakhashi thanks for the continuous polish
could you possibly remove all the print statements? also, i'm wondering why the maximum path function could not be done on the GPU?
@lucidrains fixed
@manmay-nakhashi in Coqui TTS there is this comment https://github.com/coqui-ai/TTS/blob/23a7a9a3633ee00e5bcd329d5f15b8c3f8971f8d/TTS/tts/utils/helpers.py#LL199C5-L199C74 but I haven't evaluated it myself.
@lexkoro i'll run the dummy loop and check
@lexkoro @lucidrains numpy is faster: it completes 1000 iterations in 19 s while torch takes 29 s. i think it's because of gradient tracking :thinking:
@manmay-nakhashi if the hard path does not need gradients, you can wrap the function with the `torch.no_grad` decorator
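As a minimal sketch of that suggestion (the function body here is a greedy placeholder, not the PR's actual maximum-path algorithm):

```python
import torch

@torch.no_grad()  # disables autograd tracking inside the function
def hard_alignment_path(soft_attn):
    # Placeholder: pick the argmax phoneme per frame and one-hot it.
    idx = soft_attn.argmax(dim=-1)
    hard = torch.zeros_like(soft_attn)
    hard.scatter_(-1, idx.unsqueeze(-1), 1.0)
    return hard

attn = torch.rand(2, 5, 7, requires_grad=True)
path = hard_alignment_path(attn)
print(path.requires_grad)  # False: no gradients are tracked
```

Skipping autograd bookkeeping is where the speedup comes from; the remaining gap versus numpy is likely per-element tensor-indexing overhead in the dynamic-programming loop itself.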
@lucidrains with torch.no_grad() i get a limited speedup, still not as fast as numpy: 1000 iterations came down to 24 sec.
@manmay-nakhashi good enough! we can always speed it up later!
@manmay-nakhashi thank you for pressing this PR!
@lucidrains we don't need to use the binary loss, only the aligner loss and duration loss.
@manmay-nakhashi could you link to the line of code you are looking at?
@manmay-nakhashi got it, thanks!
this module helps to learn phoneme alignment.