Closed · manmay-nakhashi closed this 1 year ago
@lucidrains please review this PR. i have added a phoneme aligner, a compute-pitch function in PyTorch, losses for the aligner, and expand-and-combine functions for durations and pitch.
i'm combing the paper for mentions of the AlignerNet you brought in, but i can't seem to find it
how did you conclude it is being used?
@lucidrains all non-autoregressive models need an alignment to guide speech generation from phonemes. since this model is non-autoregressive, i concluded that we need an aligner. there are multiple ways to get the alignments: 1) MFA (Montreal Forced Aligner), 2) monotonic alignment search, 3) the aligner network from NVIDIA, which is under the MIT license (they claim, and it holds up in practice, that it gives better alignment when the audio gets long). the aligner network is trained jointly with the TTS model, so it learns durations along the way.
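For readers following along, the "expand" step for durations mentioned above can be sketched with `torch.repeat_interleave` (a minimal illustration under assumed shapes, not the PR's actual code; the function name is hypothetical):

```python
import torch

def expand_by_durations(phoneme_feats, durations):
    """Repeat each phoneme feature vector by its predicted duration.

    phoneme_feats: (num_phonemes, dim) tensor
    durations:     (num_phonemes,) integer tensor of frame counts
    Returns a (sum(durations), dim) frame-level tensor.
    """
    return torch.repeat_interleave(phoneme_feats, durations, dim=0)

feats = torch.randn(3, 8)       # 3 phonemes, 8-dim features
durs = torch.tensor([2, 1, 3])  # frames per phoneme
frames = expand_by_durations(feats, durs)
print(frames.shape)             # torch.Size([6, 8])
```

Pitch values predicted per phoneme can be expanded to frame level the same way before being combined with the phoneme features.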
@manmay-nakhashi thank you for your explanation! i believe you are correct that NAR solutions will still require alignment. an aligned conditioning signal was also needed for another NAR-based TTS solution
let's try to get this merged by next week's end very latest! (i'll read that nvidia paper tomorrow morning)
thank you for the PR! :pray:
@lucidrains can we merge this now?
@manmay-nakhashi oh hey, yea, but i left a few comments that were not addressed
@lucidrains i have resolved all the comments
@manmay-nakhashi thanks! there's one more comment though, regarding the lack of a sqrt for the L2 distance. maybe i misunderstood something
could you also add a small test main script within the files that runs without error?

ex.

```python
if __name__ == '__main__':
    align = AlignNet(...)
    align(mock_tensor)
```
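For context on the sqrt comment above, this is the distinction being discussed (a minimal sketch with made-up tensors, not the PR's code):

```python
import torch

a = torch.randn(4, 16)
b = torch.randn(4, 16)

squared_l2 = ((a - b) ** 2).sum(dim=-1)  # squared L2: no sqrt
true_l2 = squared_l2.sqrt()              # true L2 (Euclidean) distance

# torch.cdist with p=2 returns the true (square-rooted) pairwise L2
# distance, so its diagonal matches true_l2, not squared_l2
pairwise = torch.cdist(a, b, p=2)
print(torch.allclose(pairwise.diagonal(), true_l2, atol=1e-4))
```

Whether the aligner's soft distance needs the sqrt depends on how the loss is defined; the point of the comment is just that the two quantities are not interchangeable.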
@lucidrains PR is ready, tell me if i need to change anything
@manmay-nakhashi thanks for the continuous polish
could you possibly remove all the print statements? also, i'm wondering why the maximum path function could not be done on the GPU?
@lucidrains fixed
@manmay-nakhashi in Coqui TTS there is this comment https://github.com/coqui-ai/TTS/blob/23a7a9a3633ee00e5bcd329d5f15b8c3f8971f8d/TTS/tts/utils/helpers.py#LL199C5-L199C74 but I haven't evaluated it myself.
@lexkoro i'll run the dummy loop and check
@lexkoro @lucidrains numpy is faster: it completes 1000 iterations in 19 s while torch takes 29 s. i think it's because of gradient tracking :thinking:
@manmay-nakhashi if the hard path does not need gradients, you can wrap the function with the `torch.no_grad` decorator
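As a minimal sketch of that suggestion (the function body here is a greedy placeholder, not the PR's actual maximum-path algorithm):

```python
import torch

@torch.no_grad()  # disables autograd tracking inside the function
def hard_alignment_path(soft_attn):
    # Placeholder: pick the argmax phoneme per frame and one-hot it.
    idx = soft_attn.argmax(dim=-1)
    hard = torch.zeros_like(soft_attn)
    hard.scatter_(-1, idx.unsqueeze(-1), 1.0)
    return hard

attn = torch.rand(2, 5, 7, requires_grad=True)
path = hard_alignment_path(attn)
print(path.requires_grad)  # False: no gradients are tracked
```

Skipping autograd bookkeeping is where the speedup comes from; the remaining gap versus numpy is likely per-element tensor-indexing overhead in the dynamic-programming loop itself.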
@lucidrains with torch.no_grad() i get a limited speedup, still not as fast as numpy: 1000 iterations came down to 24 sec.
@manmay-nakhashi good enough! we can always speed it up later!
@manmay-nakhashi thank you for pressing this PR!
@lucidrains we don't need to use the binary loss, only the aligner loss and duration loss.
@manmay-nakhashi could you link to the line of code you are looking at?
@manmay-nakhashi got it, thanks!
this module helps to learn phoneme alignment.