lucidrains / voicebox-pytorch

Implementation of Voicebox, new SOTA Text-to-speech network from MetaAI, in Pytorch
MIT License

input dimension and model dimension can be different #7

Closed by yzmyyff 1 year ago

yzmyyff commented 1 year ago

In the paper, the input to the audio model is an 80-dim log mel spectrogram, while the model dimension is a separate hyperparameter that takes different values across experiments. In our implementation, however, these two values are merged into a single argument:

VoiceBox(..., dim=, ...)

Can they be separated?
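
To make the request concrete, here is a minimal sketch of one way the two sizes could be decoupled: a linear projection maps the input features up to the model width and another maps back down on the way out. The class `ProjectedBackbone` and the argument name `dim_in` are hypothetical and are not this repository's API; the `nn.Identity` backbone only stands in for the real network.

```python
import torch
from torch import nn


class ProjectedBackbone(nn.Module):
    """Hypothetical wrapper, NOT the VoiceBox class from this repo.

    dim_in: feature size of the input (e.g. 80 for log mel)
    dim:    internal model width (a free hyperparameter)
    """

    def __init__(self, dim_in: int, dim: int):
        super().__init__()
        # Project input features up to the model dimension.
        self.to_model = nn.Linear(dim_in, dim)
        # Placeholder for the actual network operating at width `dim`.
        self.backbone = nn.Identity()
        # Project back down so the output matches the input feature size.
        self.to_out = nn.Linear(dim, dim_in)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim_in)
        h = self.to_model(x)
        h = self.backbone(h)
        return self.to_out(h)


model = ProjectedBackbone(dim_in=80, dim=512)
mel = torch.randn(2, 100, 80)  # 2 utterances, 100 frames, 80 mel bins
out = model(mel)
print(out.shape)
```

With this layout the model width can change between experiments while the input stays at 80 mel bins.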

lucidrains commented 1 year ago

i'm currently going the encodec / voco route for starters

however, if you want to PR in the log mel encoder / hifigan decoder, as in the paper, i can look into disentangling the dimensions earlier

yzmyyff commented 1 year ago

Okay, I'll look into it

lucidrains commented 1 year ago

@yzmyyff oh i meant the encoding and decoding logic, like this. are you doing log mel <-> hifigan?

lucidrains commented 1 year ago

@yzmyyff actually your PR looks good! thank you!

lucidrains commented 1 year ago

@yzmyyff i'll just take care of the mel <-> hifigan encoder / decoder this week