nakosung opened this issue 7 years ago
Did Heiga mention 30 layers or (1, 2, 4, ..., 1024) in the talk? Three stacks of (1, 2, 4, ..., 1024) is actually 33 layers.
@lmaxwell Yes. He mentioned "1024" and "three-times". (https://youtu.be/nsrSrYtKkT8?t=24m41s)
@nakosung Thanks, I'll try this. Also, from the image above, the residual channel is 512, which is really big.
@lmaxwell Yes. My Titan XP can't handle such a big model.
@nakosung From the image above, the residual channel is 512, the dilation channel is 512, and the skip channel is 256. Is that right?
@ucasyouzhao I think so. :)
Hey, I just looked at the video. I think you guys are right about the residual, dilation, and skip channels. But I think the dilation stack is 30 layers, (1, ..., 512, 1, ..., 512, 1, ..., 512). This agrees with the WaveNet paper; on page 3 they show exactly this stack. It gives 30 layers, whereas going up to 1024 would give 33, as @lmaxwell points out. He also says that stacking the thing three times lets it cover about 3,000 timesteps. He says 1024 because that is roughly the receptive field of a single stack.
By the way, he says something like, "At this moment you were using a 10 stack, so 1024," suggesting that perhaps they have moved on to using different parameters.
Yeah, I think I agree with @greaber. (1, ..., 512) sums to ~1024 (1023 exactly). Three stacks gives a receptive field of about 3072 samples, which is about 192 ms at 16 kHz.
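For what it's worth, here is a quick back-of-the-envelope check of that arithmetic (assuming `filter_width = 2` and 16 kHz audio, both assumptions on my part):

```python
# Sanity check: layer count and receptive field for 3 x (1, 2, ..., 512),
# assuming filter_width = 2 and 16 kHz audio.
dilations = [2 ** i for i in range(10)] * 3   # (1, 2, 4, ..., 512) repeated three times
filter_width = 2
sample_rate = 16000

# Each dilated causal conv with width k widens the receptive field by (k - 1) * dilation.
receptive_field = (filter_width - 1) * sum(dilations) + filter_width

print(len(dilations), "layers")                          # 30 layers
print(receptive_field, "samples")                        # 3071 samples, i.e. ~3072
print(1000.0 * receptive_field / sample_rate, "ms")      # ~192 ms
```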
Also, the 512 residual channels make me sad. They quickly overloaded the memory on my two GTX 1080s :/
I updated the JSON. :)
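For anyone following along, this is roughly the kind of settings file being discussed. The key names here are my own sketch (loosely modeled on a wavenet_params.json-style layout), so treat the exact field names as assumptions:

```python
# Sketch of the hyperparameters discussed above, dumped as JSON.
# Key names are illustrative, not necessarily what the repo uses.
import json

params = {
    "sample_rate": 16000,                          # assumed, as in the paper
    "filter_width": 2,
    "dilations": [2 ** i for i in range(10)] * 3,  # 3 x (1, 2, ..., 512) = 30 layers
    "residual_channels": 512,
    "dilation_channels": 512,
    "skip_channels": 256,
    "quantization_channels": 256,                  # 8-bit mu-law, as in the paper
}
print(json.dumps(params, indent=4))
```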
The receptive field, as @dannybtran states, is 192 ms. However, the paper mentions that the 'Multi-Speaker Speech Generation' experiments used a model with a receptive field of about 300 ms, and it was 240 ms for 'Text-to-Speech'. Since this talk is about text-to-speech, it would be wise to have two separate models.
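To put those numbers side by side (assuming 16 kHz audio), here is what each receptive field works out to in samples; the 3 x (1..512) stack falls short of both figures from the paper:

```python
# Receptive fields from the discussion, converted to samples at an assumed 16 kHz rate.
sample_rate = 16000
for label, ms in [("this talk, 3 x (1..512)", 192),
                  ("paper, text-to-speech", 240),
                  ("paper, multi-speaker generation", 300)]:
    print(f"{label}: {ms} ms -> {sample_rate * ms // 1000} samples")
```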
According to the recent talk (https://www.youtube.com/watch?v=nsrSrYtKkT8), the dilations are (1, 2, 4, ..., 512) (1, 2, 4, ..., 512) (1, 2, 4, ..., 512), i.e. 30 layers.