ibab / tensorflow-wavenet

A TensorFlow implementation of DeepMind's WaveNet paper

updating wavenet_params #227

Open nakosung opened 7 years ago

nakosung commented 7 years ago

According to a recent talk (https://www.youtube.com/watch?v=nsrSrYtKkT8), the dilations are (1, 2, 4, ..., 512), repeated three times, for 30 layers in total.

[image: slide from the talk showing the model hyperparameters]
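
In this repo that pattern would go into the `dilations` field of wavenet_params.json; here is a minimal Python sketch of the schedule described above (my reading of the talk, so treat the exact values as an assumption):

```python
# Three stacks of dilations 1, 2, 4, ..., 512 (2**0 .. 2**9): 30 layers in total.
dilations = [2 ** i for i in range(10)] * 3

print(len(dilations))   # 30
print(dilations[:10])   # [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
```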

lmaxwell commented 7 years ago

Did Heiga mention 30 layers or (1, 2, 4, ..., 1024) in the talk? 3 stacks of (1, 2, 4, ..., 1024) would actually be 33 layers.

nakosung commented 7 years ago

@lmaxwell Yes. He mentioned "1024" and "three times". (https://youtu.be/nsrSrYtKkT8?t=24m41s)

lmaxwell commented 7 years ago

@nakosung Thanks, I'll try this. Also, from the image above, the residual channel count is 512, which is really big.

nakosung commented 7 years ago

@lmaxwell Yes. My Titan Xp can't handle such a big model.

ucasyouzhao commented 7 years ago

@nakosung From the image above, the residual channels are 512, the dilation channels are 512, and the skip channels are 256. Is that right?

nakosung commented 7 years ago

@ucasyouzhao I think so. :)

greaber commented 7 years ago

Hey, I just looked at the video. I think you guys are right about the residual, dilation, and skip channels. But I think the dilation stack is 30 layers: (1, ..., 512, 1, ..., 512, 1, ..., 512). This agrees with the WaveNet paper; on page 3 they show exactly this stack. It gives 30 layers, whereas going up to 1024 would give 33, as @lmaxwell points out. He also says that stacking it three times lets the model see about 3,000 timesteps. He says 1024 because that is roughly the receptive field of a single stack.
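
To spell out the counting (my own sanity check, not something stated in the talk):

```python
stack_512 = [2 ** i for i in range(10)]   # 1 .. 512 -> 10 layers per stack
stack_1024 = [2 ** i for i in range(11)]  # 1 .. 1024 -> 11 layers per stack

print(3 * len(stack_512))    # 30 layers
print(3 * len(stack_1024))   # 33 layers

# With filter width 2, a single (1..512) stack spans roughly sum(dilations) + 1 samples:
print(sum(stack_512) + 1)    # 1024 -- the "1024" Heiga quotes
```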

By the way, he says something like, "At this moment you were using a 10 stack, so 1024," suggesting that perhaps they have moved on to using different parameters.

dannybtran commented 7 years ago

Yeah, I think I agree with @greaber. (1, ..., 512) sums to ~1024. Three stacks gives a receptive field of about 3072 samples, which is about 192 ms at 16 kHz.
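
A quick check of that arithmetic (assuming 16 kHz audio and filter width 2; the exact off-by-one depends on how the initial causal layer is counted):

```python
dilations = [2 ** i for i in range(10)] * 3  # 3 x (1 .. 512)
sample_rate = 16000
filter_width = 2

# Approximate receptive field of stacked dilated causal convolutions.
receptive_field = (filter_width - 1) * sum(dilations) + filter_width
print(receptive_field)                          # 3071 samples (~3072)
print(1000.0 * receptive_field / sample_rate)   # ~192 ms
```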

Also, the 512 residual channels make me sad. They quickly overloaded the memory on my two GTX 1080s :/

nakosung commented 7 years ago

I updated the JSON. :)
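
For anyone skimming the thread, here is a sketch of hyperparameters matching the discussion above, written as a Python dict mirroring wavenet_params.json. The values (and whether they match the committed JSON exactly) are my assumption from this thread:

```python
import json

# Hypothetical values read off the slide / discussion above.
wavenet_params = {
    "sample_rate": 16000,
    "filter_width": 2,
    "quantization_channels": 256,
    "dilations": [2 ** i for i in range(10)] * 3,  # 3 x (1 .. 512), 30 layers
    "residual_channels": 512,
    "dilation_channels": 512,
    "skip_channels": 256,
    "use_biases": True,
}

print(json.dumps(wavenet_params, indent=2))
```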

vjravi commented 7 years ago

The receptive field, as @dannybtran states, is 192 ms. However, in the paper they mention that the 'Multi-speaker Speech Generation' experiments used a model with a receptive field of about 300 ms, and it was 240 ms in the case of 'Text-to-Speech'. Since this talk is about text-to-speech, it would be wise to have two separate models.
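
Converting those receptive fields to sample counts at 16 kHz (simple arithmetic, not a claim about DeepMind's exact configurations):

```python
sample_rate = 16000  # Hz

# Receptive fields quoted in the paper vs. the config discussed in this thread.
for name, ms in [("multi-speaker", 300), ("text-to-speech", 240), ("this thread", 192)]:
    samples = sample_rate * ms // 1000
    print("{}: {} ms -> {} samples".format(name, ms, samples))
# multi-speaker: 300 ms -> 4800 samples
# text-to-speech: 240 ms -> 3840 samples
# this thread: 192 ms -> 3072 samples
```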