ibab / tensorflow-wavenet

A TensorFlow implementation of DeepMind's WaveNet paper
MIT License

More 'articulate' sounds and Meaning of --gc_channels=x and --gc_cardinality=x #321

Open albertious opened 6 years ago

albertious commented 6 years ago

Hi,

I have been getting some interesting results in the last few days.

I'm wondering if someone can help me understand a couple of things.

The default sample size in train.py is SAMPLE_SIZE = 100000.

If I change this to, say, 200000, will that increase my chances of getting more 'articulate' gibberish sounds, beyond grunts that start and end abruptly?

Also, assuming I am a total idiot, can anyone explain in plain English what --gc_channels=x and --gc_cardinality=x mean?

Thanks in advance

A.

vjravi commented 6 years ago

Hey A,

SAMPLE_SIZE is the size of the mini-batch you are using. Say your network has a receptive field of 3072 samples. You are effectively using that network as a filter of width 3072 that produces one output sample; imagine sliding that filter over the 1D input, which is of length 100000. Once you have done that, you evaluate the loss and update the weights. Changing SAMPLE_SIZE only changes how frequently you update your weights and how big a batch you evaluate in one go on your GPU.
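To make the sliding-filter picture concrete, here is a small sketch (the numbers 3072 and 100000 come from the explanation above; the exact receptive field depends on your dilation settings, so treat these as illustrative):

```python
# Illustrative only: how SAMPLE_SIZE interacts with the receptive field.
receptive_field = 3072   # samples the network "sees" to predict one output
sample_size = 100000     # audio chunk processed per training step (SAMPLE_SIZE)

# Sliding a width-3072 filter over a length-100000 input yields this many
# predictions (and loss terms) per weight update:
outputs_per_step = sample_size - receptive_field + 1
print(outputs_per_step)  # 96929

# Doubling SAMPLE_SIZE roughly doubles the work per update, not the
# network's capacity:
print(2 * sample_size - receptive_field + 1)  # 196929
```

So a larger SAMPLE_SIZE means fewer, larger updates per epoch, with more GPU memory used per step.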

The speakers are given one-hot encoded IDs. --gc_cardinality indicates how many speakers are present in your dataset, so it gives the length of the one-hot encoded ID. --gc_channels denotes the length of the vector you actually want as the input to each layer of your WaveNet.

Say you have 100 speakers, i.e. your speaker IDs are 100-dimensional one-hot vectors. Now, you want to feed the speaker ID to each layer as a vector of length 2. Then you can configure --gc_cardinality=100 and --gc_channels=2. This creates a 100x2 matrix that projects every speaker ID to a corresponding vector of length 2, which is used as the conditioning input h to WaveNet. The projection matrix is also learnable.
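The one-hot projection above is just an embedding lookup. A minimal NumPy sketch (the matrix is randomly initialized here; in training it would be a learnable variable):

```python
import numpy as np

num_speakers = 100   # --gc_cardinality
gc_channels = 2      # --gc_channels

# Stand-in for the learnable projection matrix (100 x 2).
rng = np.random.default_rng(0)
embedding = rng.normal(size=(num_speakers, gc_channels))

# One-hot ID for, say, speaker 42.
speaker_id = np.zeros(num_speakers)
speaker_id[42] = 1.0

# Multiplying a one-hot vector by the matrix just selects one row,
# which is why frameworks implement this as an embedding lookup.
h = speaker_id @ embedding
assert np.allclose(h, embedding[42])
print(h.shape)  # (2,)
```

The resulting length-2 vector h is what gets broadcast to each layer as global conditioning.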

Best! VJ

albertious commented 6 years ago

Thanks VJ! This begins to shine some light on the issues.

So just so I'm sure I understand....

The sample size won't affect the 'qualities' of the final model; it's just how much of the .wav is being analyzed in GPU RAM at any one moment? (I'm going to ignore the stuff about 1D inputs and receptive fields for the moment, because I am of inferior intelligence!)

Cardinality just specifies how many different voices are in the set of recordings? How does it tell which voices are which? Does the corpus need metadata, are voices separated by folder, or does it detect this by itself?

In practical terms how does changing the vector length affect the final generated output?

thanks again. :) a.

vjravi commented 6 years ago

The way this code has been written, it separates the speakers by the name of the folders, e.g. p001 for speaker 1. Changing the vector length would change the number of parameters in WaveNet that deal with the speaker ID. You can play around with this for a while.
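As a sketch of the folder-name convention described above (this mirrors the idea, not the repo's exact reader code, which uses a configurable regex):

```python
import re

def speaker_id_from_path(path):
    """Extract a numeric speaker ID from a VCTK-style path like
    'corpus/p001/p001_003.wav'. Returns None if no ID is found."""
    match = re.search(r"p(\d+)", path)
    return int(match.group(1)) if match else None

print(speaker_id_from_path("corpus/p001/p001_003.wav"))  # 1
print(speaker_id_from_path("corpus/p042/p042_117.wav"))  # 42
```

The extracted integer is then one-hot encoded against --gc_cardinality before being projected down to --gc_channels dimensions.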

As to how the network identifies the speakers, the expectation is that the network is powerful enough to learn this from the speaker IDs alone. It should be able to learn which speakers sound similar.

I suggest trying a smaller dataset first (3 male and 3 female speakers) and seeing whether it works. Next, move on to text analysis. I think the bulk of the information about the speakers will come once text analysis is available and your network is bigger, with context stacks.