Hi, I'm wondering if you could help me. I'm trying to build your speaker-dependent vocoder in TensorFlow, but I'm struggling to understand how the auxiliary input is fed to the network. Is it passed through its own layer in parallel with the sample values (two parallel layers), with the outputs combined at a later layer? If you could point me to a textbook or article on auxiliary inputs/conditioning networks, I would be eternally grateful. I've searched many times and can't find anything that gives a general understanding of this.
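To make the question concrete, here is a rough sketch of what I mean by "two parallel layers". It is only my guess at the structure, not taken from your code: the function names, the layer sizes, the input values, and the choice to combine the branches with an element-wise sum followed by tanh are all my own assumptions. It is plain Python rather than TensorFlow so that it runs on its own.

```python
import math
import random


def dense(x, weights):
    """Multiply vector x (length n) by a weight matrix (m rows of length n)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng):
    """Small random weights, standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def conditioned_layer(samples, aux, w_samples, w_aux):
    # Branch 1: a layer applied to the past sample values.
    a = dense(samples, w_samples)
    # Branch 2: a parallel layer applied to the auxiliary features.
    b = dense(aux, w_aux)
    # Assumption: the branches are combined by an element-wise sum,
    # then passed through a nonlinearity.
    return [math.tanh(p + q) for p, q in zip(a, b)]


rng = random.Random(0)
hidden = 4
samples = [0.1, -0.2, 0.05]  # hypothetical past sample values
aux = [1.0, 0.5]             # hypothetical auxiliary features
w_samples = random_matrix(hidden, len(samples), rng)
w_aux = random_matrix(hidden, len(aux), rng)

out = conditioned_layer(samples, aux, w_samples, w_aux)
print(len(out))
```

Is this roughly how the auxiliary input enters the network, or is it combined with the samples in some other way?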