Is an embedding necessary to train multiple speakers? Or could a one_hot encoding be sufficient?
Not necessary, but I like embeddings. With an embedding we can get a semantically meaningful interpretation, for example male / female:
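For illustration, a minimal sketch of the two options (names here are illustrative, not this library's API):

import torch
import torch.nn as nn

num_speakers, embed_dim = 5, 3
embed = nn.Embedding(num_speakers, embed_dim)

speaker_ids = torch.LongTensor([0, 3])             # a batch of speaker ids
g_embedded = embed(speaker_ids)                    # (2, 3): learned, dense
g_one_hot = torch.eye(num_speakers)[speaker_ids]   # (2, 5): fixed, sparse

# The embedding rows are trained jointly with the model, so nearby rows can
# end up encoding attributes such as male / female; one-hot rows are
# orthogonal by construction and carry no learned structure.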
Yes, makes sense.
But I still don't get how the global conditioning is supposed to work.
In your test, you condition on [[[0]]] as g. Is the speaker id expected instead of a one-hot?
import numpy as np
import torch
import torch.nn.functional as F
import wavenet_vocoder
from nnmnkwii import preprocessing as P
from numpy import linspace, sin, pi, int16
from torch.autograd import Variable

sr = 4000

# tone synthesis
def note(freq, duration, amp=1, rate=sr):
    t = linspace(0, duration, duration * rate)
    data = sin(2 * pi * freq * t) * amp
    return data.astype(int16)

mu = 256
tone = [0] * 5
tone[0] = note(140, 2, amp=10000)
tone[1] = note(240, 2, amp=10000)
tone[2] = note(340, 2, amp=10000)
tone[3] = note(440, 2, amp=10000)
tone[4] = note(540, 2, amp=10000)
tone = np.array(tone)

# scale to [-0.95, 0.95] before mu-law quantization
tone_n = ((tone - tone.min()) / (tone.max() - tone.min())) * 1.9 - 0.95
tone_mu = np.array([P.mulaw_quantize(t, mu) for t in tone_n])

speakers = list(range(5))
length = 8000
d = 32
num_speakers = 5
dim_speaker_embed = 3

wavenet = wavenet_vocoder.WaveNet(
    out_channels=d,
    kernel_size=4,
    residual_channels=d,
    gate_channels=d,
    skip_out_channels=d,
    cin_channels=d,
    gin_channels=dim_speaker_embed,
    n_speakers=num_speakers,
)

B = 5  # batch size
opti = torch.optim.Adam(wavenet.parameters(), lr=1e-4)
train_loss = []

X, C, G = [], [], []
for speaker, x in enumerate(tone_mu):
    speaker_one_hot = np.zeros((num_speakers,), dtype=np.int64)
    speaker_one_hot[speaker] = 1  # speaker / tone frequency
    # + or - based on current amplitude / some mock local conditioning
    cond = (np.identity(2)[np.array((np.sign(tone[speaker]) + 1) / 2, dtype=int)]).T
    x = np.identity(mu)[x].T
    X.append(x)
    C.append(cond)
    G.append(speaker_one_hot)

X = np.array(X)
C = np.array(C)
G = np.array(G)

assert X.shape == (B, mu, length)
assert C.shape == (B, 2, length)
assert G.shape == (B, num_speakers)

x = Variable(torch.from_numpy(X))                # torch.Size([5, 256, 8000])
cond = Variable(torch.from_numpy(C))             # torch.Size([5, 2, 8000])
speaker_one_hot = Variable(torch.from_numpy(G))  # torch.Size([5, 5])

out = wavenet.forward(x=x, c=cond, g=speaker_one_hot)

loss_1_reconst = F.cross_entropy(out, x)
loss_1_reconst.backward(retain_graph=True)
opti.step()
train_loss.append(loss_1_reconst)
print(loss_1_reconst)
Throws:

Traceback (most recent call last):
    main()
  ...
  File "...site-packages/wavenet_vocoder/wavenet.py", line 164, in forward
    g_bct = _expand_global_features(B, T, g, bct=True)
  File "...site-packages/wavenet_vocoder/wavenet.py", line 32, in _expand_global_features
    g_bct = g.expand(B, -1, T)
RuntimeError: The expanded size of the tensor (8000) must match the existing size (5) at non-singleton dimension 2. at /Users/soumith/code/builder/wheel/pytorch-src/torch/lib/TH/generic/THTensor.c:309
I can reproduce. I will look into it later today.
OK, I figured it out: the speaker shouldn't be encoded as a one-hot vector, just as the id / a single Long.
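So for the snippet above, the call would become something like this (an untested sketch based on this comment, reusing the variables defined earlier):

# Pass speaker ids as a LongTensor, not a one-hot matrix; the model does
# the embedding lookup itself because gin_channels / n_speakers are set.
g_ids = Variable(torch.LongTensor(speakers))   # (B,) = [0, 1, 2, 3, 4]
out = wavenet.forward(x=x, c=cond, g=g_ids)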
Sorry about that. I noticed just now. I am working on support for one-hot vectors as well, and will also clarify the docstrings.
Yeah, the docstring confused me the most ;)
https://gist.github.com/r9y9/47df1b63680275258014359337544d4b
Now you can use a one-hot vector as well. Let me know if you still find something confusing. https://github.com/r9y9/wavenet_vocoder/commit/9aced5c8037ec9cc748ff17f4d6e85c967bb2760
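If I read the commit correctly, both calling conventions should now be accepted (untested sketch, reusing the variables from the snippet above):

g_ids = Variable(torch.LongTensor(speakers))    # (B,) speaker ids
# (B, num_speakers) one-hot, row i = speaker i, since here B == num_speakers
g_1hot = Variable(torch.from_numpy(np.identity(num_speakers, dtype=np.int64)))
out_a = wavenet.forward(x=x, c=cond, g=g_ids)
out_b = wavenet.forward(x=x, c=cond, g=g_1hot)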
https://github.com/r9y9/wavenet_vocoder/blob/4e517c0f1ae2471380b76c105c4ae297ebe834af/wavenet_vocoder/wavenet.py#L187-L192
I am trying to get the multi-speaker conditioning to work (using it as a lib). So g.shape is (B x C'') and the embedding_dim is D, so line 191 gives (B x D x C''). Yet _expand_global_features expects a g of shape (B x C) or (B x C x 1) in order to expand C over the whole sequence.
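For reference, my understanding of the expected expansion (a paraphrase of _expand_global_features, not the actual library code):

def expand_global_features_sketch(B, T, g):
    # g: (B x C) or (B x C x 1) -> (B x C x T); the same global (speaker)
    # vector is repeated at every timestep.
    if g is None:
        return None
    if g.dim() == 2:
        g = g.unsqueeze(-1)            # (B x C) -> (B x C x 1)
    return g.expand(B, -1, T).contiguous()

So a (B x D x C'') tensor with C'' > 1 cannot be broadcast along the time axis, which is exactly the mismatch above.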