r9y9 / wavenet_vocoder

WaveNet vocoder
https://r9y9.github.io/wavenet_vocoder/

Multi speaker embedding not working?! #27

Closed pfriesch closed 6 years ago

pfriesch commented 6 years ago

https://github.com/r9y9/wavenet_vocoder/blob/4e517c0f1ae2471380b76c105c4ae297ebe834af/wavenet_vocoder/wavenet.py#L187-L192

I am trying to get the multi-speaker conditioning to work (using the package as a library). g.shape is (B x C'') and the embedding_dim is D, so line 191 gives (B x D x C''). However, _expand_global_features expects g to have shape (B x C) or (B x C x 1), so that it can expand C over the whole sequence.
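To make the shapes concrete, here is a minimal sketch using plain `torch.nn.Embedding` rather than the library's own code; the sizes `B`, `n_speakers` and `D` are illustrative assumptions. An embedding layer appends one trailing dimension of size D to whatever index tensor it receives, so a one-hot (B x n_speakers) input is treated as n_speakers separate indices per batch item.

```python
import torch
import torch.nn as nn

B, n_speakers, D = 5, 5, 3  # illustrative sizes, not taken from the library
embed = nn.Embedding(n_speakers, D)

# Speaker IDs: one integer index per batch item, shape (B, 1).
ids = torch.arange(B).view(B, 1)
g_ids = embed(ids).transpose(1, 2)          # (B, 1, D) -> (B, D, 1)

# One-hot rows: every 0/1 entry is looked up as an index, shape (B, n_speakers).
one_hot = torch.eye(n_speakers).long()
g_one_hot = embed(one_hot).transpose(1, 2)  # (B, n_speakers, D) -> (B, D, n_speakers)

print(tuple(g_ids.shape))      # (5, 3, 1)
print(tuple(g_one_hot.shape))  # (5, 3, 5)
```

Only the first result has a trailing dimension of size 1 that can later be stretched along the time axis.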

Is an embedding necessary to train multiple speakers? Or could a one_hot encoding be sufficient?

r9y9 commented 6 years ago

> Is an embedding necessary to train multiple speakers? Or could a one_hot encoding be sufficient?

https://www.quora.com/What-is-the-difference-between-using-word2vec-vs-one-hot-embeddings-as-input-to-classifiers

Not necessary, but I like embeddings. With an embedding we can get a semantically meaningful interpretation, for example male / female:

[Image: speaker_embedding]
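The relationship between the two encodings can be sketched in a few lines of NumPy: multiplying a one-hot row by a weight matrix selects one row of that matrix, which is exactly what an embedding lookup does. The table `W` and its sizes below are made-up values for illustration, not parameters from this repository.

```python
import numpy as np

n_speakers, D = 5, 3                       # illustrative sizes
rng = np.random.default_rng(0)
W = rng.standard_normal((n_speakers, D))   # hypothetical embedding table

speaker_id = 2
one_hot = np.eye(n_speakers)[speaker_id]   # [0, 0, 1, 0, 0]

via_lookup = W[speaker_id]   # embedding lookup: pick row 2
via_matmul = one_hot @ W     # one-hot times the table: also row 2

print(np.allclose(via_lookup, via_matmul))  # True
```

So a one-hot input followed by a learned linear layer can represent the same thing; the embedding just skips the multiplication.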

pfriesch commented 6 years ago

Yes, makes sense.

But I still don't get how the global conditioning is supposed to work. In your test, you condition on [[[0]]] as g. Is the speaker ID expected instead of a one-hot vector?

pfriesch commented 6 years ago

import numpy as np
import torch
import torch.nn.functional as F
import wavenet_vocoder
from nnmnkwii import preprocessing as P
from numpy import linspace, sin, pi, int16
from torch.autograd import Variable

sr = 4000

# tone synthesis
def note(freq, len, amp=1, rate=sr):
    t = linspace(0, len, len * rate)
    data = sin(2 * pi * freq * t) * amp
    return data.astype(int16)

mu = 256

tone = [0] * 5
tone[0] = note(140, 2, amp=10000)
tone[1] = note(240, 2, amp=10000)
tone[2] = note(340, 2, amp=10000)
tone[3] = note(440, 2, amp=10000)
tone[4] = note(540, 2, amp=10000)

tone = np.array(tone)

tone_n = ((tone - (tone.min())) / ((tone.max()) - (tone.min()))) * 1.9 - 0.95

tone_mu = np.array([P.mulaw_quantize(t, mu) for t in tone_n])

speakers = list(range(5))
length = 8000
d = 32
num_speakers = 5
dim_speaker_embed = 3

wavenet = wavenet_vocoder.WaveNet(
    out_channels=d,
    kernel_size=4,
    residual_channels=d,
    gate_channels=d,
    skip_out_channels=d,
    cin_channels=d,
    gin_channels=dim_speaker_embed,
    n_speakers=num_speakers
)

B = 5  # batch size
opti = torch.optim.Adam(wavenet.parameters(), lr=1e-4)

train_loss = []

X, C, G = [], [], []

for speaker, x in enumerate(tone_mu):
    speaker_one_hot = np.zeros((num_speakers), dtype=np.int64)
    speaker_one_hot[speaker] = 1  # speaker / tone frequency

    # + or - based on curr amplitude / some mock local cond
    cond = (np.identity(2)[np.array((np.sign(tone[speaker]) + 1) / 2, dtype=int)]).T

    x = np.identity(mu)[x].T

    X.append(x)
    C.append(cond)
    G.append(speaker_one_hot)

X = np.array(X)
C = np.array(C)
G = np.array(G)

assert X.shape == (B, mu, length)
assert C.shape == (B, 2, length)
assert G.shape == (B, num_speakers)

x = Variable(torch.from_numpy(X))  # torch.Size([5, 256, 8000])
cond = Variable(torch.from_numpy(C))  # torch.Size([5, 2, 8000])
speaker_one_hot = Variable(torch.from_numpy(G))  # torch.Size([5, 5])

# g is one-hot here (B x num_speakers); this call raises the RuntimeError shown below
out = wavenet.forward(x=x, c=cond, g=speaker_one_hot)

loss_1_reconst = F.cross_entropy(out, x)
loss_1_reconst.backward(retain_graph=True)
opti.step()
train_loss.append(loss_1_reconst)
print(loss_1_reconst)

Throws:

Traceback (most recent call last):
    main()
...
  File "...site-packages/wavenet_vocoder/wavenet.py", line 164, in forward
    g_bct = _expand_global_features(B, T, g, bct=True)
  File "...site-packages/wavenet_vocoder/wavenet.py", line 32, in _expand_global_features
    g_bct = g.expand(B, -1, T)
RuntimeError: The expanded size of the tensor (8000) must match the existing size (5) at non-singleton dimension 2. at /Users/soumith/code/builder/wheel/pytorch-src/torch/lib/TH/generic/THTensor.c:309
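
The error can be reproduced without the library. As a minimal sketch (the sizes are illustrative), `Tensor.expand` can only stretch dimensions of size 1, so a (B x D x 1) tensor expands along time while a (B x D x 5) tensor does not:

```python
import torch

B, D, T = 5, 3, 8000  # illustrative sizes

g_ok = torch.zeros(B, D, 1)
print(tuple(g_ok.expand(B, -1, T).shape))  # (5, 3, 8000)

g_bad = torch.zeros(B, D, 5)  # last dimension is not a singleton
try:
    g_bad.expand(B, -1, T)
    raised = False
except RuntimeError as err:
    raised = True
    print("RuntimeError:", err)
```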

r9y9 commented 6 years ago

I can reproduce. I will look into it later today.

pfriesch commented 6 years ago

OK, I figured it out: the speaker shouldn't be one-hot encoded. It should be passed as the speaker ID, i.e. a single Long per utterance.
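For the script above, that would mean replacing the one-hot rows in `G` with integer IDs. A minimal sketch, assuming the model accepts a (B x 1) array of int64 speaker IDs as `g` (an inference from this thread, so check the docstring of the version you use):

```python
import numpy as np

num_speakers = 5
G_one_hot = np.eye(num_speakers, dtype=np.int64)  # stand-in for G from the script

# argmax recovers the index of the single 1 in each row.
speaker_ids = G_one_hot.argmax(axis=1).astype(np.int64).reshape(-1, 1)

print(speaker_ids.ravel().tolist())  # [0, 1, 2, 3, 4]
print(speaker_ids.shape)             # (5, 1)
```

`torch.from_numpy(speaker_ids)` then yields a LongTensor, which should be usable as `g` in place of `speaker_one_hot`.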

r9y9 commented 6 years ago

Sorry about that. I noticed it just now. I am working on support for one-hot vectors as well.

r9y9 commented 6 years ago

I will also clarify the docstrings.

pfriesch commented 6 years ago

Yeah, the docstring confused me the most ;)

r9y9 commented 6 years ago

https://gist.github.com/r9y9/47df1b63680275258014359337544d4b

Now you can use a one-hot vector as well. Let me know if you still find anything confusing. https://github.com/r9y9/wavenet_vocoder/commit/9aced5c8037ec9cc748ff17f4d6e85c967bb2760