Closed voodoohop closed 4 years ago
Hi, thanks for the in-depth study and posting all the resources.
This is actually expected behavior at the moment. As we said in the paper, when training on a full dataset like NSynth the f0 encoder model can get a small loss and learn to generate audio that a CREPE model classifies as having the right f0, but it does not currently estimate the correct f0 internally. It often falls into the local minimum of predicting an integer multiple of f0 and then doing its best to match the data by manipulating the harmonic distribution. This problem is even more exacerbated when fitting a single datapoint, since you lose the stochasticity of SGD that can help the optimization escape such minima.
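A quick numpy illustration of why an integer multiple of f0 is such a stubborn local minimum (the 220 Hz fundamental and the harmonic counts here are arbitrary example values, not from the thread): every partial of an octave-up estimate lands exactly on an even-numbered partial of the true f0, so the model can match much of the spectrum by reshaping its harmonic distribution, and the reconstruction loss gives little gradient signal to fix the f0 itself.

```python
import numpy as np

f0 = 220.0            # hypothetical true fundamental (Hz)
n_harmonics = 8       # partials per harmonic stack

# Partial frequencies of the true f0 and of an octave-error estimate (2*f0).
true_partials = f0 * np.arange(1, n_harmonics + 1)
octave_partials = (2 * f0) * np.arange(1, n_harmonics + 1)

# The partials both stacks can produce: the octave-up stack covers every
# even-numbered partial of the true tone, and only misses the odd ones.
overlap = np.intersect1d(true_partials, octave_partials)
print(overlap)
```

Within the range of the true stack, the shared partials are 440, 880, 1320 and 1760 Hz, i.e. half the spectrum is reachable without ever correcting f0.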
We have some follow-up work that overcomes these challenges, and we are working on getting it prepared for a conference submission next month, at which time I'll clean it up and submit it to the repo. Sorry for the delay, or if the original paper was misleading, but I think there are actually several ways to tackle this challenge, and we should hopefully have them robust and added soon.
Understood. That's good to know and thanks for all the amazing work. I'm really excited about the developments. Should I close this issue for now?
Yah, and I look forward to posting more when I have it :).
Description
I am having trouble training models that don't rely on an f0 estimate from the CREPE pitch estimator. In my tests, whenever fundamental frequency estimation is part of the differentiable graph, I cannot get any convergence of the additive synthesizer at all.
To reproduce it, I create a batch consisting of one sample generated with the additive synth as in the synths and effects tutorial notebook. I then try overfitting an autoencoder on that one sample, with code adapted from the training on one sample notebook.
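The target sample described above can be sketched in plain numpy. This is not the ddsp implementation, just a minimal stand-in additive synth (a fixed harmonic distribution weighting sinusoids at integer multiples of f0); the f0, amplitudes, sample rate, and duration are illustrative values, not the ones from the notebook.

```python
import numpy as np

def additive_synth(f0_hz, harmonic_amps, sample_rate=16000, duration=1.0):
    """Sum of sinusoids at integer multiples of f0, weighted per harmonic.

    A plain-numpy sketch of additive synthesis, not the ddsp synth itself.
    """
    t = np.arange(int(sample_rate * duration)) / sample_rate
    harmonics = np.arange(1, len(harmonic_amps) + 1)
    # One sinusoid per harmonic: shape (n_harmonics, n_samples).
    phases = 2 * np.pi * f0_hz * harmonics[:, None] * t[None, :]
    return (np.asarray(harmonic_amps)[:, None] * np.sin(phases)).sum(axis=0)

# A single synthetic target, analogous to the one-sample batch in the notebook.
audio = additive_synth(f0_hz=440.0, harmonic_amps=[1.0, 0.5, 0.25, 0.125])
print(audio.shape)  # (16000,)
```

Because the target is generated by the same family of synthesizer the decoder uses, a decoder with the correct f0 should be able to reconstruct it essentially perfectly.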
The decoder uses an additive synthesizer too, so in theory it should easily reconstruct the sample. Here is a Colab notebook that demonstrates the behavior. To make the model converge, replace
f0_encoder=f0_encoder
with
f0_encoder=None

Results
Original Audio
Reconstruction with an f0 encoder (3000 training steps)
After the first few training steps, the loss stops improving, plateauing around 18 to 19.
Reconstruction with f0 from Crepe (100 training steps)
The model converges immediately, with the loss dropping to about 3 in a short time.
Things I have tried
This happens even when just trying to fit one sample. I also tried fitting multiple samples, without success.
To Reproduce
Colab notebook