Idea: sample_rate agnostic demo/tutorial

Forevian commented 4 years ago

I really enjoy tinkering with ddsp. It would be a bit more approachable if we could experiment more easily with 44.1kHz or the other standard audio sample rates. Could you perhaps make this more straightforward in the demos, or alternatively document what we should set differently to accommodate other sample rates across the whole training-synthesis pipeline, or at least share some best practices?

The ddsp_prepare_tfrecord function, for example, is not very forgiving with custom sample rates (it asserts, perhaps because crepe's 16kHz resampling produces fractional padding?).

I suppose 16kHz is the default because we are stuck in the world of speech synthesis (of the '90s?), but what might be acceptable for telephony is just not acceptable for many audio use cases.

I hope you don't mind me opening an issue just because of an idea/rant, feel free to close it anytime, and keep up doing these amazing contributions to the world of audio/music!

jesseengel commented 4 years ago

Hi Andras,

Good recommendation! This is actually something that Hanoi is currently looking into, so we're on the case :). The tying to 16kHz is actually more for CREPE f0 detection than anything else, but it's not a hard constraint. We'll just need to change the data processing pipeline and tweak some model parameters (# of harmonics, sizes of ffts).
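
As a rough illustration of how those two parameters scale with sample rate, the sketch below counts the harmonics of a given f0 that fit below Nyquist, and picks the power-of-two FFT size that covers at least the same window duration as a reference size at 16 kHz. The function names and the 2048-sample reference are assumptions made for illustration, not DDSP's API or defaults.

```python
import math


def max_harmonics(f0_hz: float, sample_rate: int) -> int:
    """Number of harmonics k * f0 (k >= 1) that lie strictly below Nyquist."""
    nyquist = sample_rate / 2.0
    return max(0, math.ceil(nyquist / f0_hz) - 1)


def scaled_fft_size(ref_fft_size: int, ref_sample_rate: int, sample_rate: int) -> int:
    """Smallest power of two spanning at least the reference window duration."""
    target = ref_fft_size * sample_rate / ref_sample_rate
    return 2 ** math.ceil(math.log2(target))


if __name__ == "__main__":
    # Hypothetical reference: a 2048-sample window at 16 kHz, f0 = 440 Hz.
    for sr in (16000, 44100, 48000):
        print(sr, max_harmonics(440.0, sr), scaled_fft_size(2048, 16000, sr))
```

For a 440 Hz note this gives 18 harmonics at 16 kHz but 54 at 48 kHz, which is why a harmonic count tuned for 16 kHz would leave most of the upper spectrum unused at higher rates.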

Forevian commented 4 years ago

Hi, I was just wondering whether there has been any progress on this? I've seen a related pull request failing its tests...

jesseengel commented 4 years ago

Thanks for the follow up. We just merged #44 which creates a data pipeline for creating datasets at arbitrary sample rates (with f0 CREPE detection still at 16kHz). We're working now on hammering out some details of training configs for higher sample rates (48kHz), and will add some details and configs to the colab notebooks when we get that figured out.
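
One way to picture that split: the audio keeps its native rate, and a 16 kHz copy is made only for f0 detection. The sketch below works out the integer up/down factors such a resampler would need (for instance a polyphase resampler) and how many samples the 16 kHz copy would contain. `resample_factors` and `resampled_length` are hypothetical helpers written for this example; they are not part of DDSP or CREPE.

```python
from fractions import Fraction
from typing import Tuple

CREPE_SAMPLE_RATE = 16000  # rate used for f0 detection, as stated in the thread


def resample_factors(src_rate: int, dst_rate: int = CREPE_SAMPLE_RATE) -> Tuple[int, int]:
    """Return (up, down) in lowest terms so that dst_rate == src_rate * up / down."""
    ratio = Fraction(dst_rate, src_rate)
    return ratio.numerator, ratio.denominator


def resampled_length(n_samples: int, src_rate: int, dst_rate: int = CREPE_SAMPLE_RATE) -> int:
    """Sample count after resampling, rounded up (integer ceiling division)."""
    up, down = resample_factors(src_rate, dst_rate)
    return (n_samples * up + down - 1) // down


if __name__ == "__main__":
    for rate in (44100, 48000):
        print(rate, resample_factors(rate), resampled_length(4 * rate, rate))
```

At 48 kHz the ratio is a clean 1/3, while 44.1 kHz needs 160/441, so clip lengths that are not a multiple of 441 samples do not map to a whole number of 16 kHz samples and have to be rounded or padded.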

Forevian commented 4 years ago

Just wanted to add that I have extensively tested 48 kHz training on the test branch that is waiting to be merged, and it works well (with some hyper-parameter tuning).

jesseengel commented 4 years ago

That's great! Yah, sorry for all the delays. There's been some COVID-related bureaucracy slowing down our efforts in that direction, so it's cool to hear that it's working for you.

Do you have an example gin config / example you could share of it working? It could be helpful for us and others I think.

In terms of the branch PR, @lamtharnhantrakul is back on the case just now actually. The old branch (#57) had gotten pretty stale so he's splitting it up into two PRs, the first of which is now (#102). So hopefully we should have the code in master soon.

Forevian commented 4 years ago

Sorry for the slow answer. I don't have an example to show at this time; I am working on a different problem from what your demo does, mostly percussive sound resynthesis with plenty of inharmonicity. I am trying an approach that generates a lot of harmonically unrelated, tunable sine components plus noise to reproduce single-shot acoustic samples. I will let you know if I have anything cool to show.

jesseengel commented 4 years ago

Okay, great, no worries. For what it's worth, I've also been developing sinusoidal + noise models (still focusing on harmonic-ish type instruments) but for self-supervised transcription.

I think we're going to do a code refactor to expose a lot of that internal code in the next week or two, so feel free to take a look :).

jesseengel commented 4 years ago

Just FYI, all the sample_rate agnostic preprocessing code should now be in (you can check if it works for you), but we don't have a working 44kHz model up as a demo yet.

samuel-clarke commented 4 years ago

It seems like the assumption of working with a 16kHz signal is still inextricably baked into this code in some places. A couple examples I've noticed:

MfccTimeDistributedRnnEncoder.z_time_steps is constrained to be chosen from a set of values that reflect the assumption that the input signal will be a 4 second clip with 16kHz sample rate.
spectral_ops.compute_mel(), a backbone to the other foundational functions in spectral_ops.py, is hardcoded to compute tf.signal.linear_to_mel_weight_matrix() with a 16kHz sample rate, putting the Nyquist frequency well below the upper bound of human hearing.
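
To make both points concrete, here is a minimal sketch. It assumes frames are taken with 50% overlap and the signal is padded at the end, so the frame count is ceil(n_samples / hop). The FFT sizes are illustrative, and neither function is DDSP's actual implementation.

```python
import math


def n_frames(n_samples: int, fft_size: int, overlap: float = 0.5) -> int:
    """Frame count for an end-padded STFT with the given overlap (assumed scheme)."""
    hop = int(fft_size * (1.0 - overlap))
    return math.ceil(n_samples / hop)


def nyquist_hz(sample_rate: int) -> float:
    """Highest frequency representable at this sample rate."""
    return sample_rate / 2.0


# Compare a 4 second clip at three sample rates.
for sample_rate in (16000, 44100, 48000):
    n_samples = 4 * sample_rate
    counts = [n_frames(n_samples, fft) for fft in (2048, 1024, 512, 256, 128)]
    print(sample_rate, nyquist_hz(sample_rate), counts)
```

Under these assumptions a 4 second clip at 16 kHz yields 63, 125, 250, 500 and 1000 frames, while the same FFT sizes at 44.1 kHz yield 173, 345, 690, 1379 and 2757, so a fixed list of allowed time-step counts only fits one input length. The Nyquist values (8 kHz versus 22.05 kHz or 24 kHz) show how much of the spectrum a mel matrix built for 16 kHz would leave out.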

Please correct me if I'm wrong on these examples, since I'm still very much in the learning process. I'll put more examples here if I find them. And thank you for how helpful you've been @jesseengel

jesseengel commented 4 years ago

Good catches! Yah, we definitely haven't explored training many model configs at different rates yet. These seem like pretty straightforward fixes; we'll try to get to them when we can.

voodoohop commented 4 years ago

I just noticed that compute_loudness in spectral_ops.py outputs significantly lower loudness values when the sample rate is 48kHz. I did not have time to figure out what is causing this, but increasing the FFT size didn't seem to help much.

PratikStar commented 1 year ago

@voodoohop I found the same problem: the loudness is too low for 44.1kHz audio! I am not sure about the status of the code for higher sample rates, but I am trying to train on a custom guitar dataset at 44.1kHz and the results are quite poor!

Did you get a solution to this?

PratikStar commented 1 year ago

@jesseengel In my case, I am training the model on a custom monophonic guitar dataset (44.1kHz) to learn the timbre embeddings. I tried frame_rate=210 and frame_rate=252 and got poor results, so I am going to try training with higher frame rates!

But I am not sure if the root cause of the problem is in the low frame rate I used or in other model hyperparameters like fft size, #harmonics, etc.

Below are my commands.

Data prep:

    ddsp_prepare_tfrecord \
      --input_audio_filepatterns='~/buckets/pratik-ddsp-data/monophonic/*wav' \
      --output_tfrecord_path=~/tfrecord_441sr_700fr/train.tfrecord \
      --chunk_secs=0.0 \
      --num_shards=10 \
      --frame_rate=700 \
      --sample_rate=44100 \
      --alsologtostderr

Training:

    ddsp_run \
      --mode=train \
      --gin_file=~/ddsp/ddsp/training/gin/models/ae_mfccRnnEncoder_last.gin \
      --gin_file=~/ddsp/ddsp/training/gin/datasets/tfrecord.gin \
      --gin_file=~/ddsp/ddsp/training/gin/eval/basic_f0_ld.gin \
      --gin_param="TFRecordProvider.file_pattern='~/tfrecord_441sr_252fr/train.tfrecord*'" \
      --gin_param="batch_size=16" \
      --alsologtostderr \
      --gin_param="TFRecordProvider.sample_rate=44100" \
      --gin_param="Harmonic.sample_rate=44100" \
      --gin_param="FilteredNoise.n_samples=176400" \
      --gin_param="Harmonic.n_samples=176400" \
      --gin_param="Reverb.reverb_length=176400" \
      --gin_param='F0LoudnessPreprocessor.time_steps=2800' \
      --gin_param='F0LoudnessPreprocessor.frame_rate=700' \
      --gin_param='F0LoudnessPreprocessor.sample_rate=44100' \
      --gin_param="TFRecordProvider.frame_rate=700"
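
The numeric overrides in that command are linked: n_samples is sample_rate times the clip length, time_steps is frame_rate times the clip length, and the hop size is sample_rate divided by frame_rate. The sketch below derives them from one place so they cannot drift apart. `derive_settings` is a hypothetical helper, the 4 second clip length is inferred from 176400 / 44100, and the requirement that the hop size be an integer is an assumption rather than a documented DDSP rule.

```python
def derive_settings(sample_rate: int, frame_rate: int, clip_secs: float) -> dict:
    """Derive mutually consistent sizes from sample rate, frame rate and clip length."""
    if sample_rate % frame_rate != 0:
        # Assumption: a non-integer hop size would misalign audio and feature frames.
        raise ValueError(
            f"sample_rate {sample_rate} is not a multiple of frame_rate {frame_rate}"
        )
    return {
        "hop_size": sample_rate // frame_rate,
        "n_samples": round(sample_rate * clip_secs),
        "time_steps": round(frame_rate * clip_secs),
    }


if __name__ == "__main__":
    # Frame rates mentioned in the thread, all at 44.1 kHz with 4 second clips.
    for frame_rate in (210, 252, 700):
        print(frame_rate, derive_settings(44100, frame_rate, 4.0))
```

With 44100, 700 and 4.0 this reproduces the values used above (n_samples=176400, time_steps=2800) with a hop of 63 samples. A frame rate such as 250 does not divide 44100 evenly and is rejected by this check.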

Am I missing something?

magenta / ddsp

Idea: sample_rate agnostic demo/tutorial #20