Closed. SvenShade closed this issue 5 years ago.
Just to be sure: you have 16kHz sampled data for training and want to generate 16kHz audio from 16kHz mel-spectrogram predictions and a 4kHz guide waveform, right? If so, I think it's possible. Just use upsampling layer(s) for the guide waveform, the same as the upsampling layers for the mel-spectrograms.
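For what it's worth, a minimal sketch of that idea could look like the following. This is illustrative code, not taken from this repository: the GuideUpsample name, the layer sizes, and the 80 mel channels are all assumptions made for the example.

```python
import torch
import torch.nn as nn

class GuideUpsample(nn.Module):
    """Upsample a 4 kHz guide waveform by 4x so it can be concatenated with
    the upsampled mel features as local conditioning for a 16 kHz WaveNet."""
    def __init__(self, scale=4):
        super().__init__()
        # Learned transposed convolution; stride equals the upsampling factor.
        self.upsample = nn.ConvTranspose1d(1, 1, kernel_size=scale * 2,
                                           stride=scale, padding=scale // 2)

    def forward(self, guide):
        # guide: (batch, 1, T_lo) at 4 kHz -> (batch, 1, T_lo * 4) at 16 kHz
        return self.upsample(guide)

# Usage: concatenate with the mel conditioning along the channel axis before
# feeding it to the WaveNet's local-conditioning input.
guide_4k = torch.randn(1, 1, 4000)    # 1 second of 4 kHz audio
mel_up = torch.randn(1, 80, 16000)    # mels already upsampled to the 16 kHz rate
cond = torch.cat([mel_up, GuideUpsample()(guide_4k)], dim=1)  # (1, 81, 16000)
```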
As for test_inputs, it's just for unit tests that check whether forward and incremental_forward give the same result. See https://github.com/r9y9/wavenet_vocoder/blob/2bf9e78fdee5aef16a63747c82691877fa70c413/tests/test_model.py#L162-L177
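For reference, that check is roughly of the following form. This is a generic sketch only: the argument names and tensor layouts below are assumptions, so take the exact call from the linked test rather than from this snippet.

```python
import torch

def check_incremental_consistency(model, x, c=None, atol=1e-4):
    # Generic sketch of the consistency check in the linked test. The exact
    # keyword names and tensor layouts (channel-first vs. time-first) are
    # assumptions here; see tests/test_model.py for the real call.
    model.eval()
    with torch.no_grad():
        y_batch = model(x, c=c)                                 # full-sequence forward pass
        y_step = model.incremental_forward(test_inputs=x, c=c)  # sample-by-sample replay
    assert torch.allclose(y_batch, y_step, atol=atol)
```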
Thanks! Though, I don't want to condition WaveNet on 4kHz waveforms; I'll still condition it only on the spectrograms. But at synthesis time, I'd like to replace every 4th predicted timestep with the ground truth from the 4kHz waveform. That way, WaveNet is spared from predicting 1/4 of the samples at all, and it also guarantees that all the information in the 4kHz waveform is present in the final 16kHz waveform.
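In pseudocode terms, the proposed loop might look like the sketch below. This is not code from this repository: step_fn stands in for one step of WaveNet's incremental decoding, and the function name and arguments are invented for illustration.

```python
import torch

def generate_with_anchors(step_fn, lofi, T, factor=4):
    """Sketch of the proposed semi-teacher-forced generation (illustrative only).
    `step_fn(prev, t)` performs one autoregressive WaveNet step and returns the
    next sample; `lofi` holds the 4 kHz ground-truth samples, which are placed
    at every `factor`-th position of the 16 kHz output."""
    out = torch.zeros(T)
    prev = torch.zeros(())
    for t in range(T):
        if t % factor == 0:
            out[t] = lofi[t // factor]   # keep the ground-truth lo-fi sample
        else:
            out[t] = step_fn(prev, t)    # let WaveNet fill in the missing sample
        prev = out[t]                    # the next step is conditioned on this value
    return out
```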
Umm, I'm not sure what the motivation is. Could you elaborate? And which do you mean by a 4kHz waveform?
```
In [1]: from scipy.io import wavfile
In [2]: import pysptk
In [3]: fs, x = wavfile.read(pysptk.util.example_audio_file())
In [4]: print(fs, len(x))
16000 64000
In [5]: len(x[::4])
Out[5]: 16000
```
or
```
In [6]: import resampy
In [9]: len(resampy.resample(x, fs, fs//4))
Out[9]: 16000
```
If you are talking about the first one (plain decimation, x[::4]), I think it should work. It sounds like semi-teacher-forced generation.
That's right! I guess it could be thought of as teacher-forced synthesis, so that the original lo-fi signal can be part of the final one, unchanged. The motivation is audio super-resolution. I've trained WaveNet to take 16kHz spectrograms and produce 16kHz audio, but I'd like to use the original audio as the basis of the generated audio. Since 16kHz audio contains 4x as many samples as 4kHz audio, I want every 4th sample to be ground truth.
I think I might have it working. Thanks for your help!
@SvenShade I'm interested in something like this too. Could you share some more details about what you did?
Sure. I fed in the lo-fi audio using the test_inputs argument mentioned earlier. I tried predicting every 4th sample from the corresponding sample in test_inputs (instead of from the last predicted sample), as well as actually replacing the generated sample with the one from test_inputs. This sort of teacher-forced generation didn't work well in either case. So I ended up resampling the lo-fi audio to 16kHz, and before each generated sample was added to the list of outputs in incremental_forward, I averaged it with the corresponding sample from the lo-fi audio. That seemed to help guide generation, but it's a naive approach. I'll keep working on it to see if there are better ways of using the original audio.
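For anyone curious, the blending step described above could look roughly like this. It's an illustrative sketch, not the actual modification to incremental_forward; step_fn, the alpha weight, and the function name are all placeholders.

```python
import numpy as np
import resampy

def guided_generation(step_fn, lofi_4k, fs_lo=4000, fs_hi=16000, alpha=0.5):
    """Illustrative sketch of averaging each generated sample with the lo-fi
    guide (not the actual change made inside incremental_forward).
    `step_fn(prev, t)` stands in for one incremental WaveNet step."""
    guide = resampy.resample(np.asarray(lofi_4k, dtype=np.float64), fs_lo, fs_hi)
    out = np.zeros(len(guide))
    prev = 0.0
    for t in range(len(guide)):
        pred = step_fn(prev, t)
        out[t] = alpha * pred + (1.0 - alpha) * guide[t]  # blend prediction with guide
        prev = out[t]                                     # feed the blended sample back
    return out
```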
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi @r9y9, brilliant work with your implementation!
Could you tell me whether I can pass in lo-fi audio at synthesis time to help guide generation? I have audio at a sample rate of 4kHz, along with mel-spectrogram predictions of that audio upsampled to 16kHz. I want to decode these spectrograms with WaveNet. Can I pass in the original lo-fi audio so that 1 in 4 synthesized samples is ground truth, and the other 3 in 4 are 'filled in'?
I see that test_inputs is defined as an argument of incremental_forward in wavenet.py. Could this argument be adapted for such a purpose?
Thanks very much!