ibab / tensorflow-wavenet

A TensorFlow implementation of DeepMind's WaveNet paper
MIT License

Sample RNN #192

Open nakosung opened 7 years ago

nakosung commented 7 years ago

SampleRNN has been published; its tiered RNN seems to outperform conv stacks.

http://www.gitxiv.com/posts/kQhs4N6rRohXDcp89/samplernn-an-unconditional-end-to-end-neural-audio https://github.com/soroushmehr/sampleRNN_ICLR2017


tszumowski commented 7 years ago

@nakosung This is a very interesting alternative to WaveNet, and they published their code. Did you (or anyone else here) try this out for comparison?

nakosung commented 7 years ago

@tszumowski I haven't tested it yet. ;)

vonpost commented 7 years ago

@tszumowski I tried to get it running but I encountered too many bugs or inconsistencies in the code to actually get it working without hacking the whole thing.

If anyone was successful in starting a training session using that code I'd be very interested to see their solution.

richardassar commented 7 years ago

I've submitted a pull request which makes downloading and generating the MUSIC dataset pain-free, other than that no modification to the code was required.

The output occasionally becomes unstable but I've managed to generate long samples which remain coherent.

https://soundcloud.com/psylent-v/samplernn-sample-e33-i246301-t1800-tr10121-v10330-2

nakosung commented 7 years ago

SampleRNN seems to perform as well as WaveNet.

https://soundcloud.com/nako-sung/sample_e6_i54546_t72-00_tr0

weixsong commented 7 years ago

@richardassar, your generated music sounds very good. What piano corpus did you use to train the model? Could you share it with me?

richardassar commented 7 years ago

Hi, the piano corpus is from archive.org

If you clone the SampleRNN repo and run the download script it will gather the corpus for you.

https://github.com/soroushmehr/sampleRNN_ICLR2017/blob/master/datasets/music/download_archive_preprocess.sh

I've also trained a model on some other music, by the band Tangerine Dream; maybe it can be called "Deep Tangerine Dream" :) I'll upload that when I have decided on the best output sample.

If you decide to train using your own corpus be sure to compute your own mean/variance normalisation stats.
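
For reference, a rough sketch of computing global mean/std stats over your own corpus; this is not the SampleRNN script itself, and the glob pattern and `.npy` file layout are assumptions:

```python
# Hypothetical sketch: compute global mean/std over a directory of .npy
# audio chunks. File names and format are assumptions, not necessarily
# what SampleRNN's dataset scripts produce.
import glob
import numpy as np

def corpus_stats(pattern="my_corpus/*.npy"):
    total, total_sq, count = 0.0, 0.0, 0
    for path in glob.glob(pattern):
        x = np.load(path).astype(np.float64)
        total += x.sum()
        total_sq += np.square(x).sum()
        count += x.size
    mean = total / count
    std = np.sqrt(total_sq / count - mean ** 2)
    return mean, std

# mean, std = corpus_stats()
# normalized = (audio - mean) / std
```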

weixsong commented 7 years ago

@richardassar , thanks very much.

weixsong commented 7 years ago

Hi @richardassar, I don't quite understand "compute your own mean/variance normalisation stats". In machine learning we usually do feature normalization for better model convergence. In this WaveNet experiment, do I need to normalize the music audio myself? Right now I'm using the piano music from SampleRNN.

richardassar commented 7 years ago

https://github.com/soroushmehr/sampleRNN_ICLR2017/blob/master/datasets/dataset.py#L28


devinplatt commented 7 years ago

Hey @richardassar and @weixsong. From my reading of the SampleRNN paper (I haven't tried their code just yet), the normalization of inputs is only applied for the GMM-based models, which weren't the best-performing models anyway. You can see in that same file (dataset.py) that the normalization is only applied if real_valued==True (False is the default), so I don't think computing your own stats is necessary unless for some reason you want to use the real-valued input models.

richardassar commented 7 years ago

Ah yes, that's correct.

https://github.com/soroushmehr/sampleRNN_ICLR2017/blob/d1d77d7429fbff3b7af4b8c36b0dd395a2b0092d/datasets/dataset.py#L127

Thanks.

nakosung commented 7 years ago

In general, audio clips should be normalized by their RMS value instead of their peak value.

richardassar commented 7 years ago

The signal needs to be bounded because we model it as a conditional distribution of quantised amplitudes.

They should apply DC removal when using the mu-law nonlinearity; it's not required for the linear mode.
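
To make that concrete, here's a minimal numpy sketch of this kind of preprocessing (DC removal, RMS-based level normalization, then 8-bit mu-law quantization). It illustrates the idea only and isn't this repo's exact code:

```python
# Minimal sketch of the preprocessing discussed above: DC removal,
# RMS-based level normalization, then 8-bit mu-law quantization.
import numpy as np

def preprocess(audio, target_rms=0.1, mu=255):
    audio = audio - audio.mean()                 # DC removal
    rms = np.sqrt(np.mean(audio ** 2))
    audio = audio * (target_rms / (rms + 1e-8))  # normalize by RMS, not peak
    audio = np.clip(audio, -1.0, 1.0)            # mu-law assumes a bounded signal
    # mu-law companding, then quantize to 256 discrete amplitude bins
    companded = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    return ((companded + 1) / 2 * mu + 0.5).astype(np.int32)
```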

abhilashi commented 7 years ago

@nakosung Thanks for the post. Could you please share metrics on SampleRNN performance? Time to generate? On what hardware?

nakosung commented 7 years ago

@abhilashi Trained and generated on a Titan XP. Generation speed is 0.5 s per 1 s clip, which is super fast compared to WaveNet.

abhilashi commented 7 years ago

@nakosung Thanks much for the info 👍 I'm going to run it on Hindi data now!

richardassar commented 7 years ago

It handles multi-instrument music datasets quite nicely.

https://soundcloud.com/psylent-v/samplernn-tangerine-dream-1 https://soundcloud.com/psylent-v/samplernn-tangerine-dream-2

Trained on 32 hours of Tangerine Dream. I have plenty of other nice samples it generated.

abhilashi commented 7 years ago

Three tier trained on an hour of Hindi speech by Narendra Modi: https://soundcloud.com/abhilashi/sample-e23-i5751-t5-60-tr1-13 https://soundcloud.com/abhilashi/sample-e23-i5751-t5-60-tr1-11

dannybtran commented 7 years ago

Could SampleRNN be used for TTS? The paper uses the term "unconditional" which makes me think it cannot?

lemonzi commented 7 years ago

Yes, if you condition it. This was released just today: http://josesotelo.com/speechsynthesis/


Zeta36 commented 7 years ago

And here is the source code: https://github.com/sotelo/parrot, and the paper: https://openreview.net/forum?id=B1VWyySKx

Cortexelus commented 6 years ago

We're using SampleRNN for music. Training: 1-2 days on an NVIDIA V100. Inference: generates 100 four-minute audio clips in 10 minutes.

http://dadabots.com/ http://dadabots.com/nips2017/generating-black-metal-and-math-rock.pdf https://theoutline.com/post/2556/this-frostbitten-black-metal-album-was-created-by-an-artificial-intelligence

devinroth commented 6 years ago

I love that people are working on this kind of stuff. At the moment it sounds very similar to the babbling that WaveNet achieves with voices. I think the way forward will require a significantly larger receptive field, of over 1 second, to achieve more musical output. The audio is impressive nonetheless.

Devin


richardassar commented 6 years ago

I have some interesting related projects in the works.

In the meantime, check out https://github.com/richardassar/SampleRNN_torch

An interesting extension, and something to try with WaveNet also, is to replace the softmax output layer with a Logistic Mixture Model.
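
For anyone curious, here's a rough numpy sketch of a discretized logistic mixture negative log-likelihood (in the spirit of PixelCNN++) that could stand in for the 256-way softmax. Shapes and names are illustrative, and edge-bin handling is omitted:

```python
# Sketch of a discretized logistic mixture NLL over quantised amplitudes.
# x: samples scaled to [-1, 1], shape (N,); mixture params each (N, K).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dlm_nll(x, logit_pi, mu, log_s, num_bins=256):
    x = x[:, None]
    half_bin = 1.0 / (num_bins - 1)
    inv_s = np.exp(-log_s)
    # probability mass each logistic component assigns to the amplitude bin
    cdf_plus = sigmoid((x + half_bin - mu) * inv_s)
    cdf_minus = sigmoid((x - half_bin - mu) * inv_s)
    prob = np.clip(cdf_plus - cdf_minus, 1e-12, 1.0)
    # mixture weights via log-softmax, then log-sum-exp over components
    log_pi = logit_pi - np.log(np.sum(np.exp(logit_pi), axis=1, keepdims=True))
    log_probs = log_pi + np.log(prob)
    m = log_probs.max(axis=1, keepdims=True)
    return -np.mean(m[:, 0] + np.log(np.sum(np.exp(log_probs - m), axis=1)))
```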


Cortexelus commented 6 years ago

@devinroth re: more musical output

Thanks!

I could be wrong, but IMO decreasing the WaveNet receptive field is the answer to more musical output.

The ablation studies in the Tacotron 2 paper showed that 10.5-21 ms is a good receptive field size if the WaveNet conditions on a high-level representation.

WaveNet is great at the low level. It makes audio sound natural, and its MOS scores are almost realistic. Keep it there at the 20 ms level. Condition it. Dedicate high-level structure to MIDI nets, symbolic music nets, or intermediate representations. Or go top-down with progressively upsampled mel spectrograms. Do both bottom-up and top-down.
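
For a sense of scale, here's a quick sketch of how a dilated conv stack's receptive field maps to milliseconds. The dilation schedule and sample rate here are just assumptions for illustration, not the Tacotron 2 configuration:

```python
# Receptive field of a dilated causal conv stack, in samples and ms.
def receptive_field(dilations, filter_width=2):
    return (filter_width - 1) * sum(dilations) + 1

dilations = [2 ** i for i in range(9)]   # one stack of dilations 1..256 (assumed)
rf = receptive_field(dilations)
print(rf, "samples =", 1000.0 * rf / 24000, "ms at 24 kHz")  # 512 samples ~ 21 ms
```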

These unconditioned predict-the-next-sample models only learn bottom-up as they train: first noise, then texture, then individual hits and notes, then phrases and riffs, then rhythm if you're lucky. The last thing they would learn is song composition or music theory. They struggle to see the forest for the trees.

Even SampleRNN, whose receptive field is "sorta unbounded", runs into this problem of learning composition. (LSTMs are able to hold onto memory for long periods of time, but because of truncated backpropagation through time (TBPTT) they are limited in learning when to recall/forget.)

Food for thought: A dataset of mp3s has 100 million examples of low-level structure, but only dozens of examples of song-level structure.

Better to learn English from a text corpus than a speech corpus, no?

Cortexelus commented 6 years ago

Logistic mixtures are a good idea, because a pure sample-wise categorical loss is like trying to compare images pixel-wise using one-hot encodings for RGB.

devinroth commented 6 years ago

I haven't read the Tacotron 2 paper yet. What you say makes sense.

That being said, there is no reason not to work on this stuff in parallel. We already have good methods of synthesizing sound (although not at the potential that WaveNet has to offer) that can be used for building NNs at a higher level. Back to what you said, the problem is lack of data at the higher levels. We have hundreds of thousands of hours of audio available, but performance data from musicians is greatly lacking.

My goal is to create an NN musician that can read music, perform, and create audio from a given piece of music. Probably the hardest part is creating a good dataset of musician performance data. I'm starting with violin. Maybe record music students practicing while monitoring the bow velocity/pressure/bridge distance, then train an NN on the audio conditioned on the violin performance data. Training an NN to interpret notes on a page is a completely different problem, and I haven't even begun with music theory/composition. I'm not entirely convinced that NNs will be able to create high-quality compositions at the calibre of the top professionals, but I will be happy to be proven wrong.
