lucidrains / audiolm-pytorch

Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in Pytorch
MIT License

Adapting AudioLM to support SingSong style accompaniment generation #86

Open smcio opened 1 year ago

smcio commented 1 year ago

Hi @lucidrains - thanks for your awesome work here. Great stuff as always.

I recently came across Google's new SingSong paper (https://arxiv.org/pdf/2301.12662.pdf), in which they adapt AudioLM for generation of instrumental accompaniments conditioned upon sung input vocals, and I was wondering if you (or anyone else 🙂 ) might have any practical advice on implementing the adaptations necessary.
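One way to read the SingSong adaptation (a hedged sketch only — the names and token values below are made up for illustration, not from the paper or this repo): the vocal tokens act as a conditioning prefix placed before the accompaniment tokens in one sequence, so an AudioLM-style decoder can be trained autoregressively on the concatenation.

```python
# Illustrative sketch of SingSong-style sequence construction:
# vocal tokens are a conditioning prefix, accompaniment tokens are the
# generation target. SEP and all token values are placeholders.
SEP = -1  # hypothetical separator id

def build_training_sequence(vocal_tokens, accompaniment_tokens):
    # the model trains autoregressively on the full sequence; the loss
    # would typically be computed only on the accompaniment portion
    return vocal_tokens + [SEP] + accompaniment_tokens

seq = build_training_sequence([11, 12, 13], [21, 22])
assert seq == [11, 12, 13, SEP, 21, 22]
```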

Also, to this end, would you happen to know if anyone has managed to train a decent soundstream model and made it publicly available yet?

Best, and thanks again for your work here, Shaun

lucidrains commented 1 year ago

Yea I can take care of the paper

Also, to this end, would you happen to know if anyone has managed to train a decent soundstream model and made it publicly available yet?

Not yet, but I reckon we will, given my sources :smile:

lucidrains commented 1 year ago

yea, both this and spear-tts may be too complicated to fit in this repository

i think many audio researchers are forking the audiolm repository within google and extending it to their own work, due to its success

Liujingxiu23 commented 1 year ago

@lucidrains I am also interested in SingSong, and am now preparing to train a FineTransformer model first. I am wondering: 1. How much data is needed to train the FineTransformer model? 2. How many steps are needed? 3. Is the FineTransformerTrainer code available for training? I revised part of the code, mostly the interface and dataloader parts, to adapt it to my own data. I have just started training the model, but I do not know whether this code is suitable for SingSong or MusicLM.

lucidrains commented 1 year ago

@Liujingxiu23 ah, you can't just skip to the fine transformer. this work resembles a Matryoshka doll. you will need to train soundstream and 2 transformers successfully before even arriving at the fine transformer, as well as all the extra singsong networks.
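The staged ordering above can be sketched as a dependency chain — a minimal stdlib sketch, not this repo's API. Stage names mirror classes in audiolm-pytorch; note the semantic stage tokenizes with a pretrained w2v-BERT/HuBERT model (`HubertWithKmeans` here) rather than SoundStream codes, and the exact dependency edges below are a simplification.

```python
# Dependency chain of the AudioLM training pipeline, as described above.
# Each stage maps to the set of stages that must exist before it can be
# trained; the fine transformer is the last link in the chain.
from graphlib import TopologicalSorter

prerequisites = {
    "SoundStream": set(),                 # neural codec, trained first
    "HubertWithKmeans": set(),            # pretrained semantic tokenizer
    "SemanticTransformer": {"HubertWithKmeans"},
    "CoarseTransformer": {"SemanticTransformer", "SoundStream"},
    "FineTransformer": {"CoarseTransformer", "SoundStream"},
}

training_order = list(TopologicalSorter(prerequisites).static_order())
assert training_order[-1] == "FineTransformer"  # cannot be trained in isolation
```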

lucidrains commented 1 year ago

this is why having open sourced foundation models is so important. no one else but internal google teams is able to carry out this research

Liujingxiu23 commented 1 year ago

@lucidrains I use codes generated by Facebook's Encodec model, 24khz, 16 codebooks, 6 for coarse and 10 for fine, is that ok?
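For what it's worth, the coarse/fine division described above is just a slice along the quantizer axis of the RVQ codes — a minimal sketch with toy values (the 6/10 split point and all code values here are placeholders; the split is configurable):

```python
# Splitting residual-VQ codes into coarse and fine codebook levels:
# with 16 quantizers, the first 6 levels go to the coarse transformer
# and the remaining 10 to the fine transformer.
NUM_QUANTIZERS = 16
NUM_COARSE = 6  # earlier quantizer levels carry most of the signal

# toy codes, shape (timesteps, num_quantizers); values are arbitrary
codes = [[(t * 31 + q) % 1024 for q in range(NUM_QUANTIZERS)] for t in range(5)]

coarse = [frame[:NUM_COARSE] for frame in codes]  # coarse transformer input
fine = [frame[NUM_COARSE:] for frame in codes]    # fine transformer target

assert len(coarse[0]) == 6 and len(fine[0]) == 10
```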

And about the code: I understand the generation process — codes are generated one by one in a LOOP — but I do not understand the training process. Why are the fine codes also fed to the model? I do not see the relevant "transformer-decoder" code. I mean, each fine code should only be able to see its preceding codes and its own coarse code during training. Which part of the code reflects this logic? I am sorry, my previous work did not involve a "transformer-decoder", so I cannot easily tell.
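To the question above: the fine codes are fed in during training because of teacher forcing — a decoder-only transformer trains on all positions in parallel, and the causal attention mask (not an explicit loop) is what guarantees each position only attends to earlier tokens; the coarse codes sit earlier in the sequence as conditioning, so every fine position can see them. A minimal stdlib sketch of that mask, independent of this repo (in PyTorch this is typically a `torch.triu`-style mask passed to attention):

```python
# Teacher forcing with a causal mask: during training the full target
# sequence is fed at once, but query position i may only attend to key
# positions j <= i, so the model still learns next-token prediction
# exactly as if it were generating one code at a time in a loop.
def causal_mask(seq_len):
    # mask[i][j] is True when position i is allowed to attend to position j
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

mask = causal_mask(4)
assert mask[0] == [True, False, False, False]  # first position sees only itself
assert mask[3] == [True, True, True, True]     # last position sees everything
```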

mishav78 commented 2 months ago

did someone code the singsong implementation?

mishav78 commented 2 months ago

can you do it if I pay you Lucidrains?

lucidrains commented 2 months ago

@mishav78 haven't there been better papers since?

mishav78 commented 2 months ago

this is the best. I listened to the Google demos. It works very well.