lucidrains / musiclm-pytorch

Implementation of MusicLM, Google's new SOTA model for music generation using attention networks, in Pytorch
MIT License

Exception when attempting to train #9

Closed djqualia closed 1 year ago

djqualia commented 1 year ago

i'm excited to try this out!

i attempted to train, feeding in a MockTextAudioDataset similar to the example on AudioLM's page (that worked with the semantic trainer there), but encountered the following exception: TypeError: 'int' object is not iterable

Full stack trace, in case it helps:

File "train_mulan.py", line 60, in trainer.train() File "<@beartype(musiclm_pytorch.trainer.MuLaNTrainer.train) at 0x7ff0e221f160>", line 30, in train File "/mnt/c/audio-ml-workspace/musiclm/musiclm_pytorch/trainer.py", line 363, in train logs = self.train_step() File "/mnt/c/audio-ml-workspace/musiclm/musiclm_pytorch/trainer.py", line 330, in train_step data_kwargs = self.data_tuple_to_kwargs(next(self.dl_iter)) File "/mnt/c/audio-ml-workspace/musiclm/musiclm_pytorch/trainer.py", line 57, in cycle for data in dl: File "/home/qualia/anaconda3/envs/audiolm/lib/python3.8/site-packages/accelerate/data_loader.py", line 375, in iter current_batch = next(dataloader_iter) File "/home/qualia/anaconda3/envs/audiolm/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in next data = self._next_data() File "/home/qualia/anaconda3/envs/audiolm/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data data = self._dataset_fetcher.fetch(index) # may raise StopIteration File "/home/qualia/anaconda3/envs/audiolm/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch return self.collate_fn(data) File "/mnt/c/audio-ml-workspace/musiclm/musiclm_pytorch/trainer.py", line 146, in inner output = fn(datum) File "/mnt/c/audio-ml-workspace/musiclm/musiclm_pytorch/trainer.py", line 156, in curtail_to_shortest_collate min_len = min(*[datum.shape[0] for datum in data]) TypeError: 'int' object is not iterable

lucidrains commented 1 year ago

@djqualia i'm excited for you to try it! :smile:

is your dataset at any point not returning a tuple, but a single value? could you possibly show me your script?

lucidrains commented 1 year ago
import torch
from musiclm_pytorch import MusicLM, MuLaNTrainer
from musiclm_pytorch import MuLaN, AudioSpectrogramTransformer, TextTransformer, MuLaNEmbedQuantizer

audio_transformer = AudioSpectrogramTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64,
    spec_n_fft = 128,
    spec_win_length = 24,
    spec_aug_stretch_factor = 0.8
)

text_transformer = TextTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64
)

mulan = MuLaN(
    audio_transformer = audio_transformer,
    text_transformer = text_transformer
)

from random import randrange
from torch.utils.data import Dataset

class MockTextAudioDataset(Dataset):
    def __init__(self, length = 100, audio_length = 320 * 32):
        super().__init__()
        self.audio_length = audio_length
        self.len = length

    def __len__(self):
        return self.len

    def __getitem__(self, idx):
        mock_audio = torch.randn(randrange(self.audio_length // 2, self.audio_length))
        mock_text = torch.randint(0, 12, (256,)).long()
        return mock_text, mock_audio

trainer = MuLaNTrainer(
    mulan = mulan,
    dataset = MockTextAudioDataset(),
    batch_size = 4
)

trainer.train()

This seems to run fine for me

djqualia commented 1 year ago

I figured out that the key difference in my setup, the one that triggers this bug, is setting batch_size = 1. With that change you should be able to reproduce it, AFAICT
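
For reference, the failing line from the traceback is min_len = min(*[datum.shape[0] for datum in data]) in curtail_to_shortest_collate: the star unpacks the batch into separate arguments, so a single-sample batch turns into min(320), i.e. min() called on a bare int. A minimal sketch of the failure and one possible fix (just an illustration, not necessarily the fix that landed in the repo):

import torch

data = [torch.randn(320)]  # a batch containing a single waveform

# what the collate did: min(*[...]) unpacks the list, so with one element this
# becomes min(320) and raises TypeError: 'int' object is not iterable
# min_len = min(*[datum.shape[0] for datum in data])

# passing the list itself (no star) works for any batch size, including 1
min_len = min([datum.shape[0] for datum in data])
print(min_len)  # 320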

djqualia commented 1 year ago

Do you think it's expected for the (contrastive?) loss to take both positive and negative values?

spectrogram yielded shape of (65, 26667), but had to be cropped to (64, 26656) to be patchified for transformer
0: loss: -3.088079392910004e-05
0: saving model to /mnt/c/models/audiolm/mulantest
1: loss: -0.008366793394088745
2: loss: -0.00033165328204631805
3: loss: 0.0015411581844091415
4: loss: -0.0021463483572006226
5: loss: 0.003450023476034403
6: loss: -0.001429535448551178
7: loss: -0.001284077763557434
8: loss: 0.01317517552524805
9: loss: 0.07162240147590637
10: loss: -0.0007345117628574371
11: loss: 0.00026188790798187256
12: loss: 0.016698703169822693
13: loss: 0.0058104507625103
14: loss: 0.00037843361496925354
15: loss: -0.00018790364265441895
16: loss: -0.0001080445945262909
17: loss: -0.001908978447318077
18: loss: -0.00023999251425266266
19: loss: 0.03030853345990181
20: loss: -0.00021585077047348022
21: loss: 0.0001592119224369526
22: loss: -0.00013920455239713192
23: loss: -0.0021669212728738785
24: loss: 0.00395401194691658
25: loss: -5.50001859664917e-05
26: loss: -0.0026106592267751694
27: loss: -0.0008263345807790756
28: loss: 0.0012336960062384605
29: loss: 5.2521005272865295e-05
30: loss: -0.005257192999124527

lucidrains commented 1 year ago

@djqualia hey! fixed the issue; it had to do with a batch size of 1, as you figured out

so in contrastive learning, one is forcing the network to play a game of matching up pairs across the two modalities (text and audio in this case), so a batch size of 1 means there is no game to play, since there are no negative pairs to discriminate against

for the negative numbers, i believe it is a result of the paper going with decoupled contrastive learning. it is my first time seeing this technique used in the wild, and i believe the loss can legitimately go negative with it

you can try turning it off and it should be all positive values
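
To make the sign behaviour concrete, here is a rough sketch of the two losses (not the exact implementation in this repo, and only the audio-to-text direction, with DCL's weighting function omitted). Standard InfoNCE keeps the positive pair in the denominator, so the loss is bounded below by zero; decoupled contrastive learning drops the positive from the denominator, so the log-sum-exp over the negatives alone can fall below the positive logit and the loss can go negative. sim is assumed to be a (batch, batch) matrix of audio-text similarities with matched pairs on the diagonal.

import torch
import torch.nn.functional as F

def info_nce(sim, temperature = 0.1):
    # standard contrastive loss: the positive logit stays in the denominator, so loss >= 0
    sim = sim / temperature
    labels = torch.arange(sim.shape[0])
    return F.cross_entropy(sim, labels)

def decoupled_contrastive(sim, temperature = 0.1):
    # decoupled contrastive learning: the positive pair is removed from the
    # denominator, so logsumexp over the negatives alone can be smaller than
    # the positive logit and the loss can dip below zero
    sim = sim / temperature
    pos = sim.diagonal()
    diag_mask = torch.eye(sim.shape[0], dtype = torch.bool)
    neg = sim.masked_fill(diag_mask, float('-inf'))
    return (torch.logsumexp(neg, dim = -1) - pos).mean()

sim = torch.randn(4, 4)  # toy similarity matrix for a batch of 4 text-audio pairs
print(info_nce(sim), decoupled_contrastive(sim))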

lucidrains commented 1 year ago

@djqualia so realistically, MuLaN won't be trained with the code in this repository

we should rely on making open-clip audio compatible

recently they managed to get CoCa working, and so we can easily reach SOTA for audio clip training (think audio CLIP + CoCa + some other features in that repository), and also get a great audio captioner to boot

ukemamaster commented 1 year ago

@lucidrains So how should we train the MuLaN model? Is it OK if we use batch_size > 1?

lucidrains commented 1 year ago

@ukemamaster yup, you need very high batch sizes, like 128-256

this is why the job is better suited for a group like open-clip

that group, affiliated with Laion, has produced many SOTA open-source CLIP models by now

lucidrains commented 1 year ago

you can try tinkering with it on a small scale though

ukemamaster commented 1 year ago

@lucidrains OK. And which dataset did you use to train the MuLaN and the AudioLM transformers?

lucidrains commented 1 year ago

@ukemamaster they are not trained at all yet, outside of google

if you are interested in having something trained, i would recommend joining Laion and getting in touch with Marianne. She is working on amassing a dataset

lucidrains commented 1 year ago

@ukemamaster i will put my back into getting MuLaN integrated into open-clip next week

ukemamaster commented 1 year ago

@lucidrains Yes. I am very interested in training and reproducing google's results. I will be waiting for the MuLaN integration. Thanks

djqualia commented 1 year ago

thanks for the education @lucidrains !

fwiw i intend to tinker with this mulan implementation (with as large a batch size as possible) to see where i get, until another option is available :-)

to make sure i understand, when you talk about integration with open-clip, are you thinking of using open-clip as an implementation for parts of mulan here, or thinking of open-clip as an organization that has the means to create a publicly available model...?

epinnock commented 1 year ago

Hi, just wanted to ask: is it possible to integrate the pretrained model from this paper instead of MuLaN? https://github.com/seungheondoh/music-text-representation/ @lucidrains

ukemamaster commented 1 year ago

@djqualia Which dataset are you using to train MuLaN? And the AudioLM transformers?

ukemamaster commented 1 year ago

@djqualia @lucidrains It seems like MuLaN needs a dataset containing audio and text in pairs, like:

sample_1 = [music_audio_1, text_1]
sample_2 = [music_audio_2, text_2]
.
.
.

But SoundStream and the 3 transformers for AudioLM can be trained using audio only.

So my question is:

Is it OK to train MuLaN on a music dataset, and SoundStream plus the 3 transformers on a speech (non-music) dataset, while still conditioning them through the MuLaNEmbedQuantizer built from the music-trained MuLaN?

Or do they all need to be trained on the same music dataset?
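
(For context on how that conditioning is wired: roughly following this repo's README, a trained MuLaN is wrapped in a MuLaNEmbedQuantizer, which then produces per-namespace conditioning embeddings for the three AudioLM transformers. The sketch below is written from memory, so argument names like conditioning_dims and namespaces are assumptions and may differ from the current code.)

# a sketch, assuming the MuLaNEmbedQuantizer interface as recalled from the README;
# 'mulan', 'torch' and the imports are the ones from the training script earlier in this thread
quantizer = MuLaNEmbedQuantizer(
    mulan = mulan,                           # the (music-)trained MuLaN
    conditioning_dims = (1024, 1024, 1024),  # assumed model dims of the semantic / coarse / fine transformers
    namespaces = ('semantic', 'coarse', 'fine')
)

# conditioning embeddings for, say, the semantic transformer are then derived from raw audio
wavs = torch.randn(2, 1024)
conds = quantizer(wavs = wavs, namespace = 'semantic')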

epinnock commented 1 year ago

The original paper said they used pretrained models, which I believe were all trained on different datasets

ukemamaster commented 1 year ago

@epinnock The original paper says they used pretrained weights only for the MuLaN model. The rest (SoundStream, w2v-BERT, and the 3 Transformers) were trained using music data.

[screenshot from the paper]

However, the datasets for MuLaN (according to their original paper) and for the MusicLM transformers are not public.

epinnock commented 1 year ago

Thanks for updating this @ukemamaster. Are there currently pretrained implementations of SoundStream and w2v-BERT? Also, I saw this paper recently that implements music generation using MuLaN and diffusion: https://google-research.github.io/noise2music/noise2music.pdf. This is fairly outside my area of expertise. Also, instead of MuLaN, could you use this? https://github.com/seungheondoh/music-text-representation/ @lucidrains @ukemamaster

lucidrains commented 1 year ago

> thanks for the education @lucidrains !
>
> fwiw i intend to tinker with this mulan implementation (with as large a batch size as possible) to see where i get, until another option is available :-)
>
> to make sure i understand, when you talk about integration with open-clip, are you thinking of using open-clip as an implementation for parts of mulan here, or thinking of open-clip as an organization that has the means to create a publicly available model...?

yup both! we can train there, and i can build wrappers with common interfaces that support their pretrained models, as i've done for dalle2

their group, Laion, is a very legit crowd! many many successes by now

actually, i don't think open-clip is under Laion, just that they use the Laion dataset, and have a lot of infra support from Laion + Stability

lucidrains commented 1 year ago

> Hi, just wanted to ask: is it possible to integrate the pretrained model from this paper instead of MuLaN? https://github.com/seungheondoh/music-text-representation/ @lucidrains

i'm not sure, is it any good? honestly, i don't think there is a great open-source foundation model for text-audio or text-music yet, or i would have heard about it by now

lucidrains commented 1 year ago

> @epinnock The original paper says they used pretrained weights only for the MuLaN model. The rest (SoundStream, w2v-BERT, and the 3 Transformers) were trained using music data.
>
> [screenshot from the paper]
>
> However, the datasets for MuLaN (according to their original paper) and for the MusicLM transformers are not public.

yeah, you all should join Laion and start thinking about data in a collaborative manner

realistically, no one has had the level of success of Laion. i mean, they even won an outstanding paper award at the last NeurIPS. it would be wise to simply join their group at this point

lucidrains commented 1 year ago

ok, i'm going to close this issue, as it has been addressed

ukemamaster commented 1 year ago

@djqualia Have you tried training MuLaN with some real data? If yes, which data? And how do you convert the text into tokens to feed into the model?

djqualia commented 1 year ago

fwiw i have not yet gotten around to training mulan. i have limited hardware and am still trying to train soundstream/hubert first (while putting together a larger dataset). for music sources, consider FMA (Free Music Archive), Jamendo, and AudioSet.
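
(On the tokenization question above: the MuLaNTrainer example earlier just feeds the model a LongTensor of token ids per caption, so any tokenizer that maps text to fixed-length integer ids should work. Below is a minimal, hedged sketch using plain byte-level ids; it is only an illustration, not the tokenizer from the paper, and it assumes the TextTransformer is constructed with a vocabulary of at least 256 tokens.)

import torch

def tokenize_caption(text, max_length = 256, pad_id = 0):
    # byte-level tokenization: each UTF-8 byte becomes a token id in 0-255,
    # truncated / zero-padded to a fixed length, matching the shape of the
    # mock text tensor in the MockTextAudioDataset above
    ids = list(text.encode('utf-8'))[:max_length]
    ids = ids + [pad_id] * (max_length - len(ids))
    return torch.tensor(ids).long()

caption = 'a calm piano piece with soft strings'
print(tokenize_caption(caption).shape)  # torch.Size([256])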