lucidrains / musiclm-pytorch

Implementation of MusicLM, Google's new SOTA model for music generation using attention networks, in Pytorch
MIT License

Exception when attempting to train #9

Closed djqualia closed 1 year ago

djqualia commented 1 year ago

i'm excited to try this out!

i attempted to train, feeding in a MockTextAudioDataset similar to the example on AudioLM's page (that worked with the semantic trainer there), but encountered the following exception: TypeError: 'int' object is not iterable

Full stack trace, in case it helps:

File "train_mulan.py", line 60, in trainer.train() File "<@beartype(musiclm_pytorch.trainer.MuLaNTrainer.train) at 0x7ff0e221f160>", line 30, in train File "/mnt/c/audio-ml-workspace/musiclm/musiclm_pytorch/trainer.py", line 363, in train logs = self.train_step() File "/mnt/c/audio-ml-workspace/musiclm/musiclm_pytorch/trainer.py", line 330, in train_step data_kwargs = self.data_tuple_to_kwargs(next(self.dl_iter)) File "/mnt/c/audio-ml-workspace/musiclm/musiclm_pytorch/trainer.py", line 57, in cycle for data in dl: File "/home/qualia/anaconda3/envs/audiolm/lib/python3.8/site-packages/accelerate/data_loader.py", line 375, in iter current_batch = next(dataloader_iter) File "/home/qualia/anaconda3/envs/audiolm/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in next data = self._next_data() File "/home/qualia/anaconda3/envs/audiolm/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data data = self._dataset_fetcher.fetch(index) # may raise StopIteration File "/home/qualia/anaconda3/envs/audiolm/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch return self.collate_fn(data) File "/mnt/c/audio-ml-workspace/musiclm/musiclm_pytorch/trainer.py", line 146, in inner output = fn(datum) File "/mnt/c/audio-ml-workspace/musiclm/musiclm_pytorch/trainer.py", line 156, in curtail_to_shortest_collate min_len = min(*[datum.shape[0] for datum in data]) TypeError: 'int' object is not iterable

lucidrains commented 1 year ago

@djqualia i'm excited for you to try it! :smile:

is your dataset at any point not returning a tuple, but a single value? could you possibly show me your script?

lucidrains commented 1 year ago
import torch
from musiclm_pytorch import MusicLM, MuLaNTrainer
from musiclm_pytorch import MuLaN, AudioSpectrogramTransformer, TextTransformer, MuLaNEmbedQuantizer

audio_transformer = AudioSpectrogramTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64,
    spec_n_fft = 128,
    spec_win_length = 24,
    spec_aug_stretch_factor = 0.8
)

text_transformer = TextTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64
)

mulan = MuLaN(
    audio_transformer = audio_transformer,
    text_transformer = text_transformer
)

from random import randrange
from torch.utils.data import Dataset

class MockTextAudioDataset(Dataset):
    def __init__(self, length = 100, audio_length = 320 * 32):
        super().__init__()
        self.audio_length = audio_length
        self.len = length

    def __len__(self):
        return self.len

    def __getitem__(self, idx):
        mock_audio = torch.randn(randrange(self.audio_length // 2, self.audio_length))
        mock_text = torch.randint(0, 12, (256,)).long()
        return mock_text, mock_audio

trainer = MuLaNTrainer(
    mulan = mulan,
    dataset = MockTextAudioDataset(),
    batch_size = 4
)

trainer.train()

This seems to run fine for me

djqualia commented 1 year ago

I figured out that the key difference in my setup, the one that triggers this bug, is setting batch_size = 1. With that change you should be able to reproduce it, AFAICT
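
For reference, the failing line from the traceback is min_len = min(*[datum.shape[0] for datum in data]) in curtail_to_shortest_collate: the star unpacks the batch into separate arguments, so a single-sample batch turns into min(320), i.e. min() called on a bare int. A minimal sketch of the failure and one possible fix (just an illustration, not necessarily the fix that landed in the repo):

import torch

data = [torch.randn(320)]  # a batch containing a single waveform

# what the collate did: min(*[...]) unpacks the list, so with one element this
# becomes min(320) and raises TypeError: 'int' object is not iterable
# min_len = min(*[datum.shape[0] for datum in data])

# passing the list itself (no star) works for any batch size, including 1
min_len = min([datum.shape[0] for datum in data])
print(min_len)  # 320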

djqualia commented 1 year ago

Do you think it's expected for the (contrastive?) loss to take both positive and negative values?

spectrogram yielded shape of (65, 26667), but had to be cropped to (64, 26656) to be patchified for transformer
0: loss: -3.088079392910004e-05
0: saving model to /mnt/c/models/audiolm/mulantest
1: loss: -0.008366793394088745
2: loss: -0.00033165328204631805
3: loss: 0.0015411581844091415
4: loss: -0.0021463483572006226
5: loss: 0.003450023476034403
6: loss: -0.001429535448551178
7: loss: -0.001284077763557434
8: loss: 0.01317517552524805
9: loss: 0.07162240147590637
10: loss: -0.0007345117628574371
11: loss: 0.00026188790798187256
12: loss: 0.016698703169822693
13: loss: 0.0058104507625103
14: loss: 0.00037843361496925354
15: loss: -0.00018790364265441895
16: loss: -0.0001080445945262909
17: loss: -0.001908978447318077
18: loss: -0.00023999251425266266
19: loss: 0.03030853345990181
20: loss: -0.00021585077047348022
21: loss: 0.0001592119224369526
22: loss: -0.00013920455239713192
23: loss: -0.0021669212728738785
24: loss: 0.00395401194691658
25: loss: -5.50001859664917e-05
26: loss: -0.0026106592267751694
27: loss: -0.0008263345807790756
28: loss: 0.0012336960062384605
29: loss: 5.2521005272865295e-05
30: loss: -0.005257192999124527

lucidrains commented 1 year ago

@djqualia hey! fixed the issue; it had to do with a batch size of 1, as you figured out

so in contrastive learning, one is forcing the network to play a game of matching up pairs across the two modalities (text and audio in this case), so a batch size of 1 means there is no game to play, since there are no negative pairs to discriminate against

for the negative numbers, i believe it is a result of the paper going with decoupled contrastive learning. it is my first time seeing this technique used in the wild, and i believe the loss can legitimately go negative with it

you can try turning it off and it should be all positive values
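
To make the sign behaviour concrete, here is a rough sketch of the two losses (not the exact implementation in this repo, and only the audio-to-text direction, with DCL's weighting function omitted). Standard InfoNCE keeps the positive pair in the denominator, so the loss is bounded below by zero; decoupled contrastive learning drops the positive from the denominator, so the log-sum-exp over the negatives alone can fall below the positive logit and the loss can go negative. sim is assumed to be a (batch, batch) matrix of audio-text similarities with matched pairs on the diagonal.

import torch
import torch.nn.functional as F

def info_nce(sim, temperature = 0.1):
    # standard contrastive loss: the positive logit stays in the denominator, so loss >= 0
    sim = sim / temperature
    labels = torch.arange(sim.shape[0])
    return F.cross_entropy(sim, labels)

def decoupled_contrastive(sim, temperature = 0.1):
    # decoupled contrastive learning: the positive pair is removed from the
    # denominator, so logsumexp over the negatives alone can be smaller than
    # the positive logit and the loss can dip below zero
    sim = sim / temperature
    pos = sim.diagonal()
    diag_mask = torch.eye(sim.shape[0], dtype = torch.bool)
    neg = sim.masked_fill(diag_mask, float('-inf'))
    return (torch.logsumexp(neg, dim = -1) - pos).mean()

sim = torch.randn(4, 4)  # toy similarity matrix for a batch of 4 text-audio pairs
print(info_nce(sim), decoupled_contrastive(sim))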

lucidrains commented 1 year ago

@djqualia so realistically, MuLaN won't be trained with the code in this repository

we should rely on making open-clip audio compatible

recently they managed to get CoCa working, and so we can easily reach SOTA for audio clip training (think audio CLIP + CoCa + some other features in that repository), and also get a great audio captioner to boot

ukemamaster commented 1 year ago

@lucidrains So how should we train the MuLaN model? Is it OK if we use batch_size > 1?

lucidrains commented 1 year ago

@ukemamaster yup, you need very high batch sizes, like 128-256

this is why the job is better suited for a group like open-clip

that group, affiliated with Laion, has produced many SOTA open-source CLIP models by now

lucidrains commented 1 year ago

you can try tinkering with it on a small scale though

ukemamaster commented 1 year ago

@lucidrains OK. And which dataset did you use to train the MuLaN and the AudioLM transformers?

lucidrains commented 1 year ago

@ukemamaster they are not trained at all yet, outside of google

if you are interested in having something trained, i would recommend joining Laion and getting in touch with Marianne. She is working on amassing a dataset

lucidrains commented 1 year ago

@ukemamaster i will put my back into getting MuLaN integrated into open-clip next week

ukemamaster commented 1 year ago

@lucidrains Yes. I am very interested in training and reproducing google's results. I will be waiting for the MuLaN integration. Thanks

djqualia commented 1 year ago

thanks for the education @lucidrains !

fwiw i intend to tinker with this mulan implementation (with as large a batch size as possible) to see where i get, until another option is available :-)

to make sure i understand, when you talk about integration with open-clip, are you thinking of using open-clip as an implementation for parts of mulan here, or thinking of open-clip as an organization that has the means to create a publicly available model...?

epinnock commented 1 year ago

Hi, just wanted to ask: is it possible to integrate the pretrained model from this paper instead of MuLaN? https://github.com/seungheondoh/music-text-representation/ @lucidrains

ukemamaster commented 1 year ago

@djqualia Which dataset are you using to train MuLaN? And the AudioLM transformers?

ukemamaster commented 1 year ago

@djqualia @lucidrains It seems like MuLaN needs a dataset containing audio and text in pairs, like:

sample_1 = [music_audio_1, text_1]
sample_2 = [music_audio_2, text_2]
.
.
.

But SoundStream and the 3 transformers for AudioLM can be trained using audio only.

So my question is:

Is it OK to train MuLaN on a music dataset, and SoundStream plus the 3 transformers on a speech (non-music) dataset, while still conditioning them through the MuLaNEmbedQuantizer built from the music-trained MuLaN?

Or do they all need to be trained on the same music dataset?
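
(For context on how that conditioning is wired: roughly following this repo's README, a trained MuLaN is wrapped in a MuLaNEmbedQuantizer, which then produces per-namespace conditioning embeddings for the three AudioLM transformers. The sketch below is written from memory, so argument names like conditioning_dims and namespaces are assumptions and may differ from the current code.)

# a sketch, assuming the MuLaNEmbedQuantizer interface as recalled from the README;
# 'mulan', 'torch' and the imports are the ones from the training script earlier in this thread
quantizer = MuLaNEmbedQuantizer(
    mulan = mulan,                           # the (music-)trained MuLaN
    conditioning_dims = (1024, 1024, 1024),  # assumed model dims of the semantic / coarse / fine transformers
    namespaces = ('semantic', 'coarse', 'fine')
)

# conditioning embeddings for, say, the semantic transformer are then derived from raw audio
wavs = torch.randn(2, 1024)
conds = quantizer(wavs = wavs, namespace = 'semantic')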

epinnock commented 1 year ago

The original paper said they used pretrained models, which I believe were all trained on different datasets

ukemamaster commented 1 year ago

@epinnock The original paper says they used pretrained weights only for the MuLaN model. The rest (SoundStream, w2v-BERT, and the 3 Transformers) were trained using music data.

[screenshot from the paper]

However, the datasets for MuLaN (according to their original paper) and for the MusicLM transformers are not public.

epinnock commented 1 year ago

Thanks for updating this @ukemamaster. Are there currently pretrained implementations of SoundStream and w2v-BERT? Also, I saw this paper recently that implements music generation using MuLaN and diffusion: https://google-research.github.io/noise2music/noise2music.pdf. This is fairly outside my area of expertise. Also, instead of MuLaN, could you use this? https://github.com/seungheondoh/music-text-representation/ @lucidrains @ukemamaster

lucidrains commented 1 year ago

> thanks for the education @lucidrains !
>
> fwiw i intend to tinker with this mulan implementation (with as large a batch size as possible) to see where i get, until another option is available :-)
>
> to make sure i understand, when you talk about integration with open-clip, are you thinking of using open-clip as an implementation for parts of mulan here, or thinking of open-clip as an organization that has the means to create a publicly available model...?

yup both! we can train there, and i can build wrappers with common interfaces that support their pretrained models, as i've done for dalle2

their group, Laion, is a very legit crowd! many many successes by now

actually, i don't think open-clip is under Laion, just that they use the Laion dataset, and have a lot of infra support from Laion + Stability

lucidrains commented 1 year ago

> Hi, just wanted to ask: is it possible to integrate the pretrained model from this paper instead of MuLaN? https://github.com/seungheondoh/music-text-representation/ @lucidrains

i'm not sure, is it any good? honestly, i don't think there is a great open-source foundation model for text-audio or text-music yet, or i would have heard about it by now

lucidrains commented 1 year ago

> @epinnock The original paper says they used pretrained weights only for the MuLaN model. The rest (SoundStream, w2v-BERT, and the 3 Transformers) were trained using music data.
>
> [screenshot from the paper]
>
> However, the datasets for MuLaN (according to their original paper) and for the MusicLM transformers are not public.

yeah, you all should join Laion and start thinking about data in a collaborative manner

realistically, no one has had the level of success of Laion. i mean, they even won an outstanding paper award at the last NeurIPS. it would be wise to simply join their group at this point

lucidrains commented 1 year ago

ok, i'm going to close this issue, as it has been addressed

ukemamaster commented 1 year ago

@djqualia Have you tried training MuLaN with some real data? If yes, which data? And how do you convert the text into tokens to feed into the model?

djqualia commented 1 year ago

fwiw i have not yet gotten around to training mulan. i have limited hardware and am still trying to train soundstream/hubert first (while putting together a larger dataset). for music sources, consider FMA (Free Music Archive), Jamendo, and AudioSet.
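
(On the tokenization question above: the MuLaNTrainer example earlier just feeds the model a LongTensor of token ids per caption, so any tokenizer that maps text to fixed-length integer ids should work. Below is a minimal, hedged sketch using plain byte-level ids; it is only an illustration, not the tokenizer from the paper, and it assumes the TextTransformer is constructed with a vocabulary of at least 256 tokens.)

import torch

def tokenize_caption(text, max_length = 256, pad_id = 0):
    # byte-level tokenization: each UTF-8 byte becomes a token id in 0-255,
    # truncated / zero-padded to a fixed length, matching the shape of the
    # mock text tensor in the MockTextAudioDataset above
    ids = list(text.encode('utf-8'))[:max_length]
    ids = ids + [pad_id] * (max_length - len(ids))
    return torch.tensor(ids).long()

caption = 'a calm piano piece with soft strings'
print(tokenize_caption(caption).shape)  # torch.Size([256])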