@djqualia i'm excited for you to try it! :smile:
is your dataset at any point not returning a tuple, but a single value? could you possibly show me your script?
```python
import torch
from musiclm_pytorch import MusicLM, MuLaNTrainer
from musiclm_pytorch import MuLaN, AudioSpectrogramTransformer, TextTransformer, MuLaNEmbedQuantizer

audio_transformer = AudioSpectrogramTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64,
    spec_n_fft = 128,
    spec_win_length = 24,
    spec_aug_stretch_factor = 0.8
)

text_transformer = TextTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64
)

mulan = MuLaN(
    audio_transformer = audio_transformer,
    text_transformer = text_transformer
)

from random import randrange
from torch.utils.data import Dataset

class MockTextAudioDataset(Dataset):
    def __init__(self, length = 100, audio_length = 320 * 32):
        super().__init__()
        self.audio_length = audio_length
        self.len = length

    def __len__(self):
        return self.len

    def __getitem__(self, idx):
        # mock (caption, audio) pair with a randomly varying audio length
        mock_audio = torch.randn(randrange(self.audio_length // 2, self.audio_length))
        mock_caption = torch.randint(0, 12, (256,)).long()
        return mock_caption, mock_audio

trainer = MuLaNTrainer(
    mulan = mulan,
    dataset = MockTextAudioDataset(),
    batch_size = 4
)

trainer.train()
```
This seems to run fine for me
I figured out that the key difference in my setup, which triggers this bug, is setting batch_size = 1. Then you should be able to reproduce AFAICT
Do you think it's expected for the (contrastive?) loss to be both positive and negative?
```
spectrogram yielded shape of (65, 26667), but had to be cropped to (64, 26656) to be patchified for transformer
0: loss: -3.088079392910004e-05
0: saving model to /mnt/c/models/audiolm/mulantest
1: loss: -0.008366793394088745
2: loss: -0.00033165328204631805
3: loss: 0.0015411581844091415
4: loss: -0.0021463483572006226
5: loss: 0.003450023476034403
6: loss: -0.001429535448551178
7: loss: -0.001284077763557434
8: loss: 0.01317517552524805
9: loss: 0.07162240147590637
10: loss: -0.0007345117628574371
11: loss: 0.00026188790798187256
12: loss: 0.016698703169822693
13: loss: 0.0058104507625103
14: loss: 0.00037843361496925354
15: loss: -0.00018790364265441895
16: loss: -0.0001080445945262909
17: loss: -0.001908978447318077
18: loss: -0.00023999251425266266
19: loss: 0.03030853345990181
20: loss: -0.00021585077047348022
21: loss: 0.0001592119224369526
22: loss: -0.00013920455239713192
23: loss: -0.0021669212728738785
24: loss: 0.00395401194691658
25: loss: -5.50001859664917e-05
26: loss: -0.0026106592267751694
27: loss: -0.0008263345807790756
28: loss: 0.0012336960062384605
29: loss: 5.2521005272865295e-05
30: loss: -0.005257192999124527
```
@djqualia hey! fixed the issues, it had to do with batch size of 1 as you figured out
so in contrastive learning, one is forcing the network to play a game of matching up pairs across the two modalities (text and audio in this case), so a batch size of 1 means there is no game to play
for the negative numbers, i believe it is a result of the paper going with decoupled contrastive learning. it is my first time seeing this technique used in the wild, and i believe the loss can indeed go negative with it
you can try turning it off and it should be all positive values
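here's a rough sketch of the difference (illustrative only, not the exact code in this repository): the decoupled variant drops the positive pair from the softmax denominator, so the loss is no longer bounded below by zero

```python
import torch
import torch.nn.functional as F

# sim: (batch, batch) cosine similarities between audio and text embeddings,
# where the diagonal holds the matching (positive) pairs

def info_nce(sim, temperature = 0.1):
    # standard contrastive loss: the positive sits inside the softmax denominator, so loss >= 0
    logits = sim / temperature
    labels = torch.arange(sim.shape[0])
    return F.cross_entropy(logits, labels)

def decoupled_contrastive(sim, temperature = 0.1):
    # decoupled variant: only the negatives form the denominator, so the loss can dip below 0
    logits = sim / temperature
    pos = logits.diagonal()
    negatives_only = logits.masked_fill(torch.eye(sim.shape[0], dtype = torch.bool), float('-inf'))
    return (-pos + negatives_only.logsumexp(dim = -1)).mean()

sim = torch.eye(4) + 0.05 * torch.randn(4, 4)   # well-aligned pairs
print(info_nce(sim))                            # always positive
print(decoupled_contrastive(sim))               # can go negative once alignment is good
```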
@djqualia so realistically, MuLaN won't be trained with the code in this repository
we should rely on making open-clip audio compatible
recently they managed to get CoCa working, and so we can easily reach SOTA for audio clip training (think audio clip + coca + some other features in that repository), and also get a great audio captioner to boot
@lucidrains So how should we train the MuLaN model? Is it OK if we use batch_size > 1?
@ukemamaster yup, you need very high batch sizes, like 128-256
this is why the job is better suited for a group like open clip
that group, affiliated with Laion, has done many SOTA open sourced CLIP models by now
you can try tinkering with it on a small scale though
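for local tinkering, the same trainer interface from the script above applies; just crank the batch size as high as memory allows (the dataset name below is a placeholder)

```python
from musiclm_pytorch import MuLaNTrainer

trainer = MuLaNTrainer(
    mulan = mulan,                 # the MuLaN instance defined earlier in the thread
    dataset = paired_dataset,      # placeholder: any dataset yielding (text, audio) pairs
    batch_size = 256               # contrastive training benefits from large batches (128-256+)
)

trainer.train()
```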
@lucidrains OK. And which dataset did you use to train the MuLaN and the AudioLM transformers?
@ukemamaster they are not trained at all yet, outside of google
if you are interested in having something trained, i would recommend joining Laion and getting in touch with Marianne. She is working on amassing a dataset
@ukemamaster i will put my back into getting MuLaN integrated into open-clip next week
@lucidrains Yes. I am very interested in training and reproducing google's results. I will be waiting for the MuLaN integration. Thanks
thanks for the education @lucidrains !
fwiw i intend to tinker with this mulan implementation (as large a batch size as possible) to see where i get, until another option is available :-)
to make sure i understand, when you talk about integration with open-clip, are you thinking of using open-clip as an implementation for parts of mulan here, or thinking of open-clip as an organization that has the means to create a publicly available model...?
Hi, just wanted to ask is it possible to integrate the pretrained model from this paper instead of mulan? https://github.com/seungheondoh/music-text-representation/ @lucidrains
@djqualia Which dataset are you using to train MuLaN? And the AudioLM transformers?
@djqualia @lucidrains It seems like MuLaN needs a dataset containing audio and text in pairs, like:

sample_1 = [music_audio_1, text_1]
sample_2 = [music_audio_2, text_2]
...

But SoundStream and the 3 transformers for AudioLM can be trained using audio only.

So my question is: is it OK to train MuLaN on a music dataset, and SoundStream plus the 3 transformers on a speech (non-music) dataset, yet condition them via the MuLaNEmbedQuantizer with a MuLaN trained on music data? Or do they all need to be trained on the same music dataset?
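For concreteness, here is a rough sketch of the data layouts I mean (class names and shapes are placeholders; the quantizer usage follows the README example in this repository, if I am reading it correctly):

```python
import torch
from torch.utils.data import Dataset
from musiclm_pytorch import MuLaNEmbedQuantizer

# paired (text, audio) samples - needed only for MuLaN itself
class PairedMusicTextDataset(Dataset):                   # placeholder name
    def __len__(self):
        return 100
    def __getitem__(self, idx):
        music_audio = torch.randn(320 * 32)              # raw waveform
        text = torch.randint(0, 256, (128,)).long()      # tokenized caption
        return text, music_audio

# audio-only samples - enough for SoundStream and the 3 AudioLM transformers
class AudioOnlyDataset(Dataset):                         # placeholder name
    def __len__(self):
        return 100
    def __getitem__(self, idx):
        return torch.randn(320 * 32)

# once MuLaN is trained (on music), its embeddings condition the transformers
# through the quantizer, regardless of which audio the transformers themselves saw
quantizer = MuLaNEmbedQuantizer(
    mulan = mulan,                            # the music-trained MuLaN
    conditioning_dims = (1024, 1024, 1024),   # one dim per transformer
    namespaces = ('semantic', 'coarse', 'fine')
)

conds = quantizer(wavs = torch.randn(2, 320 * 32), namespace = 'semantic')
```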
The original paper said they used pretrained models, which I believe were each trained on different datasets
@epinnock The original paper says they used pretrained weights only for the MuLaN model. The rest (SoundStream, w2v-BERT, and the 3 Transformers) were trained using music data. However, the datasets for MuLaN (according to their original paper) and for the MusicLM Transformers are not public.
Thanks for updating this @ukemamaster. Are there currently pretrained implementations of SoundStream and w2v-BERT? Also, I saw this paper recently that implements music generation using MuLaN and diffusion: https://google-research.github.io/noise2music/noise2music.pdf. This is fairly outside my area of expertise. Also, instead of MuLaN could you use this? https://github.com/seungheondoh/music-text-representation/ @lucidrains @ukemamaster
> thanks for the education @lucidrains !
> fwiw i intend to tinker with this mulan implementation (as large a batch size as possible) to see where i get, until another option is available :-)
> to make sure i understand, when you talk about integration with open-clip, are you thinking of using open-clip as an implementation for parts of mulan here, or thinking of open-clip as an organization that has the means to create a publicly available model...?
yup both! we can train there, i can build wrappers that have common interfaces that support their pretrained models, as i've done for dalle2
their group, Laion, is a very legit crowd! many many successes by now
actually, i don't think open-clip is under Laion, just that they use the Laion dataset, and have a lot of infra support from Laion + Stability
> Hi, just wanted to ask is it possible to integrate the pretrained model from this paper instead of mulan? https://github.com/seungheondoh/music-text-representation/ @lucidrains
i'm not sure, is it any good? honestly, i think there isn't a great open sourced foundation model for text-audio or text-music yet, or i'd have heard about it by now
> @epinnock The original paper says they used pretrained weights only for the MuLaN model. The rest (SoundStream, w2v-BERT, and the 3 Transformers) were trained using music data. However, the datasets for MuLaN (according to their original paper) and for the MusicLM Transformers are not public.
yeah, you all should join Laion and start thinking about data in a collaborative manner
realistically, no one has had the level of success of Laion. i mean, they even won an outstanding paper award at the last NeurIPS. it would be wise to simply join their group at this point
ok, i'm going to close this issue, as it has been addressed
@djqualia Have you tried training MuLaN with some real data? If yes, which data? and how do you convert text into tokens, to feed into the model?
fwiw i have not yet gotten around to training mulan. i have limited hardware and am still trying to train soundstream/hubert first (while putting together a larger dataset). for music sources, consider FMA (Free Music Archive), Jamendo, and AudioSet.
i'm excited to try this out!
i attempted to train, feeding in a MockTextAudioDataset similar to the example on AudioLM's page (that worked with the semantic trainer there), but encountered the following exception: TypeError: 'int' object is not iterable
Full stack trace, in case it helps:
File "train_mulan.py", line 60, in
trainer.train()
File "<@beartype(musiclm_pytorch.trainer.MuLaNTrainer.train) at 0x7ff0e221f160>", line 30, in train
File "/mnt/c/audio-ml-workspace/musiclm/musiclm_pytorch/trainer.py", line 363, in train
logs = self.train_step()
File "/mnt/c/audio-ml-workspace/musiclm/musiclm_pytorch/trainer.py", line 330, in train_step
data_kwargs = self.data_tuple_to_kwargs(next(self.dl_iter))
File "/mnt/c/audio-ml-workspace/musiclm/musiclm_pytorch/trainer.py", line 57, in cycle
for data in dl:
File "/home/qualia/anaconda3/envs/audiolm/lib/python3.8/site-packages/accelerate/data_loader.py", line 375, in iter
current_batch = next(dataloader_iter)
File "/home/qualia/anaconda3/envs/audiolm/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in next
data = self._next_data()
File "/home/qualia/anaconda3/envs/audiolm/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/qualia/anaconda3/envs/audiolm/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
return self.collate_fn(data)
File "/mnt/c/audio-ml-workspace/musiclm/musiclm_pytorch/trainer.py", line 146, in inner
output = fn(datum)
File "/mnt/c/audio-ml-workspace/musiclm/musiclm_pytorch/trainer.py", line 156, in curtail_to_shortest_collate
min_len = min(*[datum.shape[0] for datum in data])
TypeError: 'int' object is not iterable
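fwiw, here's a minimal reproduction of that last call outside of the library, in case it helps pinpoint the issue: with a batch of one, the star-unpacking hands min() a bare int

```python
# minimal reproduction of the batch_size = 1 failure (not the library's own code):
# the collate does min(*[datum.shape[0] for datum in data]); with a single datum
# this becomes min(26656), and min() then tries to iterate the int

lengths = [26656]            # shapes from a "batch" of one waveform
try:
    min(*lengths)            # equivalent to min(26656)
except TypeError as err:
    print(err)               # 'int' object is not iterable

print(min(lengths))          # passing the list itself (no unpacking) works for any batch size
```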