lucidrains / musiclm-pytorch

Implementation of MusicLM, Google's new SOTA model for music generation using attention networks, in PyTorch
MIT License

Please make a Google Colab, I can't really do anything. Plus I don't know how to get the HuBERT model #8

Closed BigFatMan312312 closed 1 year ago

hdgjdhgjutzujgfh commented 1 year ago

Just combine SemanticTransformerTrainer with this and you've got it: https://github.com/nateraw/download-musiccaps-dataset https://www.kaggle.com/datasets/googleai/musiccaps

But a Colab for custom songs and genres would be nice.

HuBERT model: https://github.com/facebookresearch/fairseq/tree/main/examples/hubert
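
For reference, a minimal sketch of wiring the two together, adapted from the audiolm-pytorch README (the HuBERT checkpoint paths and the ./music_data folder are placeholders, and I have not verified these exact hyperparameters for music):

from audiolm_pytorch import HubertWithKmeans, SemanticTransformer, SemanticTransformerTrainer

# HuBERT checkpoint and k-means files from the fairseq link above (placeholder paths)
wav2vec = HubertWithKmeans(
    checkpoint_path = './hubert/hubert_base_ls960.pt',
    kmeans_path = './hubert/hubert_base_ls960_L9_km500.bin'
)

semantic_transformer = SemanticTransformer(
    num_semantic_tokens = wav2vec.codebook_size,
    dim = 1024,
    depth = 6
)

# point the trainer at a folder of wav clips, e.g. from the MusicCaps downloader above
trainer = SemanticTransformerTrainer(
    transformer = semantic_transformer,
    wav2vec = wav2vec,
    folder = './music_data',
    batch_size = 1,
    data_max_length = 320 * 32,
    num_train_steps = 1
)

trainer.train()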

aaronannecchiarico commented 1 year ago

https://github.com/aaronannecchiarico/musiclm-pytorch/blob/main/musiclm-pytorch-demo.ipynb

@hdgjdhgjutzujgfh @BigFatMan312312 - This is my attempt at that; I am likely messing up somewhere or misunderstanding something, as I am getting just noise as my output. Training with a dataset of 32 and one of 2 results in the same sound.

ukemamaster commented 1 year ago

@BigFatMan312312 Why would you train MuLaN with random data?

# get a ton of <sound, text> pairs and train

wavs = torch.randn(2, 512)
texts = torch.randint(0, 20000, (2, 256))
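
Those two lines are just the README's placeholder. For context, the surrounding README sketch looks roughly like this (hyperparameters copied from the repo README; the random tensors are stand-ins for real <sound, text> batches):

import torch
from musiclm_pytorch import MuLaN, AudioSpectrogramTransformer, TextTransformer

audio_transformer = AudioSpectrogramTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64,
    spec_n_fft = 128,
    spec_win_length = 24,
    spec_aug_stretch_factor = 0.8
)

text_transformer = TextTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64
)

mulan = MuLaN(
    audio_transformer = audio_transformer,
    text_transformer = text_transformer
)

# replace these with real <sound, text> pairs before training
wavs = torch.randn(2, 1024)
texts = torch.randint(0, 20000, (2, 256))

loss = mulan(wavs, texts)
loss.backward()
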
aaronannecchiarico commented 1 year ago

@ukemamaster You're right. I have updated that notebook with a more in-progress version; are you able to view it? latest

After attempting to adapt that dataset into one I can pass to the trainer, I encounter: RuntimeError: stack expects each tensor to be equal size, but got [2, 958728] at entry 0 and [2, 441352] at entry 3

ukemamaster commented 1 year ago

@aaronannecchiarico Maybe your audio files do not have the same shape.
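
A hedged workaround (my own sketch, not from the repo) is to crop or zero-pad every clip to one fixed length in __getitem__, so torch.stack sees equal sizes; TARGET_SAMPLES below is an arbitrary placeholder:

import torch
import torch.nn.functional as F

TARGET_SAMPLES = 441000  # placeholder: roughly 10 s at 44.1 kHz

def fix_length(wav):
    # wav: tensor of shape [channels, samples]
    if wav.shape[-1] > TARGET_SAMPLES:
        wav = wav[..., :TARGET_SAMPLES]  # crop long clips
    elif wav.shape[-1] < TARGET_SAMPLES:
        wav = F.pad(wav, (0, TARGET_SAMPLES - wav.shape[-1]))  # zero-pad short ones
    return wav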

M4iKZ commented 1 year ago

I can confirm there is an error with the YouTube dataset, because I get the same error: RuntimeError: stack expects each tensor to be equal size, but got [2, 958728] at entry 0 and [2, 441352] at entry 3

creative-cranels commented 1 year ago

https://gist.github.com/creative-cranels/0cd692c9dd23a4c4dcb6fb6786eaa303

Hey all, I tried to run training and music generation in Colab. The sound it generated is just noise, but at least there are no errors in the training part. Can you check out my notebook and help me understand why it generated just random noise?

@aaronannecchiarico took a lot of scripts from your notebook ☮️

deepak-newzera commented 1 year ago

@ukemamaster Can you please provide a script to train MuLaN, along with the dataset that should be used?

ukemamaster commented 1 year ago

@deepak-newzera Have a look at this.

  1. You must load real wav files and the corresponding text descriptions in the MockTextAudioDataset class.
  2. You will also need to convert the "text" data into some kind of tokens, like this.

At the moment I am using the 5.5k test set from MusicCaps.

Ace71425 commented 1 year ago

@creative-cranels @aaronannecchiarico Hey guys!

I've been trying to run both of your Colabs in a Jupyter notebook locally, to see if I can help get this working.

In both notebooks under the line "ds = main('./music_data', num_proc=2, limit=30, writer_batch_size=4)"

I keep getting a "NameError: name 'os' is not defined"

That seems to stem from arrow_dataset.py and pool.py ... any ideas?

deepak-newzera commented 1 year ago

@deepak-newzera Have a look at this.

  1. You must load real wav files and the corresponding text descriptions in the MockTextAudioDataset class.
  2. You will also need to convert the "text" data into some kind of tokens, like this.

At the moment I am using the 5.5k test set from MusicCaps.

@ukemamaster Thanks for your response. In the musiccaps-public.csv file we are concerned only with the ytid and caption columns, right? And the caption has to be tokenized like this: output = tokenizer.encode(caption). Am I right?

Now, how do I represent the audio wav files and their corresponding tokens, and how do I load them in the MockTextAudioDataset in the code? Please help me with this.

ukemamaster commented 1 year ago

@ukemamaster Thanks for your response. In the musiccaps-public.csv file we are concerned only with the ytid and caption columns, right? And the caption has to be tokenized like this: output = tokenizer.encode(caption). Am I right?

Yes

Now, how do I represent the audio wav files and their corresponding tokens, and how do I load them in the MockTextAudioDataset in the code? Please help me with this.

  1. Load the wav file using wav = scipy.io.wavfile.read(), wav = torchaudio.load(), or maybe librosa. It's up to you.
  2. text = tokenizer.encode(caption) will give you a list of numbers.

Return text and wav in the __getitem__(self, idx) method of the dataset class.

Example:

import csv

import numpy as np
import scipy.io.wavfile
import torch
from torch.utils.data import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def read_data(filename):
    # map ytid (column 0) to the free-form caption (column 5) of musiccaps-public.csv
    with open(filename, 'r') as read_obj:
        csv_reader = csv.reader(read_obj)
        list_of_rows = list(csv_reader)
        list_of_rows.pop(0)  # drop the header row
        free_form_dict = {}
        for l in list_of_rows:
            free_form_dict[l[0]] = l[5]
        return free_form_dict

class MockTextAudioDataset(Dataset):
    def __init__(self, list_path):
        super().__init__()
        self.free_form_dict = read_data(list_path)
        self.data_list = list(self.free_form_dict.keys())
        self.len = len(self.data_list)

    def __len__(self):
        return self.len

    def __getitem__(self, idx):
        # keys are ytids, assumed to double as the wav filenames on disk
        filename = self.data_list[idx]
        rate, audio = scipy.io.wavfile.read(filename)
        text = self.free_form_dict[filename]
        text_in_numbers = np.array(tokenizer.encode(text))
        return torch.FloatTensor(audio), torch.LongTensor(text_in_numbers)

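A quick usage sketch (assuming musiccaps-public.csv is local and each clip was saved under its ytid, which is how the dictionary keys are used as filenames above):

dataset = MockTextAudioDataset('musiccaps-public.csv')
wav, text = dataset[0]
print(wav.shape, text.shape)
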
deepak-newzera commented 1 year ago

@ukemamaster I am facing an issue with the tokenizer. I initialized it as tokenizer = Tokenizer.from_file("data/tokenizer-wiki.json") and then tried to use it as follows:

output = tokenizer.encode('someone is playing a high pitched melody on a steel drum. The file is of poor audio-quality.') The output I got is Encoding(num_tokens=21, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

Then I printed np.array(output). It is showing up as array(Encoding(num_tokens=21, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]), dtype=object)

Then I tried to print torch.LongTensor(np.array(output)). It gives the following error: TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

Is this an issue with my tokenizer? If so, please help me resolve it.

ukemamaster commented 1 year ago

@deepak-newzera Are you sure you have initialized your tokenizer correctly?

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
output = tokenizer.encode('someone is playing a high pitched melody on a steel drum. The file is of poor audio-quality.')
print(output)
#prints:
#[101, 1800, 1110, 1773, 170, 1344, 7813, 11961, 1113, 170, 3649, 7505, 119, 1109, 4956, 1110, 1104, 2869, 6056, 118, 3068, 119, 102]

deepak-newzera commented 1 year ago

@deepak-newzera Are you sure you have initialized your tokenizer correctly?

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
output = tokenizer.encode('someone is playing a high pitched melody on a steel drum. The file is of poor audio-quality.')
print(output)
#prints:
#[101, 1800, 1110, 1773, 170, 1344, 7813, 11961, 1113, 170, 3649, 7505, 119, 1109, 4956, 1110, 1104, 2869, 6056, 118, 3068, 119, 102]

@ukemamaster I was using a different tokenizer. Now I got it. Thank you.

deepak-newzera commented 1 year ago

This is the error I am facing now while training MuLaN: RuntimeError: stack expects each tensor to be equal size, but got [441696, 2] at entry 0 and [441696] at entry 1. Is there a way to make all the tensors equal in size? Any ideas?

ukemamaster commented 1 year ago

@deepak-newzera Check whether the audio from rate, audio = scipy.io.wavfile.read(filename) is mono (1 channel) or stereo (2 channels). If stereo, take just one channel, and the shape will be [1, -1].

deepak-newzera commented 1 year ago

@deepak-newzera Check whether the audio from rate, audio = scipy.io.wavfile.read(filename) is mono (1 channel) or stereo (2 channels). If stereo, take just one channel, and the shape will be [1, -1].

if audio.ndim == 2 and audio.shape[1] == 2:
    # Convert two channels to one channel by taking the mean along the channel dimension
    audio = np.mean(audio, axis=1, keepdims=True)

This is how I am going to do it. Is this ok?

beschulz commented 1 year ago

@ukemamaster Thanks for your response. In the musiccaps-public.csv file we are concerned only with the ytid and caption columns, right? And the caption has to be tokenized like this: output = tokenizer.encode(caption). Am I right?

Yes

Now, how do I represent the audio wav files and their corresponding tokens, and how do I load them in the MockTextAudioDataset in the code? Please help me with this.

  1. Load the wav file using wav = scipy.io.wavfile.read(), wav = torchaudio.load(), or maybe librosa. It's up to you.
  2. text = tokenizer.encode(caption) will give you a list of numbers.

Return text and wav in the __getitem__(self, idx) method of the dataset class.

Example:

    ...
    def __getitem__(self, idx):
        filename = self.data_list[idx]
        rate, audio = scipy.io.wavfile.read(filename)
        text = self.free_form_dict[filename]
        text_in_numbers = np.array(tokenizer.encode(text))
        return torch.FloatTensor(audio), torch.LongTensor(text_in_numbers)

Are you sure the returned pair shouldn't be the other way around, i.e. torch.LongTensor(text_in_numbers), torch.FloatTensor(audio)?

ukemamaster commented 1 year ago

@lucidrains It can be in any order, no?

deepak-newzera commented 1 year ago

@lucidrains @ukemamaster I tried to train MuLaN on a GPU and am facing the error below. I am using the MusicCaps dataset, which contains 5.5k audio-text pairs, for the training.

OutOfMemoryError: CUDA out of memory. Tried to allocate 213.98 GiB (GPU 0; 23.64 GiB total capacity; 10.49 GiB already allocated; 12.03 GiB free; 10.60 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I also tried running it on a CPU with 252 GB of RAM. In that case it occupies the entire memory and then the Python kernel dies. Is there a way to optimize this memory utilization? I also tried reducing the batch size to 2, but that did not help.
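
One thing worth trying (a sketch of my own; MAX_SAMPLES is an arbitrary placeholder, not a repo setting) is to crop each waveform to a fixed maximum length inside __getitem__, since transformer memory grows quickly with sequence length:

import numpy as np

MAX_SAMPLES = 10 * 44100  # placeholder: cap clips at ~10 s of 44.1 kHz audio

def crop_audio(audio):
    # audio: 1-D numpy array; take a random crop of long clips to bound memory
    if len(audio) > MAX_SAMPLES:
        start = np.random.randint(0, len(audio) - MAX_SAMPLES)
        audio = audio[start:start + MAX_SAMPLES]
    return audio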

deepak-newzera commented 1 year ago

@ukemamaster My training does not even get started after the message: spectrogram yielded shape of (65, 79895), but had to be cropped to (64, 79888) to be patchified for transformer

I directly get the above-mentioned OutOfMemoryError after this message. What are the hardware requirements for this training?

Mingxiangyu commented 1 year ago


@Ace71425 Do you have the full error message? It may be that your os package is only imported inside another function; alternatively, try placing "import os" below the other "import ***" lines.

Mingxiangyu commented 1 year ago

https://gist.github.com/creative-cranels/0cd692c9dd23a4c4dcb6fb6786eaa303

Hey all, I tried to run training and music generation in Colab. The sound it generated is just noise, but at least there are no errors in the training part. Can you check out my notebook and help me understand why it generated just random noise?

@aaronannecchiarico took a lot of scripts from your notebook ☮️

@creative-cranels This should solve your problem, since you were using the test data.

Mingxiangyu commented 1 year ago

@deepak-newzera

if audio.ndim == 2 and audio.shape[1] == 2:
    # Convert two channels to one channel by taking the mean along the channel dimension
    audio = np.mean(audio, axis=1, keepdims=True)

There are still exceptions when using this method:


Traceback (most recent call last):
  File "D:\Anaconda\envs\musiclm-pytorch\lib\site-packages\accelerate\data_loader.py", line 378, in __iter__
    current_batch = next(dataloader_iter)
  File "D:\Anaconda\envs\musiclm-pytorch\lib\site-packages\torch\utils\data\dataloader.py", line 628, in __next__
    data = self._next_data()
  File "D:\Anaconda\envs\musiclm-pytorch\lib\site-packages\torch\utils\data\dataloader.py", line 671, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "D:\Anaconda\envs\musiclm-pytorch\lib\site-packages\torch\utils\data\_utils\fetch.py", line 61, in fetch
    return self.collate_fn(data)
  File "E:\WorkSpace\pyWorkSpace\musiclm-pytorch\musiclm_pytorch\trainer.py", line 122, in inner
    output = fn(datum)
  File "E:\WorkSpace\pyWorkSpace\musiclm-pytorch\musiclm_pytorch\trainer.py", line 134, in curtail_to_shortest_collate
    return torch.stack(data)
RuntimeError: stack expects each tensor to be equal size, but got [958728, 1] at entry 0 and [958728] at entry 1

Did it work correctly for you?
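
For what it's worth, the [958728, 1] vs [958728] mismatch above looks like the trailing channel dimension left by keepdims=True; here is a minimal sketch of my own that averages stereo down to a flat 1-D mono array instead:

import numpy as np

def to_mono(audio):
    # average the channel axis WITHOUT keepdims, so the result stays 1-D
    if audio.ndim == 2:
        audio = audio.mean(axis=1)
    return audio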