Closed BigFatMan312312 closed 1 year ago
https://github.com/aaronannecchiarico/musiclm-pytorch/blob/main/musiclm-pytorch-demo.ipynb
@hdgjdhgjutzujgfh @BigFatMan312312 - This is my attempt at that; likely I am messing up somewhere or misunderstanding something, as I am getting just noise as my output. Training with a dataset of 32 samples and one of 2 samples results in the same sound.
@BigFatMan312312 Why would you train MuLaN with random data?
# get a ton of pairs and train
wavs = torch.randn(2, 512)
texts = torch.randint(0, 20000, (2, 256))
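For reference, the wavs and texts above are just the random placeholders from the repository README; in an actual run they would be replaced by batches of real waveforms and tokenized captions. A rough sketch of that substitution (the dataset variable here is hypothetical and should yield (waveform, token) pairs as discussed later in this thread; mulan is assumed to be a MuLaN instance set up as in the README):

import torch
from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=2, shuffle=True)  # 'dataset' is hypothetical

for wavs, texts in loader:
    loss = mulan(wavs, texts)  # assuming MuLaN's forward returns a contrastive loss
    loss.backward()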
@ukemamaster you're right. I have updated that notebook with a more in-progress version; are you able to view the latest?
After attempting to adapt that dataset into one I can pass into the trainer I encounter:
RuntimeError: stack expects each tensor to be equal size, but got [2, 958728] at entry 0 and [2, 441352] at entry 3
@aaronannecchiarico maybe your audio files do not all have the same shape.
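If the clips genuinely differ in length, one common fix is to crop (or pad) every waveform to the same number of samples inside __getitem__ before the trainer stacks them into a batch. A minimal sketch, assuming 1-D numpy waveforms (the target length is an arbitrary choice, not something from this thread):

import numpy as np

TARGET_SAMPLES = 441000  # hypothetical: roughly 10 s at 44.1 kHz

def fix_length(audio, target=TARGET_SAMPLES):
    # crop long clips and zero-pad short ones so every item has the same shape
    if audio.shape[0] >= target:
        return audio[:target]
    return np.pad(audio, (0, target - audio.shape[0]))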
I can confirm there is an error with the YouTube dataset, because I get the same error:
RuntimeError: stack expects each tensor to be equal size, but got [2, 958728] at entry 0 and [2, 441352] at entry 3
https://gist.github.com/creative-cranels/0cd692c9dd23a4c4dcb6fb6786eaa303
Hey all, I tried to run training and music generation in Colab. The sound it generated is just noise, but at least there are no errors in the training part. Can you guys check out my notebook and help me understand why it generated just random noise?
@aaronannecchiarico I took a lot of scripts from your notebook ☮️
@ukemamaster Can you please provide a script to train MuLaN, along with the necessary dataset that needs to be used?
@creative-cranels @aaronannecchiarico Hey guys!
I've been trying to run both your colabs in a Jupyter notebook locally, to see if I can help get this working.
In both notebooks, under the line "ds = main('./music_data', num_proc=2, limit=30, writer_batch_size=4)",
I keep getting a "NameError: name 'os' is not defined".
That seems to stem from arrow_dataset.py and pool.py ... any ideas?
@deepak-newzera Have a look at this.
- You must load real wav files and the corresponding text descriptions in the MockTextAudioDataset class.
- You will also need to convert the "text" data into some kind of tokens, like this.
At the moment I am using the 5.5k test data from MusicCaps.
@ukemamaster Thanks for your response. In the musiccaps-public.csv file we are concerned only about the ytid and caption columns, right? And the caption has to be tokenized in this way: output = tokenizer.encode(caption). Am I right?
Now how to represent the audio wav files and their corresponding tokens, and how to load them in the MockTextAudioDataset in the code? Please help me with this.
@ukemamaster Thanks for your response. In the musiccaps-public.csv file we are concerned only about the ytid and caption columns, right? And the caption has to be tokenized in this way: output = tokenizer.encode(caption). Am I right?

Yes

Now how to represent the audio wav files and their corresponding tokens, and how to load them in the MockTextAudioDataset in the code? Please help me with this.

- Load the wav file using wav = scipy.io.wavfile.read() or wav = torchaudio.load() or maybe librosa. It's up to you.
- text = tokenizer.encode(caption) will give you a list of numbers.
- Return text and wav in the __getitem__(self, idx) method of the dataset class.
Example:
import csv

import numpy as np
import scipy.io.wavfile
import torch
from torch.utils.data import Dataset


def read_data(filename):
    # map each wav file path (first column) to its free-form caption (sixth column)
    with open(filename, 'r') as read_obj:
        csv_reader = csv.reader(read_obj)
        list_of_rows = list(csv_reader)
        list_of_rows.pop(0)  # drop the header row
        free_form_dict = {}
        for l in list_of_rows:
            free_form_dict[l[0]] = l[5]
    return free_form_dict


class MockTextAudioDataset(Dataset):
    def __init__(self, list_path):
        super().__init__()
        self.free_form_dict = read_data(list_path)
        self.data_list = list(self.free_form_dict.keys())
        self.len = len(self.data_list)

    def __len__(self):
        return self.len

    def __getitem__(self, idx):
        filename = self.data_list[idx]
        rate, audio = scipy.io.wavfile.read(filename)
        text = self.free_form_dict[filename]
        # tokenizer is assumed to be defined globally (e.g. a Hugging Face tokenizer, see below)
        text_in_numbers = np.array(tokenizer.encode(text))
        return torch.FloatTensor(audio), torch.LongTensor(text_in_numbers)
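For context, a minimal usage sketch of the class above (the CSV path is hypothetical):

dataset = MockTextAudioDataset('musiccaps_local.csv')  # hypothetical CSV of wav paths and captions
audio, text = dataset[0]
print(audio.shape, text.shape)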
@ukemamaster I am facing an issue with the tokenizer.
tokenizer = Tokenizer.from_file("data/tokenizer-wiki.json"). This is how I initialized it, and then I tried to use it as follows:
output = tokenizer.encode('someone is playing a high pitched melody on a steel drum. The file is of poor audio-quality.')
Here the output I got is Encoding(num_tokens=21, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]).
Then I printed np.array(output). It is showing up as array(Encoding(num_tokens=21, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]), dtype=object).
Then I tried to print torch.LongTensor(np.array(output)). It is giving the error below:
TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.
Is this an issue with my tokenizer? If yes, please help me resolve it.
@deepak-newzera Are you sure you have initialized your tokenizer correctly?
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
output = tokenizer.encode('someone is playing a high pitched melody on a steel drum. The file is of poor audio-quality.')
print(output)
#prints:
#[101, 1800, 1110, 1773, 170, 1344, 7813, 11961, 1113, 170, 3649, 7505, 119, 1109, 4956, 1110, 1104, 2869, 6056, 118, 3068, 119, 102]
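For what it's worth, if you do stick with the tokenizers library, the Encoding object that encode() returns exposes the token ids as a plain list via its ids attribute (a small sketch, assuming the same tokenizer-wiki.json file as above):

import torch
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("data/tokenizer-wiki.json")
output = tokenizer.encode('someone is playing a high pitched melody on a steel drum.')
text_in_numbers = torch.LongTensor(output.ids)  # output.ids is a plain list of ints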
@ukemamaster I was using a different tokenizer. Now I got it. Thank you.
RuntimeError: stack expects each tensor to be equal size, but got [441696, 2] at entry 0 and [441696] at entry 1
This is the error that I am facing now while training MuLaN. Is there a way to make all the tensors equal in size? Any idea?
@deepak-newzera check if the audio from rate, audio = scipy.io.wavfile.read(filename) is mono (1 channel) or stereo (2 channels).
If stereo, take just one channel, and the shape will be [1, -1].
@deepak-newzera check if the audio from rate, audio = scipy.io.wavfile.read(filename) is mono (1 channel) or stereo (2 channels). If stereo, take just one channel, and the shape will be [1, -1].
if audio.ndim == 2 and audio.shape[1] == 2:
# Convert two channels to one channel by taking the mean along the channel dimension
audio = np.mean(audio, axis=1, keepdims=True)
This is how I am going to do it. Is this ok?
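One thing to note about this snippet: with keepdims=True the averaged audio keeps a trailing channel dimension of size 1, so its shape is [num_samples, 1] rather than [num_samples], and torch.stack may then refuse to batch it alongside mono clips. A minimal sketch that drops the extra dimension (assuming the integer array returned by scipy.io.wavfile.read):

import numpy as np

def to_mono(audio):
    # audio: array from scipy.io.wavfile.read; average channels down to a 1-D waveform
    if audio.ndim == 2:
        audio = audio.mean(axis=1)
    return audio.astype(np.float32)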
return torch.FloatTensor(audio), torch.LongTensor(text_in_numbers)
are you sure the returned pair shouldn't be the other way around? i.e. torch.LongTensor(text_in_numbers), torch.FloatTensor(audio)?
@lucidrains It can be in any order, no?
@lucidrains @ukemamaster I tried to train MuLaN on a GPU and am facing the below error. I am using the MusicCaps dataset, which contains 5.5k audio-text pairs, for training.
OutOfMemoryError: CUDA out of memory. Tried to allocate 213.98 GiB (GPU 0; 23.64 GiB total capacity; 10.49 GiB already allocated; 12.03 GiB free; 10.60 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I also tried running it on a CPU with 252 GB of RAM. In this case, it occupies the entire memory and then the Python kernel dies. Is there a way to optimize this memory utilization? I also tried reducing the batch size to 2, but it still could not be optimized.
@ukemamaster My training is not even getting started; after the statement "spectrogram yielded shape of (65, 79895), but had to be cropped to (64, 79888) to be patchified for transformer" I directly get the above-mentioned OutOfMemoryError. What are the hardware requirements for this training?
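For what it's worth, an allocation of hundreds of GiB suggests that each full-length wav (millions of samples) is being fed to the model whole, so the attention cost explodes. One common mitigation is to crop every item to a fixed window before returning it from __getitem__. A small sketch of a random crop (the 10-second window is an assumption, not something specified in this thread):

import numpy as np

MAX_SECONDS = 10  # hypothetical cap on clip length

def random_crop(audio, rate, max_seconds=MAX_SECONDS):
    # randomly crop a 1-D waveform to at most max_seconds of audio
    max_samples = int(rate * max_seconds)
    if audio.shape[0] <= max_samples:
        return audio
    start = np.random.randint(0, audio.shape[0] - max_samples)
    return audio[start:start + max_samples]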
@Ace71425 Is there a specific error message? It may be that your os package is being referenced in another function, or try putting "import os" below the other "import ***" statements.
@creative-cranels This should solve your problem, since you were using the test data.
@deepak-newzera
if audio.ndim == 2 and audio.shape[1] == 2:
# Convert two channels to one channel by taking the mean along the channel dimension
audio = np.mean(audio, axis=1, keepdims=True)
There are still exceptions when using this method.
Traceback (most recent call last):
File "D:\Anaconda\envs\musiclm-pytorch\lib\site-packages\accelerate\data_loader.py", line 378, in __iter__
current_batch = next(dataloader_iter)
File "D:\Anaconda\envs\musiclm-pytorch\lib\site-packages\torch\utils\data\dataloader.py", line 628, in __next__
data = self._next_data()
File "D:\Anaconda\envs\musiclm-pytorch\lib\site-packages\torch\utils\data\dataloader.py", line 671, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "D:\Anaconda\envs\musiclm-pytorch\lib\site-packages\torch\utils\data\_utils\fetch.py", line 61, in fetch
return self.collate_fn(data)
File "E:\WorkSpace\pyWorkSpace\musiclm-pytorch\musiclm_pytorch\trainer.py", line 122, in inner
output = fn(datum)
File "E:\WorkSpace\pyWorkSpace\musiclm-pytorch\musiclm_pytorch\trainer.py", line 134, in curtail_to_shortest_collate
return torch.stack(data)
RuntimeError: stack expects each tensor to be equal size, but got [958728, 1] at entry 0 and [958728] at entry 1
Were you able to use it normally?
Just combine SemanticTransformerTrainer with this and you've got it: https://github.com/nateraw/download-musiccaps-dataset https://www.kaggle.com/datasets/googleai/musiccaps
But a Colab for custom songs and genres would be nice.
HuBERT model: https://github.com/facebookresearch/fairseq/tree/main/examples/hubert