This is what my labels file looks like.
I increased the size of my dataset by a factor of 30; I now have around 4000 examples. I figured the code was not getting enough examples in the training pipeline. The terminal is not showing the same error, but it is still stuck at the first epoch. Here's the output.
Some weights of the model checkpoint at facebook/wav2vec2-base-960h were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_v', 'wav2vec2.encoder.pos_conv_embed.conv.weight_g']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized because the shapes did not match:
- lm_head.bias: found shape torch.Size([32]) in the checkpoint and torch.Size([29]) in the model instantiated
- lm_head.weight: found shape torch.Size([32, 768]) in the checkpoint and torch.Size([29, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Cuda Device Available.
INFO:WarmupCosineDecay:Epoch 1 - Learning Rate: 1e-08
0%| | 0/527 [00:00<?, ?it/s]/home/ee/anaconda3/lib/python3.9/site-packages/mltu/transformers.py:234: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
return padded_audios, np.array(label)
Epoch 1 - loss: 27.2745 - CER: 3.4600 - WER: 1.0620: 1%| | 6/527 [00:07<05:46,
I would really appreciate your help.
Hey, it seems that it started training. If my example works, this should also work for you. You are training on a GPU, I hope? Try decreasing batch_size to 2 and check whether it trains; maybe it's just really slow.
I changed the batch_size to 2, and it seems to have reduced the total time per epoch, but the issue is still there. The examples in the first epoch run for the first few seconds or so, and then the pipeline gets stuck. Yes, I am training on an RTX 3090; the CUDA device gets detected by PyTorch in the code. And your example worked without any hitch.
Try to run something like this in debug mode:

for data in tqdm(data_provider):
    pass  # or something else

There might be a problem where the librosa library hangs while trying to read your audio. What OS are you using? So, try to check whether you can read your audio or not.
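For example, something along these lines (a rough sketch; the wavs folder path is just a placeholder for your own dataset):

import os
import librosa
from tqdm import tqdm

wavs_path = "Datasets/comcast_xfinity/wavs/"  # placeholder, point this at your own wavs folder
for file_name in tqdm(sorted(os.listdir(wavs_path))):
    if not file_name.endswith(".wav"):
        continue
    # librosa.load will hang or raise here if a file cannot be decoded
    audio, sample_rate = librosa.load(os.path.join(wavs_path, file_name), sr=16000)
    if len(audio) == 0:
        print("Empty audio file:", file_name)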
When I run the tqdm code (that you mentioned above), the console throws a TypeError.
import tqdm
for data in tqdm(data_provider):
    print(data)
Traceback (most recent call last):
  File "/tmp/ipykernel_148805/4044221648.py", line 1, in <module>
    for data in tqdm(data_provider):
TypeError: 'module' object is not callable
I use Ubuntu 20.04. Here are the specifics:
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
I tried reading all the audio files in the example dataset and my own dataset with the librosa library in debug mode and the system has no problem reading the files.
Ubuntu should be OK. You need to do the following:

from tqdm import tqdm

for data in tqdm(data_provider):
    print(data)

If you can read data from the data provider, we need to investigate further (best in debug mode, to find the cause).
So, I decreased the batch size to 1 on the original dataset (150 training examples) and it's working now. I think there's an optimal batch size that needs to be set for different dataset sizes. Anyway, thanks for your help!
That's strange, because I also trained on an RTX 3090 with a batch_size of 8 and everything was fine. Run the nvtop command in a terminal and check how much GPU RAM is consumed during training.
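If nvtop is not installed, a rough alternative (a minimal sketch using standard PyTorch calls) is to print the memory PyTorch itself has taken on the GPU:

import torch

if torch.cuda.is_available():
    # bytes currently allocated and reserved by PyTorch on the default CUDA device
    allocated_mb = torch.cuda.memory_allocated() / 1024**2
    reserved_mb = torch.cuda.memory_reserved() / 1024**2
    print(f"allocated: {allocated_mb:.1f} MiB, reserved: {reserved_mb:.1f} MiB")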
This is what the nvtop output looks like when I run the training script on my larger dataset (4560 examples) at a batch size of 8. The console output gets stuck after a few examples in the first epoch.
Actually, I mistyped earlier. My GPU model is 2080, not 3090.
Yes, it seems that the CPU is idle and not doing anything for some reason... It would be really interesting to find where the problem is. Maybe you could upload your training script with part of the dataset, so I could check it out?
This is the training script. I only changed the path of the dataset.
import os
import tarfile
import pandas as pd
from tqdm import tqdm
from io import BytesIO
from urllib.request import urlopen

import torch
from torch import nn
from transformers import Wav2Vec2ForCTC
import torch.nn.functional as F

from mltu.torch.model import Model
from mltu.torch.losses import CTCLoss
from mltu.torch.dataProvider import DataProvider
from mltu.torch.metrics import CERMetric, WERMetric
from mltu.torch.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard, Model2onnx, WarmupCosineDecay
from mltu.augmentors import RandomAudioNoise, RandomAudioPitchShift, RandomAudioTimeStretch
from mltu.preprocessors import AudioReader
from mltu.transformers import LabelIndexer, LabelPadding, AudioPadding

from configs import ModelConfigs

configs = ModelConfigs()

def download_and_unzip(url, extract_to="Datasets", chunk_size=1024*1024):
    http_response = urlopen(url)
    data = b""
    iterations = http_response.length // chunk_size + 1
    for _ in tqdm(range(iterations)):
        data += http_response.read(chunk_size)

    tarFile = tarfile.open(fileobj=BytesIO(data), mode="r|bz2")
    tarFile.extractall(path=extract_to)
    tarFile.close()

# dataset_path = os.path.join("Datasets", "LJSpeech-1.1")
# if not os.path.exists(dataset_path):
#     download_and_unzip("https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2", extract_to="Datasets")

# dataset_path = "Datasets/comcast_xfinity"
dataset_path = "/media/ee/New Volume/mltu/Tutorials/10_wav2vec2_torch/Datasets/comcast_xfinity_rep"
metadata_path = dataset_path + "/metadata.csv"
wavs_path = dataset_path + "/wavs/"

# Read metadata file and parse it
metadata_df = pd.read_csv(metadata_path, sep="|", header=None, quoting=3)

dataset = []
vocab = [' ', "'", 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
for file_name, transcription, normalized_transcription in metadata_df.values.tolist():
    # path = f"Datasets/comcast_xfinity/wavs/{file_name}.wav"
    path = f"/media/ee/New Volume/mltu/Tutorials/10_wav2vec2_torch/Datasets/comcast_xfinity_rep/wavs/{file_name}.wav"
    new_label = "".join([l for l in normalized_transcription.lower() if l in vocab])
    dataset.append([path, new_label])

# Create a data provider for the dataset
data_provider = DataProvider(
    dataset=dataset,
    skip_validation=True,
    # batch_size=configs.batch_size,
    batch_size=8,
    data_preprocessors=[
        AudioReader(sample_rate=16000),
    ],
    transformers=[
        LabelIndexer(vocab),
    ],
    use_cache=False,
    batch_postprocessors=[
        AudioPadding(max_audio_length=configs.max_audio_length, padding_value=0, use_on_batch=True),
        LabelPadding(padding_value=len(vocab), use_on_batch=True),
    ],
    # batch_postprocessors=[
    #     AudioPadding(max_audio_length=246000, padding_value=0, use_on_batch=True),
    #     LabelPadding(padding_value=len(vocab), use_on_batch=True),
    # ],
    use_multiprocessing=True,
    max_queue_size=10,
    workers=configs.train_workers,
    # workers=20,
)

train_dataProvider, test_dataProvider = data_provider.split(split=0.9)

# for data in tqdm(data_provider):
#     print(data)

# train_dataProvider.augmentors = [
#     RandomAudioNoise(),
#     RandomAudioPitchShift(),
#     RandomAudioTimeStretch()
# ]

vocab = sorted(vocab)
configs.vocab = vocab
configs.save()

class CustomWav2Vec2Model(nn.Module):
    def __init__(self, hidden_states, dropout_rate=0.2, **kwargs):
        super(CustomWav2Vec2Model, self).__init__(**kwargs)
        pretrained_name = "facebook/wav2vec2-base-960h"
        self.model = Wav2Vec2ForCTC.from_pretrained(pretrained_name, vocab_size=hidden_states, ignore_mismatched_sizes=True)
        self.model.freeze_feature_encoder()  # this part does not need to be fine-tuned

    def forward(self, inputs):
        output = self.model(inputs, attention_mask=None).logits
        # Apply softmax
        output = F.log_softmax(output, -1)
        return output

custom_model = CustomWav2Vec2Model(hidden_states=len(vocab) + 1)

# put on cuda device if available
if torch.cuda.is_available():
    print('Cuda Device Available.')
    custom_model = custom_model.cuda()

# create callbacks
warmupCosineDecay = WarmupCosineDecay(
    lr_after_warmup=configs.lr_after_warmup,
    warmup_epochs=configs.warmup_epochs,
    decay_epochs=configs.decay_epochs,
    final_lr=configs.final_lr,
    initial_lr=configs.init_lr,
    verbose=True,
)
tb_callback = TensorBoard(configs.model_path + "/logs")
earlyStopping = EarlyStopping(monitor="val_CER", patience=16, mode="min", verbose=1)
modelCheckpoint = ModelCheckpoint(configs.model_path + "/model.pt", monitor="val_CER", mode="min", save_best_only=True, verbose=1)
model2onnx = Model2onnx(
    saved_model_path=configs.model_path + "/model.pt",
    input_shape=(1, configs.max_audio_length),
    verbose=1,
    metadata={"vocab": configs.vocab},
    dynamic_axes={"input": {0: "batch_size", 1: "sequence_length"}, "output": {0: "batch_size", 1: "sequence_length"}}
)

# create model object that will handle training and testing of the network
model = Model(
    custom_model,
    loss=CTCLoss(blank=len(configs.vocab), zero_infinity=True),
    optimizer=torch.optim.AdamW(custom_model.parameters(), lr=configs.init_lr, weight_decay=configs.weight_decay),
    metrics=[
        CERMetric(configs.vocab),
        WERMetric(configs.vocab)
    ],
    mixed_precision=configs.mixed_precision,
)

# Save training and validation datasets as csv files
train_dataProvider.to_csv(os.path.join(configs.model_path, "train.csv"))
test_dataProvider.to_csv(os.path.join(configs.model_path, "val.csv"))

model.fit(
    train_dataProvider,
    test_dataProvider,
    epochs=configs.train_epochs,
    callbacks=[
        warmupCosineDecay,
        tb_callback,
        earlyStopping,
        modelCheckpoint,
        model2onnx
    ]
)
Here is a sample of the dataset.
Hey, I tested it, and it seems there is an issue related to librosa. When using multiprocessing it doesn't log the error, which is why it was freezing for you. I'll make a fix and release a version with the bug fix. I'll let you know when you're good to go.
Try to do pip install mltu==1.1.6 and let me know if everything is working.
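If it still freezes, one way to surface errors that multiprocessing otherwise swallows is to iterate the provider single-threaded; a rough sketch, reusing the arguments from your training script (adjust as needed):

debug_provider = DataProvider(
    dataset=dataset,
    skip_validation=True,
    batch_size=8,
    data_preprocessors=[AudioReader(sample_rate=16000)],
    transformers=[LabelIndexer(vocab)],
    use_cache=False,
    batch_postprocessors=[
        AudioPadding(max_audio_length=configs.max_audio_length, padding_value=0, use_on_batch=True),
        LabelPadding(padding_value=len(vocab), use_on_batch=True),
    ],
    use_multiprocessing=False,  # single process, so worker exceptions are raised in the main thread
    workers=1,
)

for batch in tqdm(debug_provider):
    pass  # just iterate; a failing audio file or postprocessor will raise here instead of hanging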
I installed all the requirements in a new conda environment (Python 3.8) with mltu==1.1.6 and ran the training script. The pipeline is still getting stuck, this time at the validation step. Here's the console output.
Some weights of the model checkpoint at facebook/wav2vec2-base-960h were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_v', 'wav2vec2.encoder.pos_conv_embed.conv.weight_g']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized because the shapes did not match:
- lm_head.bias: found shape torch.Size([32]) in the checkpoint and torch.Size([29]) in the model instantiated
- lm_head.weight: found shape torch.Size([32, 768]) in the checkpoint and torch.Size([29, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Cuda Device Available.
INFO:WarmupCosineDecay:Epoch 1 - Learning Rate: 1e-08
Epoch 1 - loss: 26.0389 - CER: 4.3012 - WER: 1.0554: 100%|█| 18/18 [00:10<00:00,
  0%|          | 0/2 [00:00<?, ?it/s]
Exception in thread Thread-15 (identical tracebacks raised in Thread-14, Thread-16, Thread-17, Thread-20, Thread-21, Thread-22 and Thread-23):
Traceback (most recent call last):
  File "/home/ee/anaconda3/envs/mltu/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/ee/anaconda3/envs/mltu/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ee/anaconda3/envs/mltu/lib/python3.8/site-packages/mltu/torch/dataProvider.py", line 245, in worker_function
    result = self.function(data_index)
  File "/home/ee/anaconda3/envs/mltu/lib/python3.8/site-packages/mltu/dataProvider.py", line 287, in __getitem__
    batch_data, batch_annotations = batch_postprocessor(batch_data, batch_annotations)
  File "/home/ee/anaconda3/envs/mltu/lib/python3.8/site-packages/mltu/transformers.py", line 227, in __call__
    max_len = max([len(a) for a in audio])
ValueError: max() arg is an empty sequence
val_loss: 24.1364 - val_CER: 1.7218 - val_WER: 1.0000: 100%|█| 2/2 [00
Thanks, there was another bug in my code; you received this error because of the small validation dataset. But now, if you pip install mltu==1.1.7, this should be solved. I appreciate that you revealed these cases to me :)
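For reference, the underlying failure was just Python's max() being called on an empty list when a validation batch came back empty; a tiny illustration (not the actual library fix):

audio_batch = []  # an empty batch, as produced from the tiny validation split
try:
    max_len = max([len(a) for a in audio_batch])
except ValueError as error:
    print(error)  # max() arg is an empty sequence

# guarding against the empty batch avoids the crash
if audio_batch:
    max_len = max(len(a) for a in audio_batch)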
I upgraded to mltu==1.1.7 and everything is working perfectly, for both small and large datasets, with the default batch size. Thank you for taking the time to fix the bug.
Thank you for showing me these bugs, because of this, others won't see the same issues :)
Hi! I am trying to fine-tune the wav2vec2 model from your "10_wav2vec2_torch" tutorial. As far as I know, my dataset is in a similar format to the LJ Speech Dataset that you are using as an example. There is a 'wavs' folder which contains the audio files, and a 'metadata.csv' file that has rows of pipe-separated transcriptions. I have been able to successfully run the train.py script on the default dataset (LJ Speech Dataset), but when I use my own dataset, I get this output on the terminal. Am I missing something?
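An illustrative metadata.csv row in that format, with made-up content (the columns being file name, transcription, and normalized transcription, separated by |), would look like:

wav_0001|Hello, I need help with my internet bill.|hello i need help with my internet bill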