keras-team / keras-io

Keras documentation, hosted live at keras.io
Apache License 2.0

Need example for inference in transformer_asr.py #537

Closed — ajohnson49 closed this issue 1 year ago

ajohnson49 commented 3 years ago

The example in transformer_asr.py is really interesting, but it only shows how to train the model. Can someone give an example of how inference is done with the model to get text transcripts? Just calling model.predict() doesn't seem to be supported for this model.

ajohnson49 commented 3 years ago

I might have figured out a really inefficient way of doing it using the modules called during training. If anyone has a better way, lemme know:

one_ex = {'audio': 'datasets/LJSpeech-1.1/wavs/LJ007-0224.wav', 'text': 'whatever the text is'}
two_ex = {'audio': 'datasets/LJSpeech-1.1/wavs/LJ029-0032.wav', 'text': 'whatever that text is'}
three_ex = {'audio': 'datasets/LJSpeech-1.1/wavs/LJ040-0007.wav', 'text': 'whatever this text is'}

eval_data = [one_ex, two_ex, three_ex]

eval_ds = create_tf_dataset(eval_data, bs=3)

for ex in eval_ds:
    source = ex['source']
    target = ex['target']
    print(ex['target'])
    break

target_start_token_idx = 2
target_end_token_idx = 3

bs = tf.shape(source)[0]
preds = model.generate(source, target_start_token_idx)
preds = preds.numpy()
for i in range(bs):
    target_text = "".join([idx_to_char[_] for _ in target[i, :]])
    prediction = ""
    for idx in preds[i, :]:
        prediction += idx_to_char[idx]
        if idx == target_end_token_idx:
            break
    print(f"target:     {target_text.replace('-', '')}")
    print(f"prediction: {prediction}\n")
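If it helps, the same logic can be wrapped into a small helper (just a sketch; transcribe_batch is a made-up name, and idx_to_char is the list returned by vectorizer.get_vocabulary() in the example):

def transcribe_batch(model, source, idx_to_char, start_idx=2, end_idx=3):
    # greedy-decode one batch with model.generate and map indices back to characters
    preds = model.generate(source, start_idx).numpy()
    texts = []
    for row in preds:
        chars = []
        for idx in row[1:]:  # skip the start token
            if idx == end_idx:
                break
            chars.append(idx_to_char[idx])
        texts.append("".join(chars))
    return texts

print(transcribe_batch(model, source, idx_to_char))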

BernardoOlisan commented 2 years ago

@ajohnson49 Yes, generate works like that. Let me ask you a question: does prediction work with new data? It seems to give awful predictions on data that is not from the training audio. How was your experience with that?

ajohnson49 commented 2 years ago

Yeah, it worked for me in the end. The dataset included in the example is tiny, so it doesn't generalize well. I pointed the training towards the 100hr set in LibriSpeech, so it does much better for outside data now.

BernardoOlisan commented 2 years ago

@ajohnson49 Thank you, so the key to getting better results on outside data is to use another dataset. One other question: why does training on 12,000 examples give poor results? Shouldn't they be enough?

ajohnson49 commented 2 years ago

I haven't looked at this code in months, but if I remember correctly, it's using around 25 hours of audio data for training. A lot of the standard ASR systems use the 1000 hour LibriSpeech dataset. I think you need at least 100 hours of audio data to start approaching good ASR results for general applications, so 25 hours isn't even close. I think the dataset used in the example is just for show, and you need to use a much bigger dataset for good generalizable results.

BernardoOlisan commented 2 years ago

@ajohnson49 Hello again, I have another problem. I'm using the LibriSpeech 500-hour set for training, but I'm getting loss: nan each epoch and the predictions are ---------. Did this happen to you too? I am using the same parameters as the Keras docs.

ajohnson49 commented 2 years ago

There are a lot of things that could cause that: a NaN in the input, improperly formatted input, a divide by zero somewhere in the data preprocessing, a vanishing or exploding gradient, an issue with the loss function or the hyperparameters used, etc. You'd have to do some debugging to find the issue. I would look at the output of each step in the code to see the first place the NaNs come up.
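Something like this is what I mean, as a rough sketch (assuming ds_train is the tf.data pipeline built by create_tf_dataset, as in the example):

import tensorflow as tf

for batch in ds_train.take(1):
    spec = batch["source"]
    print("source shape:", spec.shape)
    print("source has NaNs:", bool(tf.reduce_any(tf.math.is_nan(spec))))
    print("target min/max:", int(tf.reduce_min(batch["target"])), int(tf.reduce_max(batch["target"])))

If the spectrograms already contain NaNs at this point, the problem is in the preprocessing rather than in the model.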

BernardoOlisan commented 2 years ago

This is the notebook with the whole code I'm using: https://www.kaggle.com/code/bernardoolisan/speechrecognition-dot. If you run it, it will show loss: nan and the predictions after epoch 1 will be <------>.

ajohnson49 commented 2 years ago

I don't see where you're using LibriSpeech. It looks like you're using the same LJ Speech dataset. Different datasets might be set up differently and need to be preprocessed differently, so there may be something incompatible about the way it's being read in. It's hard to tell without looking at the input files, though. You may just want to print out the output of each step to see where the nans occur.

BernardoOlisan commented 2 years ago

@ajohnson49 Look at https://www.kaggle.com/code/bernardoolisan/speechrecognition-dot, it is not the LJ Speech dataset, it is LibriSpeech.

BernardoOlisan commented 2 years ago

@ajohnson49 Maybe. Can you share the parameters of your code, or don't you have it anymore? And how can I see the output of each step?

ajohnson49 commented 2 years ago

I say it looks like you're still using the LJ Speech dataset because of the way the code is formatted. For LibriSpeech, you would need to read the transcript from the trans.txt file for each .flac file, which I didn't see in your code. I only did a Ctrl+F search, so I might have missed something. But in general, you would need something like:

my_train_dataset = glob('../path/to/Librispeech/train-clean-100/*/*/*.flac')

my_train_list = []
for flac in my_train_dataset:
    trans_file_path = os.path.dirname(flac)  # get dir name
    trans_file = glob(trans_file_path + '/*trans.txt')[0]  # get transcript file name
    trans_file_open = open(trans_file)
    trans_file_lines = trans_file_open.readlines()
    flac_num = flac.split('/')[-1].replace('.flac', '')  # get corresponding transcript line
    get_text = [line for line in trans_file_lines if flac_num in line]  # get transcript
    dictionary = {'audio': flac, 'text': get_text[0]}  # create dictionary
    my_train_list.append(dictionary)

ds_train = create_tf_dataset(my_train_list, bs=64)  # create tf dataset object

and you can just print the output of any step in the code
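For example, something like this (a sketch only; path_to_audio and my_train_list are the names from the example and the snippet above):

import tensorflow as tf

example = my_train_list[0]
print(example["text"])
spec = path_to_audio(tf.constant(example["audio"]))  # run one file through the preprocessing
print("spectrogram shape:", spec.shape)
print("any NaNs:", bool(tf.reduce_any(tf.math.is_nan(spec))))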

BernardoOlisan commented 2 years ago

Yes, it is basically that. The only thing that changes is that instead of .flac files they are .wav files, but it is the same trans.txt.

BernardoOlisan commented 2 years ago

@ajohnson49 What was your learning rate with LibriSpeech? Did you change any params when training with LibriSpeech?

ajohnson49 commented 2 years ago

I don't think I changed anything else in the code. You might benefit from optimizing the hyperparameters, though.
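One cheap thing that is sometimes worth trying against NaN losses (not in the example, just a suggestion) is gradient clipping on the optimizer, e.g.:

from tensorflow import keras

# learning_rate and loss_fn are the schedule and loss from the example;
# clipnorm caps the gradient norm so a single bad batch can't blow up the weights
optimizer = keras.optimizers.Adam(learning_rate, clipnorm=1.0)
model.compile(optimizer=optimizer, loss=loss_fn)

The same clipnorm argument works on SGD if that's the optimizer you're using.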

BernardoOlisan commented 2 years ago

@ajohnson49 I think the problem is normalizing my data. Did you normalize your LibriSpeech data, or did you just use it as-is? Is there a way you can share your code or something? It would be nice to see it.

ajohnson49 commented 2 years ago

I'm not sure what you mean by normalize. You can scale the inputs to the range -1 to 1 if you want; that can help reduce the exploding gradient problem in some cases. Not sure if it makes a difference here, though, since the code already has normalization. I don't have my original code, so I just implemented this example quickly:

#!/usr/bin/env python

# coding: utf-8

Automatic Speech Recognition with Transformer

Author: Apoorv Nandan

Date created: 2021/01/13

Last modified: 2021/01/13

Description: Training a sequence-to-sequence Transformer for automatic speech recognition.

Introduction

Automatic speech recognition (ASR) consists of transcribing audio speech segments into text.

ASR can be treated as a sequence-to-sequence problem, where the

audio can be represented as a sequence of feature vectors

and the text as a sequence of characters, words, or subword tokens.

For this demonstration, we will use the LJSpeech dataset from the

LibriVox project. It consists of short

audio clips of a single speaker reading passages from 7 non-fiction books.

Our model will be similar to the original Transformer (both encoder and decoder)

as proposed in the paper, "Attention is All You Need".

References:

- Attention is All You Need

- Very Deep Self-Attention Networks for End-to-End Speech Recognition

- Speech Transformers

- LJSpeech Dataset

In[1]:

import re
import os
import random
from glob import glob
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

os.environ["CUDA_VISIBLE_DEVICES"] = "1"

Define the Transformer Input Layer

When processing past target tokens for the decoder, we compute the sum of

position embeddings and token embeddings.

When processing audio features, we apply convolutional layers to downsample

them (via convolution strides) and process local relationships.

In[2]:

class TokenEmbedding(layers.Layer):
    def __init__(self, num_vocab=1000, maxlen=100, num_hid=64):
        super().__init__()
        self.emb = tf.keras.layers.Embedding(num_vocab, num_hid)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=num_hid)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        x = self.emb(x)
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        return x + positions


class SpeechFeatureEmbedding(layers.Layer):
    def __init__(self, num_hid=64, maxlen=100):
        super().__init__()
        self.conv1 = tf.keras.layers.Conv1D(
            num_hid, 11, strides=2, padding="same", activation="relu"
        )
        self.conv2 = tf.keras.layers.Conv1D(
            num_hid, 11, strides=2, padding="same", activation="relu"
        )
        self.conv3 = tf.keras.layers.Conv1D(
            num_hid, 11, strides=2, padding="same", activation="relu"
        )
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=num_hid)

    def call(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        return self.conv3(x)

Transformer Encoder Layer

In[3]:

class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, num_heads, feed_forward_dim, rate=0.1):
        super().__init__()
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = keras.Sequential(
            [
                layers.Dense(feed_forward_dim, activation="relu"),
                layers.Dense(embed_dim),
            ]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training):
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

Transformer Decoder Layer

In[4]:

class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, num_heads, feed_forward_dim, dropout_rate=0.1):
        super().__init__()
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = layers.LayerNormalization(epsilon=1e-6)
        self.self_att = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.enc_att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.self_dropout = layers.Dropout(0.5)
        self.enc_dropout = layers.Dropout(0.1)
        self.ffn_dropout = layers.Dropout(0.1)
        self.ffn = keras.Sequential(
            [
                layers.Dense(feed_forward_dim, activation="relu"),
                layers.Dense(embed_dim),
            ]
        )

    def causal_attention_mask(self, batch_size, n_dest, n_src, dtype):
        """Masks the upper half of the dot product matrix in self attention.

        This prevents flow of information from future tokens to current token.
        1's in the lower triangle, counting from the lower right corner.
        """
        i = tf.range(n_dest)[:, None]
        j = tf.range(n_src)
        m = i >= j - n_src + n_dest
        mask = tf.cast(m, dtype)
        mask = tf.reshape(mask, [1, n_dest, n_src])
        mult = tf.concat(
            [tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)], 0
        )
        return tf.tile(mask, mult)

    def call(self, enc_out, target):
        input_shape = tf.shape(target)
        batch_size = input_shape[0]
        seq_len = input_shape[1]
        causal_mask = self.causal_attention_mask(batch_size, seq_len, seq_len, tf.bool)
        target_att = self.self_att(target, target, attention_mask=causal_mask)
        target_norm = self.layernorm1(target + self.self_dropout(target_att))
        enc_out = self.enc_att(target_norm, enc_out)
        enc_out_norm = self.layernorm2(self.enc_dropout(enc_out) + target_norm)
        ffn_out = self.ffn(enc_out_norm)
        ffn_out_norm = self.layernorm3(enc_out_norm + self.ffn_dropout(ffn_out))
        return ffn_out_norm

Complete the Transformer model

Our model takes audio spectrograms as inputs and predicts a sequence of characters.

During training, we give the decoder the target character sequence shifted to the left

as input. During inference, the decoder uses its own past predictions to predict the

next token.

In[5]:

class Transformer(keras.Model):
    def __init__(
        self,
        num_hid=64,
        num_head=2,
        num_feed_forward=128,
        source_maxlen=100,
        target_maxlen=100,
        num_layers_enc=4,
        num_layers_dec=1,
        num_classes=10,
    ):
        super().__init__()
        self.loss_metric = keras.metrics.Mean(name="loss")
        self.num_layers_enc = num_layers_enc
        self.num_layers_dec = num_layers_dec
        self.target_maxlen = target_maxlen
        self.num_classes = num_classes

        self.enc_input = SpeechFeatureEmbedding(num_hid=num_hid, maxlen=source_maxlen)
        self.dec_input = TokenEmbedding(
            num_vocab=num_classes, maxlen=target_maxlen, num_hid=num_hid
        )

        self.encoder = keras.Sequential(
            [self.enc_input]
            + [
                TransformerEncoder(num_hid, num_head, num_feed_forward)
                for _ in range(num_layers_enc)
            ]
        )

        for i in range(num_layers_dec):
            setattr(
                self,
                f"dec_layer_{i}",
                TransformerDecoder(num_hid, num_head, num_feed_forward),
            )

        self.classifier = layers.Dense(num_classes)

    def decode(self, enc_out, target):
        y = self.dec_input(target)
        for i in range(self.num_layers_dec):
            y = getattr(self, f"dec_layer_{i}")(enc_out, y)
        return y

    def call(self, inputs):
        source = inputs[0]
        target = inputs[1]
        x = self.encoder(source)
        y = self.decode(x, target)
        return self.classifier(y)

    @property
    def metrics(self):
        return [self.loss_metric]

    def train_step(self, batch):
        """Processes one batch inside model.fit()."""
        source = batch["source"]
        target = batch["target"]
        dec_input = target[:, :-1]
        dec_target = target[:, 1:]
        with tf.GradientTape() as tape:
            preds = self([source, dec_input])
            one_hot = tf.one_hot(dec_target, depth=self.num_classes)
            mask = tf.math.logical_not(tf.math.equal(dec_target, 0))
            loss = self.compiled_loss(one_hot, preds, sample_weight=mask)
        trainable_vars = self.trainable_variables
        gradients = tape.gradient(loss, trainable_vars)
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))
        self.loss_metric.update_state(loss)
        return {"loss": self.loss_metric.result()}

    def test_step(self, batch):
        source = batch["source"]
        target = batch["target"]
        dec_input = target[:, :-1]
        dec_target = target[:, 1:]
        preds = self([source, dec_input])
        one_hot = tf.one_hot(dec_target, depth=self.num_classes)
        mask = tf.math.logical_not(tf.math.equal(dec_target, 0))
        loss = self.compiled_loss(one_hot, preds, sample_weight=mask)
        self.loss_metric.update_state(loss)
        return {"loss": self.loss_metric.result()}

    def generate(self, source, target_start_token_idx):
        """Performs inference over one batch of inputs using greedy decoding."""
        bs = tf.shape(source)[0]
        enc = self.encoder(source)
        dec_input = tf.ones((bs, 1), dtype=tf.int32) * target_start_token_idx
        dec_logits = []
        for i in range(self.target_maxlen - 1):
            dec_out = self.decode(enc, dec_input)
            logits = self.classifier(dec_out)
            logits = tf.argmax(logits, axis=-1, output_type=tf.int32)
            last_logit = tf.expand_dims(logits[:, -1], axis=-1)
            dec_logits.append(last_logit)
            dec_input = tf.concat([dec_input, last_logit], axis=-1)
        return dec_input

Download the dataset

Note: This requires ~3.6 GB of disk space and

takes ~5 minutes for the extraction of files.

In[6]:

keras.utils.get_file(
    os.path.join(os.getcwd(), "data.tar.gz"),
    "https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2",
    extract=True,
    archive_format="tar",
    cache_dir=".",
)

saveto = "../home/datasets/LJSpeech-1.1" wavs = glob("{}/*/.wav".format(saveto), recursive=True)

id_to_text = {} with open(os.path.join(saveto, "metadata.csv"), encoding="utf-8") as f: for line in f: id = line.strip().split("|")[0] text = line.strip().split("|")[2] id_to_text[id] = text

def get_data(wavs, id_to_text, maxlen=50): """ returns mapping of audio paths and transcription texts """ data = [] for w in wavs: id = w.split("/")[-1].split(".")[0] if len(id_to_text[id]) < maxlen: data.append({"audio": w, "text": id_to_text[id]}) return data

Preprocess the dataset

In[58]:

import numpy as np
import tensorflow_io as tfio

class VectorizeChar:
    def __init__(self, max_len=50):
        self.vocab = (
            ["-", "#", "<", ">"]
            + [chr(i + 96) for i in range(1, 27)]
            + [" ", ".", ",", "?"]
        )
        self.max_len = max_len
        self.char_to_idx = {}
        for i, ch in enumerate(self.vocab):
            self.char_to_idx[ch] = i

    def __call__(self, text):
        text = text.lower()
        text = text[: self.max_len - 2]
        text = "<" + text + ">"
        pad_len = self.max_len - len(text)
        return [self.char_to_idx.get(ch, 1) for ch in text] + [0] * pad_len

    def get_vocabulary(self):
        return self.vocab

max_target_len = 200  # all transcripts in our data are < 200 characters
data = get_data(wavs, id_to_text, max_target_len)
vectorizer = VectorizeChar(max_target_len)
print("vocab size", len(vectorizer.get_vocabulary()))
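# (not part of the original script) quick sanity check on the vectorizer:
# '<' is index 2 and '>' is index 3 in this vocabulary, which is why
# target_start_token_idx=2 and target_end_token_idx=3 are used later on
print(vectorizer.get_vocabulary()[:6])
print(vectorizer("hello")[:10])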

def create_text_ds(data):
    texts = [_["text"] for _ in data]
    text_ds = [vectorizer(t) for t in texts]
    text_ds = tf.data.Dataset.from_tensor_slices(text_ds)
    return text_ds

def path_to_audio(path):
    # spectrogram using stft
    audio = tf.io.read_file(path)
    audio = tfio.audio.decode_flac(audio, dtype=tf.int16)  # audio, _ = tf.audio.decode_wav(audio, 1)
    audio = tf.cast(audio, tf.float32)
    audio = tf.squeeze(audio, axis=-1)
    stfts = tf.signal.stft(audio, frame_length=200, frame_step=80, fft_length=256)
    x = tf.math.pow(tf.abs(stfts), 0.5)
    # normalisation
    means = tf.math.reduce_mean(x, 1, keepdims=True)
    stddevs = tf.math.reduce_std(x, 1, keepdims=True)
    x = (x - means) / stddevs
    value_not_nan = tf.dtypes.cast(tf.math.logical_not(tf.math.is_nan(x)), dtype=tf.float32)
    x = tf.math.multiply_no_nan(x, value_not_nan)  # zero out any NaNs left by the normalisation
    audio_len = tf.shape(x)[0]
    # padding to 10 seconds
    pad_len = 2754
    paddings = tf.constant([[0, pad_len], [0, 0]])
    x = tf.pad(x, paddings, "CONSTANT")[:pad_len, :]
    return x

def create_audio_ds(data):
    flist = [_["audio"] for _ in data]
    audio_ds = tf.data.Dataset.from_tensor_slices(flist)
    audio_ds = audio_ds.map(
        path_to_audio, num_parallel_calls=tf.data.experimental.AUTOTUNE
    )
    return audio_ds

def create_tf_dataset(data, bs=4):
    audio_ds = create_audio_ds(data)
    text_ds = create_text_ds(data)
    ds = tf.data.Dataset.zip((audio_ds, text_ds))
    ds = ds.map(lambda x, y: {"source": x, "target": y})
    ds = ds.batch(bs)
    ds = ds.prefetch(tf.data.experimental.AUTOTUNE)
    return ds

split = int(len(data) * 0.99)

train_data = data[:split]

test_data = data[split:]

ds = create_tf_dataset(train_data, bs=64)

val_ds = create_tf_dataset(test_data, bs=4)
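# (not part of the original script) peek at one batch to confirm the shapes before training
for batch in ds.take(1):
    print("source:", batch["source"].shape, "target:", batch["target"].shape)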

In[8]:

my_train_dataset = glob('../../data/Databases/LibriSpeech/Librispeech/train-clean-100/103/*/*.flac')
my_val_dataset = glob('../../data/Databases/LibriSpeech/Librispeech/train-clean-100/8630/*/*.flac')
my_test_dataset = glob('../../data/Databases/LibriSpeech/Librispeech/train-clean-100/1502/*/*.flac')

do it for the whole set if you want to use the whole database

this is just an example for a few of the folders in the database

Get the myst dataset and create the list of dictionaries

In[59]:

my_train_list = []
for flac in my_train_dataset:
    trans_file_path = os.path.dirname(flac)
    trans_file = glob(trans_file_path + '/*trans.txt')[0]
    trans_file_open = open(trans_file)
    trans_file_lines = trans_file_open.readlines()
    flac_num = flac.split('/')[-1].replace('.flac', '')
    get_text = [line for line in trans_file_lines if flac_num in line]
    dictionary = {'audio': flac, 'text': get_text[0].replace(flac_num, '')}
    my_train_list.append(dictionary)

ds_train = create_tf_dataset(my_train_list, bs=64)

In[60]:

my_val_list = []

for flac in my_val_dataset:
    trans_file_path = os.path.dirname(flac)
    trans_file = glob(trans_file_path + '/*trans.txt')[0]
    trans_file_open = open(trans_file)
    trans_file_lines = trans_file_open.readlines()
    flac_num = flac.split('/')[-1].replace('.flac', '')
    get_text = [line for line in trans_file_lines if flac_num in line]
    dictionary = {'audio': flac, 'text': get_text[0].replace(flac_num, '')}
    my_val_list.append(dictionary)

ds_val = create_tf_dataset(my_val_list, bs=5)

In[61]:

my_test_list = []

for flac in my_test_dataset:
    trans_file_path = os.path.dirname(flac)
    trans_file = glob(trans_file_path + '/*trans.txt')[0]
    trans_file_open = open(trans_file)
    trans_file_lines = trans_file_open.readlines()
    flac_num = flac.split('/')[-1].replace('.flac', '')
    get_text = [line for line in trans_file_lines if flac_num in line]
    dictionary = {'audio': flac, 'text': get_text[0].replace(flac_num, '')}
    my_test_list.append(dictionary)

ds_test = create_tf_dataset(my_test_list, bs=32)

Callbacks to display predictions

In[31]:

class DisplayOutputs(keras.callbacks.Callback):
    def __init__(
        self, batch, idx_to_token, target_start_token_idx=27, target_end_token_idx=28
    ):
        """Displays a batch of outputs after every epoch

        Args:
            batch: A test batch containing the keys "source" and "target"
            idx_to_token: A List containing the vocabulary tokens corresponding to their indices
            target_start_token_idx: A start token index in the target vocabulary
            target_end_token_idx: An end token index in the target vocabulary
        """
        self.batch = batch
        self.target_start_token_idx = target_start_token_idx
        self.target_end_token_idx = target_end_token_idx
        self.idx_to_char = idx_to_token

    def on_epoch_end(self, epoch, logs=None):
        if epoch % 5 != 0:
            return
        source = self.batch["source"]
        target = self.batch["target"].numpy()
        bs = tf.shape(source)[0]
        preds = self.model.generate(source, self.target_start_token_idx)
        preds = preds.numpy()
        for i in range(bs):
            target_text = "".join([self.idx_to_char[_] for _ in target[i, :]])
            prediction = ""
            for idx in preds[i, :]:
                prediction += self.idx_to_char[idx]
                if idx == self.target_end_token_idx:
                    break
            print(f"target:     {target_text.replace('-','')}")
            print(f"prediction: {prediction}\n")

Learning rate schedule

In[32]:

class CustomSchedule(keras.optimizers.schedules.LearningRateSchedule):
    def __init__(
        self,
        init_lr=0.00001,
        lr_after_warmup=0.001,
        final_lr=0.00001,
        warmup_epochs=15,
        decay_epochs=85,
        steps_per_epoch=203,
    ):
        super().__init__()
        self.init_lr = init_lr
        self.lr_after_warmup = lr_after_warmup
        self.final_lr = final_lr
        self.warmup_epochs = warmup_epochs
        self.decay_epochs = decay_epochs
        self.steps_per_epoch = steps_per_epoch

    def calculate_lr(self, epoch):
        """ linear warm up - linear decay """
        warmup_lr = (
            self.init_lr
            + ((self.lr_after_warmup - self.init_lr) / (self.warmup_epochs - 1)) * epoch
        )
        decay_lr = tf.math.maximum(
            self.final_lr,
            self.lr_after_warmup
            - (epoch - self.warmup_epochs)
            * (self.lr_after_warmup - self.final_lr)
            / (self.decay_epochs),
        )
        return tf.math.minimum(warmup_lr, decay_lr)

    def __call__(self, step):
        epoch = step // self.steps_per_epoch
        return self.calculate_lr(epoch)
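# (not part of the original script) sanity-check the schedule by printing the
# learning rate at a few steps; steps_per_epoch=203 is just the class default
sched = CustomSchedule(steps_per_epoch=203)
for step in [0, 203 * 5, 203 * 15, 203 * 50, 203 * 99]:
    print(step, float(sched(step)))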

Create & train the end-to-end model

In[ ]:

batch = next(iter(ds_val))

# The vocabulary to convert predicted indices into characters
idx_to_char = vectorizer.get_vocabulary()
display_cb = DisplayOutputs(
    batch, idx_to_char, target_start_token_idx=2, target_end_token_idx=3
)  # set the arguments as per vocabulary index for '<' and '>'

model = Transformer(
    num_hid=200,
    num_head=2,
    num_feed_forward=400,
    target_maxlen=max_target_len,
    num_layers_enc=4,
    num_layers_dec=1,
    num_classes=34,
)
loss_fn = tf.keras.losses.CategoricalCrossentropy(
    from_logits=True,
    label_smoothing=0.1,
)

learning_rate = CustomSchedule(
    init_lr=0.00001,
    lr_after_warmup=0.001,
    final_lr=0.00001,
    warmup_epochs=15,
    decay_epochs=85,
    steps_per_epoch=len(ds_train),
)
optimizer = keras.optimizers.SGD(learning_rate=learning_rate)
model.compile(optimizer=optimizer, loss=loss_fn)

history = model.fit(ds_train, validation_data=ds_val, callbacks=[display_cb], epochs=100)
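# (not part of the original script) save the trained weights so generate() can be
# used later without retraining; the checkpoint path here is arbitrary
model.save_weights("asr_transformer_ckpt")
# later: rebuild Transformer(...) with the same arguments, run one batch through it
# to build the variables, then call model.load_weights("asr_transformer_ckpt")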

In[ ]:

model.summary()

In[39]:

from glob import glob
import soundfile as sf
import numpy as np
import pandas as pd

def get_eval(eval_ds):
    for ex in eval_ds:
        source = ex['source']
        target = ex['target']
        break

    target_start_token_idx = 2
    target_end_token_idx = 3

    bs = tf.shape(source)[0]
    preds = model.generate(source, target_start_token_idx)
    preds = preds.numpy()
    for i in range(bs):
        target_text = "".join([idx_to_char[_] for _ in target[i, :]])
        prediction = ""
        for idx in preds[i, :]:
            prediction += idx_to_char[idx]
            if idx == target_end_token_idx:
                break
        print(f"target:     {target_text.replace('-','')}")
        print(f"prediction: {prediction}\n")

In[ ]:

shift = 100
left = 0
right = shift

while left < len(my_test_list):
    try_list = my_test_list[left:right]
    ds_t = create_tf_dataset(try_list, bs=shift)
    get_eval(ds_t)
    left += shift
    right += shift

In[ ]:

BernardoOlisan commented 2 years ago

@ajohnson49 I found out why you were seeing LJSpeech. I updated the notebook, here it is: https://www.kaggle.com/code/bernardoolisan/speechrecognition-dot. Maybe you can check it out.

SuryanarayanaY commented 1 year ago

Hi @ajohnson49 ,

In the tutorial there is no code provided for inference after training because, as you can see, the model is trained for only epochs=1, so the inference results might not look good. In practice, the model would need to be trained for many more epochs to reach the desired accuracy, which takes much more time. Hence the author left it to the user's choice, if I am not wrong.

Maybe you can use simple code like model.predict(ds.take(1)) to check the results for a single batch.

But I can see now that at the end the author provided some prediction and target data at around 30 epochs to visualise the results.

I suspect the limited number of epochs is why the author did not add model inference results to the example.

Thanks!

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 14 days with no activity. It will be closed if no further activity occurs. Thank you.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 28 days. Please reopen if you'd like to work on this further.