igormq / ctc_tensorflow_example

CTC + Tensorflow Example for ASR
MIT License

CTC on multiple training example #8

Closed ajason6208 closed 7 years ago

ajason6208 commented 7 years ago

In your program, you have one wav file and you extract a 13-dimensional MFCC feature => train_inputs. Second, you construct a label array like [19 8 5 ...] and convert it to a sparse representation (function: sparse_tuple_from). I do not know why the labels should be changed into a sparse representation.

In my case, I extract 14-dimensional MFCC features, but I have 8440 training files. I do not know how I could create one array to hold all of the features, because the number of frames differs from file to file. Please help me, thanks.

I like your CTC example; thank you for giving us such useful code.

mssmkmr commented 7 years ago

I learned from his program how to use CTC with multiple training examples as well, and it works well.

I do not know why should change label into Sparse representation.

Because TensorFlow's CTC API requires a SparseTensor as the label. Connectionist Temporal Classification (CTC) | TensorFlow
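For reference, here is a minimal, self-contained sketch of what such a conversion looks like (essentially what the repository's sparse_tuple_from helper does; exact signature details may differ):

```python
import numpy as np

def sparse_tuple_from(sequences, dtype=np.int32):
    """Convert a list of label sequences into the (indices, values, shape)
    triple that tf.SparseTensor / ctc_loss expects."""
    indices = []
    values = []
    for n, seq in enumerate(sequences):
        # one (batch_index, time_index) pair per label symbol
        indices.extend(zip([n] * len(seq), range(len(seq))))
        values.extend(seq)
    indices = np.asarray(indices, dtype=np.int64)
    values = np.asarray(values, dtype=dtype)
    shape = np.asarray([len(sequences), max(len(s) for s in sequences)],
                       dtype=np.int64)
    return indices, values, shape

# e.g. two label sequences of different lengths
indices, values, shape = sparse_tuple_from([[19, 8, 5], [7, 4]])
# shape is [2, 3]: 2 sequences, longest has 3 labels
```

The resulting triple can be fed directly to a tf.sparse_placeholder for the targets.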

I do not know how could I create an array to save all of feature because each frame is different of different file,

It is difficult for me to explain in a few words, so I will show you my code snippet:

    import numpy as np
    import librosa

    timestep_factor = 1000  # maximum number of frames per utterance
    zero_features = np.zeros((1, num_features)).tolist()  # one all-zero frame
    train_skip_idx = []
    train_seq_len = []
    train_inputs = []
    for i, l in enumerate(audio_filenames):  # audio_filenames is a list of wav files
        audio, sr = librosa.load(l, mono=True)
        inputs = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=num_features)
        inputs = inputs.transpose((1, 0))  # -> (timesteps, num_features)
        inputs = inputs.tolist()
        if len(inputs) < timestep_factor:
            train_seq_len.append(len(inputs))
            # zero-pad every utterance up to timestep_factor frames
            inputs.extend(zero_features * (timestep_factor - len(inputs)))
            train_inputs.append(inputs)
        else:
            # skip utterances too long for the fixed-size tensor
            train_skip_idx.append(i)

    train_labels = []
    for i, l in enumerate(targets_line):  # targets_line holds one transcript per file
        if i in train_skip_idx:
            continue  # keep labels aligned with the inputs we kept
        phones = l.split(' ')
        phones = list(filter(('').__ne__, phones))  # drop empty tokens
        train_labels.append([phonemes[x] for x in phones])

    # if you want to save features, you can do as follows
    np.save("train_inputs.npy", train_inputs)
    np.save("train_targets.npy", train_labels)
    np.save("train_seq_len.npy", train_seq_len)
    # ----

    train_targets = []
    for mb_i in range(int(len(train_labels) / mini_batch_size)):
        train_targets.append(sparse_tuple_from(
            train_labels[mb_i * mini_batch_size:(mb_i + 1) * mini_batch_size],
            num_classes))
    # snip
    for mb_num in range(num_batches_per_epoch):
        feed = {inputs: train_inputs[mb_num * mini_batch_size:(mb_num + 1) * mini_batch_size],
                targets: train_targets[mb_num],
                seq_len: train_seq_len[mb_num * mini_batch_size:(mb_num + 1) * mini_batch_size]}

        batch_cost, _ = session.run([cost, train_op], feed)
        train_cost += batch_cost
        train_ler += session.run(ler, feed_dict=feed)
    train_cost /= num_batches_per_epoch
    train_ler /= num_batches_per_epoch
    # snip

That is my understanding; if you have any questions, feel free to ask me.

igormq commented 7 years ago

Hi, @ajason6208. Thank you for the question.

Answering your questions.

I do not know why should change label into Sparse representation.

I changed the labels to a sparse representation because ctc_loss requires that; it's in the documentation.

In my case, I extract MFCC feature 14 dimension , but I have 8440 training data. I do not know how could I create an array to save all of feature because each frame is different of different file, Please help me tks.

There are several approaches to do that:

- You can append zeros and create one N x max_timesteps x n_features tensor for all your training data, but this requires more memory.
- You can create buckets (this is done in one of the examples made by the TensorFlow team).
- You can read N_per_batch audio files at a time, generate the features, and build each batch dynamically, appending the zeros per batch.
- Or you can save all your training data in one T x n_features matrix, appending each training example along the timestep axis to create one giant matrix; keep a record of each example's number of timesteps and then read it back accordingly.
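A minimal sketch of the dynamic per-batch padding approach (the pad_batch helper below is hypothetical, not part of the repository):

```python
import numpy as np

def pad_batch(feature_list):
    """Zero-pad a list of (timesteps_i, num_features) arrays into a single
    (batch, max_timesteps, num_features) tensor, returning true lengths too."""
    num_features = feature_list[0].shape[1]
    seq_lens = np.array([f.shape[0] for f in feature_list], dtype=np.int32)
    padded = np.zeros((len(feature_list), seq_lens.max(), num_features),
                      dtype=np.float32)
    for i, f in enumerate(feature_list):
        padded[i, :f.shape[0], :] = f  # copy real frames; rest stays zero
    return padded, seq_lens

# two hypothetical utterances: 5 and 3 frames of 14-dim MFCCs
feats = [np.ones((5, 14)), np.ones((3, 14))]
padded, seq_lens = pad_batch(feats)
# padded.shape == (2, 5, 14); seq_lens == [5, 3]
```

Because each batch is padded only to its own longest utterance, no memory is wasted padding everything to a global maximum, and seq_lens can be fed straight into the seq_len placeholder.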

I'll commit code demonstrating one of these methods. Stay tuned.

igormq commented 7 years ago

I made a commit f2f935e6b1906df2543b4ed794286427870995d5 modifying the original code to support multiple data as input.

igormq commented 7 years ago

Close #8