jtkim-kaist / VAD

Voice activity detection (VAD) toolkit including DNN, bDNN, LSTM and ACAM based VAD. We also provide our directly recorded dataset.

Question about the bdnn_transform function #23

Open cqjjjzr opened 5 years ago

cqjjjzr commented 5 years ago

Hi.
I'm rewriting this project in C++ for better interoperability, usability, and performance.

I have successfully implemented MRCG extraction and obtained a large performance boost along with a small memory footprint. However, I'm having trouble understanding the script that performs prediction. It involves a lot of array allocation, and I'd like to understand the purpose of every line so I can write a better implementation.

So, could you please kindly give an explanation of the bdnn_transform function?

import numpy as np

def bdnn_transform(inputs, w, u):
    """
    :param inputs: shape = (batch_size, feature_size)
    :param w: window half-width; decides the neighbor range
    :param u: subsampling step within the window
    :return: trans_inputs, shape = (batch_size, feature_size*len(neighbors))
    """

    # Frame offsets to gather: strided by u in the outer window,
    # dense (-1, 0, 1) around the center frame.
    neighbors_1 = np.arange(-w, -u, u)
    neighbors_2 = np.array([-1, 0, 1])
    neighbors_3 = np.arange(1 + u, w + 1, u)

    neighbors = np.concatenate((neighbors_1, neighbors_2, neighbors_3), axis=0)

    # Append 2*w rows of zeros; np.roll's wraparound below then yields
    # zeros for neighbors that fall outside the utterance.
    pad_size = 2 * w + inputs.shape[0]
    pad_inputs = np.zeros((pad_size, inputs.shape[1]))
    pad_inputs[0:inputs.shape[0], :] = inputs

    # For each offset d, shift the padded frames so that row t holds frame t+d.
    trans_inputs = [
        np.roll(pad_inputs, -1 * neighbors[i], axis=0)[0:inputs.shape[0], :]
        for i in range(neighbors.shape[0])]

    # (n_neighbors, batch, feat) -> (batch, n_neighbors, feat)
    # -> (batch, n_neighbors * feat)
    trans_inputs = np.asarray(trans_inputs)
    trans_inputs = np.transpose(trans_inputs, [1, 0, 2])
    trans_inputs = np.reshape(trans_inputs, (trans_inputs.shape[0], -1))

    return trans_inputs
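For reference, the neighbor offsets can be computed in isolation to see which frames get stacked (w = 19 and u = 9 here are illustrative values I picked; check the repo's configuration for the actual settings):

```python
import numpy as np

# Illustrative parameters, not necessarily the repo's defaults.
w, u = 19, 9

# Same construction as in bdnn_transform: strided offsets in the outer
# window, plus the dense (-1, 0, 1) triple around the center frame.
neighbors = np.concatenate((np.arange(-w, -u, u),
                            np.array([-1, 0, 1]),
                            np.arange(1 + u, w + 1, u)))
print(neighbors)  # offsets: -19, -10, -1, 0, 1, 10, 19
```

So the window is sampled densely near the current frame and coarsely (every u frames) farther out, and each output row concatenates the feature vectors at those offsets.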

Thanks in advance.

jtkim-kaist commented 5 years ago

Excellent! Thank you for your interest and contribution!

It has been a long time since I implemented it, so I don't remember the details exactly. The purpose, however, is to implement equation (7) in [1]; Fig. 2 in [1] is also a helpful reference.
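As a reading aid (this is my interpretation of the NumPy code above, not official code from the repo), the same transform can be written with explicit loops, which may map more directly onto a C++ port:

```python
import numpy as np

def bdnn_transform_loops(inputs, w, u):
    """Loop-based sketch of bdnn_transform: for each frame t, concatenate
    the feature vectors of frames t + d for every offset d in `neighbors`,
    substituting all-zero vectors where t + d falls outside the utterance."""
    neighbors = np.concatenate((np.arange(-w, -u, u),
                                np.array([-1, 0, 1]),
                                np.arange(1 + u, w + 1, u)))
    n_frames, n_feat = inputs.shape
    out = np.zeros((n_frames, n_feat * len(neighbors)))
    for t in range(n_frames):
        for j, d in enumerate(neighbors):
            s = t + d
            if 0 <= s < n_frames:  # in-range neighbor: copy its features
                out[t, j * n_feat:(j + 1) * n_feat] = inputs[s]
            # out-of-range neighbors stay zero (the zero padding)
    return out
```

If this reading is right, the np.roll/transpose/reshape sequence in the original is just a vectorized way of producing the same (batch_size, feature_size * n_neighbors) matrix.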

If I find some spare time, I can analyze the code in detail, but these days I'm too busy. Thank you!

[1] X. Zhang and D. Wang, "Boosting Contextual Information for Deep Neural Network Based Voice Activity Detection," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 2, pp. 252-264, Feb. 2016.