speechLabBcCuny / messlJsalt15

MESSL wrappers etc for JSALT 2015, including CHiME3

Implement a new objective function in keras #10

Closed nateanl closed 7 years ago

nateanl commented 7 years ago

The objective function is described in this paper: http://www.jonathanleroux.org/pdf/Erdogan2015ICASSP04.pdf The formula is: D_psa(a) = (a * |y| − |s| * cos(θ))^2, where θ is the phase difference between the clean and noisy signals. In keras, the only arguments of a loss function are y_true and y_pred. To fit this function into keras, we need to supply two more quantities: the noisy speech spectrogram and the clean speech spectrogram.
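One common workaround for the two-argument restriction is to pack the extra spectrograms into `y_true` and slice them apart inside the loss. A minimal sketch, assuming the targets concatenate |s|·cos(θ) and |y| along the feature axis (the factory name `make_psa_loss` and the packing convention are assumptions, not code from this repo):

```python
import keras.backend as K

def make_psa_loss(n_bins):
    def psa_loss(y_true, y_pred):
        # y_pred is the mask a; y_true packs [|s|*cos(theta), |y|] along the last axis.
        s_cos = y_true[..., :n_bins]      # |s| * cos(theta)
        y_mag = y_true[..., n_bins:]      # |y|
        return K.mean(K.square(y_pred * y_mag - s_cos), axis=-1)
    return psa_loss

# e.g. model.compile(optimizer='rmsprop', loss=make_psa_loss(513))
```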

mim commented 7 years ago

Please post the whole equation here so it's easy to refer to. I assume that keras has many objective functions, so this isn't overriding the one objective function as much as implementing a new one.

nateanl commented 7 years ago

I fixed some things in the loss function; for now it only uses a single model and the output dimension is 1026. But we are currently building two models: one tries to predict the mask and the other just outputs its input unchanged. We can merge the two into one model and use the loss function above. So far I have only used 10 files as input, and the prediction is not bad. The first figure is the ground truth, which is the ideal amplitude mask; the second is the prediction.

[figure: target mask]

[figure: prediction]
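A sketch of the merged-model idea with the Keras functional API, which would also account for a 1026-dimensional output (513 mask bins plus 513 pass-through bins); the layer sizes, names, and the 513-bin assumption are guesses, not the actual model in this repo:

```python
from keras.layers import Input, LSTM, Dense, Concatenate
from keras.models import Model

n_bins = 513  # assuming 1026 = mask (513) + noisy spectrum (513)

noisy = Input(shape=(None, n_bins))                # noisy magnitude frames
h = LSTM(256, return_sequences=True)(noisy)        # hidden size is a guess
mask = Dense(n_bins, activation='sigmoid')(h)      # branch 1: predicted mask a
merged = Concatenate(axis=-1)([mask, noisy])       # branch 2: pass |y| through
model = Model(inputs=noisy, outputs=merged)        # 1026-dim output for the loss
```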

mim commented 7 years ago

Looks good for 10 files, but do it for real now on all of the files.

nateanl commented 7 years ago

Experiment setting:

- layers: one LSTM, one Dense
- batch size for training: 128
- number of epochs: 50
- optimizer: RMSprop
- loss function: (a * |y| − |s|)^2
- input: "normalized" spectrum (divided by the maximum of the absolute values)
- ground truth: clean spectrum and noisy spectrum concatenated

Data:

- training: all training data in the CHiME2 data set
- test: one file from the test data
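A rough sketch of this configuration (layer sizes, variable names, and the exact loss wiring are assumptions; the spectra `X` and packed targets `Y` are assumed to be prepared elsewhere):

```python
import keras.backend as K
from keras.models import Sequential
from keras.layers import LSTM, Dense

n_bins = 513

def msa_loss(y_true, y_pred):
    # y_true packs [|s|, |y|] (clean then noisy magnitudes); y_pred is the mask a.
    s_mag = y_true[..., :n_bins]
    y_mag = y_true[..., n_bins:]
    return K.mean(K.square(y_pred * y_mag - s_mag), axis=-1)

model = Sequential([
    LSTM(256, return_sequences=True, input_shape=(None, n_bins)),  # one LSTM
    Dense(n_bins, activation='sigmoid'),                           # one Dense
])
model.compile(optimizer='rmsprop', loss=msa_loss)

# X: noisy spectra divided by the maximum absolute value ("normalized" spectrum),
# Y: clean and noisy spectra concatenated along the frequency axis.
# model.fit(X, Y, batch_size=128, epochs=50)
```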

Prediction: [figure]

Original spectrogram: [figure]

nateanl commented 7 years ago

So I added a normalization layer to the model. To see whether the model works, I trained it with the binary_crossentropy loss function. The inputs are log-mel spectra from all of the training data, and the targets are just the masks. The test is one audio file from the test data. The first figure is the input, which is normalized to mean 0 and standard deviation 1 inside the model; the second is the prediction.

[figure: input]

[figure: prediction]
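For reference, one way such in-model normalization could be done (a sketch only, not necessarily how the layer here is implemented): a `Lambda` layer that z-scores each frame before it reaches the LSTM.

```python
from keras.layers import Lambda
import keras.backend as K

# Standardize each frame to zero mean and unit standard deviation.
normalize = Lambda(lambda x: (x - K.mean(x, axis=-1, keepdims=True))
                             / (K.std(x, axis=-1, keepdims=True) + 1e-8))
```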

I'm now training the model with the loss function defined above to see what the results look like.

nateanl commented 7 years ago

I found it hard to get a really good result when training the model with the loss function we defined. When I went back and read the paper again, it said:

> Prior work [6] showed that careful multi-stage training of the LSTM network is essential to obtain good speech separation performance. Following this recipe, the network was first trained to predict a 100-bin mel-transformed mask output using the mask approximation cost function.

So I trained the model with a mean squared error loss function first. The result is not bad:

[figure: ground truth]

[figure: prediction]

Then I kept the weights of the current model and continued training it with the new loss function. I will report the result when the training is finished.
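A sketch of the two-stage recipe using the pieces above (the names `model`, `make_psa_loss`, and `n_bins` are from the earlier sketches; `X`, `masks`, and the packed targets `Y` are assumed to be prepared elsewhere). Recompiling a Keras model keeps its weights, so the second `fit` continues from where the first stage stopped:

```python
# Stage 1: mask approximation with mean squared error.
model.compile(optimizer='rmsprop', loss='mse')
model.fit(X, masks, batch_size=128, epochs=50)

# Stage 2: keep the learned weights and switch to the new loss with the
# packed [clean, noisy] targets (depending on the Keras version, the output
# and target shapes may need the concatenation trick sketched earlier).
model.compile(optimizer='rmsprop', loss=make_psa_loss(n_bins))
model.fit(X, Y, batch_size=128, epochs=50)
```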

nateanl commented 7 years ago

First, train the model to predict the mask using a mean squared error loss function. When that training is finished, keep the weights and continue training the same model with the new loss function.

I tested one file from the 0 dB audio files, and the SDR score is 8.7 dB. I will report the average score soon.
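For completeness, a hedged sketch of how a single file's SDR could be scored, assuming the `mir_eval` package (this is not necessarily the scoring code used for the numbers below):

```python
import numpy as np
import mir_eval

def sdr_db(clean_wav, est_wav):
    """SDR (dB) of a resynthesized estimate against the clean reference."""
    sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(
        np.atleast_2d(clean_wav), np.atleast_2d(est_wav))
    return sdr[0]
```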

nateanl commented 7 years ago

Average SDR result (in dB):

| SNR | 0 dB | 3 dB | 6 dB | 9 dB | -3 dB | -6 dB |
|---|---|---|---|---|---|---|
| SDR (dB) | 5.16 | 6.32 | 7.13 | 7.74 | 4.01 | 2.95 |

This is for a model with one ordinary LSTM layer, and the mask type is ideal amplitude. Building a model with two bidirectional LSTM layers is expected to improve the result.
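A sketch of what that bidirectional variant might look like (layer sizes and the 513-bin assumption are guesses):

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense, Bidirectional

n_bins = 513
blstm_model = Sequential([
    Bidirectional(LSTM(256, return_sequences=True), input_shape=(None, n_bins)),
    Bidirectional(LSTM(256, return_sequences=True)),
    Dense(n_bins, activation='sigmoid'),   # predicted mask
])
```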