leichtrhino / ChimeraNet

Unofficial implementation of the music separation model by Luo et al.
MIT License

Mask-Inference layers #10

Closed: prashant45 closed this issue 5 years ago

prashant45 commented 5 years ago

https://github.com/arity-r/ChimeraNet/blob/6341383c61f238a83a0be8c7d4972aac4e7d958a/chimeranet/model.py#L57-L69

Correct me if I am wrong, but one could simply replace this block of code with Dense and Reshape layers like this:

from keras.layers import Dense, Reshape

mask_linear = Dense(self.F*self.C, activation='softmax', name='mask_linear')(body_linear)  # (None, T, F*C)
mask = Reshape((self.T, self.F, self.C), name='mask')(mask_linear)  # (None, T, F, C)

I think the API can handle the gradient updates correctly because of the Reshape layer. Also, it should not require much memory, because the masks are not extracted as a list of slices. But I wonder whether this would be a correct definition for the mask-inference head of the model.

The reference I use is the Chimera++ network from the paper "Alternative Objective Functions for Deep Clustering", which redefines the architecture for speaker separation.

[Screenshot: the Chimera++ architecture figure from the referenced paper]

leichtrhino commented 5 years ago

Thank you for the suggestion!

According to the summary() function Keras provides, the two implementations are different: the suggested implementation has more trainable parameters than the current one.

Although I have measured the model size neither in memory nor on disk, a model with more trainable parameters generally brings more expressive power along with more memory consumption. So I am skeptical that the suggested implementation requires less memory.

I can't say anything about the correctness of the model, as I think there are only 'good' models. If you want to implement the Chimera++ network you refer to, mask_linear should be attached to body_blstm_n rather than body_linear, since body_linear is actually an embedding layer, and the paper notes that the motivation for putting a mask-inference layer on top of an embedding layer is unclear. A sketch of that wiring follows.
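For illustration, a minimal sketch of that wiring, using the Dense + Reshape form from the suggestion above; the function name and the softmax placement are assumptions for illustration, not ChimeraNet code:

from keras.layers import Dense, Reshape

def chimerapp_mask_head(blstm_out, T, F, C):
    # Chimera++-style head: the mask-inference Dense reads the last
    # BLSTM output directly, not the embedding layer (body_linear)
    mask_linear = Dense(F * C, activation='softmax', name='mask_linear')(blstm_out)  # (None, T, F*C)
    return Reshape((T, F, C), name='mask')(mask_linear)  # (None, T, F, C)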

The following are results from the summary() function for the two implementations, with T, F, C, D = 5, 4, 2, 3 and fewer, smaller BLSTM layers.

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input (InputLayer)              (None, 5, 4)         0
__________________________________________________________________________________________________
body_blstm_1 (Bidirectional)    (None, 5, 20)        1200        input[0][0]
__________________________________________________________________________________________________
body_blstm_2 (Bidirectional)    (None, 5, 20)        2480        body_blstm_1[0][0]
__________________________________________________________________________________________________
body_linear (Dense)             (None, 5, 12)        252         body_blstm_2[0][0]
__________________________________________________________________________________________________
body (Reshape)                  (None, 5, 4, 3)      0           body_linear[0][0]
__________________________________________________________________________________________________
embedding_activation (Activatio (None, 5, 4, 3)      0           body[0][0]
__________________________________________________________________________________________________
mask_linear (Dense)             (None, 5, 8)         104         body_linear[0][0]
__________________________________________________________________________________________________
embedding (Lambda)              (None, 5, 4, 3)      0           embedding_activation[0][0]
__________________________________________________________________________________________________
mask (Reshape)                  (None, 5, 4, 2)      0           mask_linear[0][0]
==================================================================================================
Total params: 4,036
Trainable params: 4,036
Non-trainable params: 0
__________________________________________________________________________________________________
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input (InputLayer)              (None, 5, 4)         0
__________________________________________________________________________________________________
body_blstm_1 (Bidirectional)    (None, 5, 20)        1200        input[0][0]
__________________________________________________________________________________________________
body_blstm_2 (Bidirectional)    (None, 5, 20)        2480        body_blstm_1[0][0]
__________________________________________________________________________________________________
body_linear (Dense)             (None, 5, 12)        252         body_blstm_2[0][0]
__________________________________________________________________________________________________
body (Reshape)                  (None, 5, 4, 3)      0           body_linear[0][0]
__________________________________________________________________________________________________
mask_slice_1 (Lambda)           (None, 5, 3)         0           body[0][0]
__________________________________________________________________________________________________
mask_slice_2 (Lambda)           (None, 5, 3)         0           body[0][0]
__________________________________________________________________________________________________
mask_slice_3 (Lambda)           (None, 5, 3)         0           body[0][0]
__________________________________________________________________________________________________
mask_slice_4 (Lambda)           (None, 5, 3)         0           body[0][0]
__________________________________________________________________________________________________
embedding_activation (Activatio (None, 5, 4, 3)      0           body[0][0]
__________________________________________________________________________________________________
mask_linear_1 (Dense)           (None, 5, 2)         8           mask_slice_1[0][0]
__________________________________________________________________________________________________
mask_linear_2 (Dense)           (None, 5, 2)         8           mask_slice_2[0][0]
__________________________________________________________________________________________________
mask_linear_3 (Dense)           (None, 5, 2)         8           mask_slice_3[0][0]
__________________________________________________________________________________________________
mask_linear_4 (Dense)           (None, 5, 2)         8           mask_slice_4[0][0]
__________________________________________________________________________________________________
embedding (Lambda)              (None, 5, 4, 3)      0           embedding_activation[0][0]
__________________________________________________________________________________________________
mask (Lambda)                   (None, 5, 4, 2)      0           mask_linear_1[0][0]
                                                                 mask_linear_2[0][0]
                                                                 mask_linear_3[0][0]
                                                                 mask_linear_4[0][0]
==================================================================================================
Total params: 3,964
Trainable params: 3,964
Non-trainable params: 0
__________________________________________________________________________________________________
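For reference, here is a minimal sketch that reproduces the first summary above (the suggested Dense + Reshape variant) with the same toy sizes; the tanh activation and L2 normalization in the embedding head are assumptions for illustration, not necessarily the exact ChimeraNet code:

import keras.backend as K
from keras.layers import Input, Bidirectional, LSTM, Dense, Reshape, Activation, Lambda
from keras.models import Model

T, F, C, D = 5, 4, 2, 3

x = Input(shape=(T, F), name='input')
h = Bidirectional(LSTM(10, return_sequences=True), name='body_blstm_1')(x)
h = Bidirectional(LSTM(10, return_sequences=True), name='body_blstm_2')(h)
body_linear = Dense(F * D, name='body_linear')(h)    # (None, T, F*D)
body = Reshape((T, F, D), name='body')(body_linear)  # (None, T, F, D)

# embedding head: activation first, then row-wise L2 normalization
emb_act = Activation('tanh', name='embedding_activation')(body)
embedding = Lambda(lambda v: K.l2_normalize(v, axis=-1), name='embedding')(emb_act)

# mask head: a single Dense + Reshape, as suggested
mask_linear = Dense(F * C, activation='softmax', name='mask_linear')(body_linear)  # (None, T, F*C)
mask = Reshape((T, F, C), name='mask')(mask_linear)  # (None, T, F, C)

model = Model(inputs=x, outputs=[embedding, mask])
model.summary()  # Total params: 4,036, as in the first summary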
prashant45 commented 5 years ago

Thank you for the detailed answer.

You are right about the number of trainable parameters reported by model.summary().

The memory problem I mentioned was (I think) due to the list of mask_slice_n layers. I already tried running both architectures on an NVIDIA GeForce GTX 1080 Ti with 11 GB of memory. With an input shape of B, T, F = 32, 300, 129, the current implementation raises a memory error while the suggested implementation does not. Hence, I wonder whether a plain Reshape is a "correct" implementation for the original Chimera model.

leichtrhino commented 5 years ago

I confirmed that the suggested implementation does not hit a memory error while the current one does, although my environment differs.

Regarding the original Chimera implementation, I understood from the third paragraph of section 2.2 of the Chimera paper that F Dense(C, ...) layers constitute the mask-inference layer, so I implemented it in a split/merge fashion, as sketched below.
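A minimal sketch of that split/merge head, matching the second summary above; it assumes a body tensor of shape (None, T, F, D) as in the earlier sketch, and the per-bin softmax is my reading of the Chimera formulation:

import keras.backend as K
from keras.layers import Dense, Lambda

def split_merge_mask_head(body, T, F, C):
    # one Dense(C) per frequency bin, applied to its (None, T, D) slice
    slices = [Lambda(lambda v, i=i: v[:, :, i, :], name='mask_slice_%d' % (i + 1))(body)
              for i in range(F)]
    masks = [Dense(C, activation='softmax', name='mask_linear_%d' % (i + 1))(s)
             for i, s in enumerate(slices)]
    # stack the F per-bin masks back into (None, T, F, C)
    return Lambda(lambda ms: K.stack(ms, axis=2), name='mask')(masks)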

Regarding the order of activation and normalization, I chose this order following the second paragraph of section 2.2. I also assumed that each row of V (the TF x D embedding matrix) is a unit vector.
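In Keras terms, that order is the embedding head from the sketch above: activation first, then L2-normalize each D-dimensional row of V (tanh is assumed for illustration):

import keras.backend as K
from keras.layers import Activation, Lambda

def embedding_head(body):
    # activation first, then row-wise unit normalization
    act = Activation('tanh', name='embedding_activation')(body)
    return Lambda(lambda v: K.l2_normalize(v, axis=-1), name='embedding')(act)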

I'd like to suggest the following:

Thanks.