Closed: louislau1129 closed this issue 2 months ago.
Thank you very much for your advice and for taking the time to identify potential issues in the implementation. I apologize for the delay in responding as I had other matters to attend to.
Preprocessing Layer Output Dimension: I have corrected the output dimension of the preprocessing layer. It now projects the pretrained wav2vec2 features to a lower hidden dimension of 32.
Normalization of Predicted and Label Scores: Regarding the label scores, the multiplication by 0.2 is intentional. This is done to project the raw fluency label score, which ranges from 0 to 10, to a 0 to 2 scale.
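For concreteness, the mapping is just a linear rescaling (the variable names below are illustrative, not taken from train.py):

```python
# Raw fluency labels lie in [0, 10]; multiplying by 0.2 maps them to [0, 2],
# the range the model's predicted score is expected to cover.
raw_label = 7.5            # example raw fluency score in [0, 10]
target = raw_label * 0.2   # -> 1.5, now on the [0, 2] scale
```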
With some adjustments, the model is now delivering results in line with expectations. Thanks a lot.
First of all, thank you for making the effort to implement this paper. Regarding the unsatisfactory experimental results, I have identified some possible implementation issues that may explain the performance gap.
The output dimension of the preprocessing layer should not be equal to the input dimension (see https://github.com/tangYang7/fluency_scorer/blob/main/models/fluScorer.py#L108). Instead, this layer should project the pretrained wav2vec2 features to a much lower hidden dimension, i.e., 32, and the input dimension of the BLSTM should be adjusted accordingly.
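For illustration, a minimal sketch of what such a front end could look like, assuming the 768-dimensional features of the wav2vec2 base model (the class and variable names are hypothetical, not the repo's exact code):

```python
import torch.nn as nn

class FluencyFrontEnd(nn.Module):
    """Sketch: project wav2vec2 features to a small hidden size before the BLSTM."""

    def __init__(self, feat_dim=768, hidden_dim=32):
        super().__init__()
        # Preprocessing layer: 768 -> 32, rather than 768 -> 768
        self.proj = nn.Linear(feat_dim, hidden_dim)
        # The BLSTM input size must match the projected dimension
        self.blstm = nn.LSTM(hidden_dim, hidden_dim,
                             batch_first=True, bidirectional=True)

    def forward(self, feats):        # feats: (batch, frames, 768)
        x = self.proj(feats)         # (batch, frames, 32)
        out, _ = self.blstm(x)       # (batch, frames, 64) = 2 * hidden_dim
        return out
```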
The mask is not implemented as intended; for example, no mask is passed into the pooling function at https://github.com/tangYang7/fluency_scorer/blob/main/models/fluScorer.py#L56.
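For example, a masked average pooling could look roughly like this (a sketch assuming the mask marks valid frames with 1 and padding with 0; the function name is hypothetical):

```python
import torch

def masked_mean_pool(frames, mask):
    """Average frame-level features over time, ignoring padded frames.

    frames: (batch, frames, dim) frame-level features
    mask:   (batch, frames) with 1 for valid frames, 0 for padding
    """
    mask = mask.unsqueeze(-1).to(frames.dtype)   # (batch, frames, 1)
    summed = (frames * mask).sum(dim=1)          # (batch, dim)
    counts = mask.sum(dim=1).clamp(min=1.0)      # (batch, 1), avoid divide-by-zero
    return summed / counts
```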
The predicted and label scores do not appear to be properly normalized. The raw fluency label score ranges over [0, 10], and it is not clear why it is multiplied by 0.2 at https://github.com/tangYang7/fluency_scorer/blob/main/train.py#L313.
I hope these suggestions help you obtain satisfactory results.