kundtx / lfd2022-comments


Learning from Data (Fall 2022) #39

Open kundtx opened 1 year ago

kundtx commented 1 year ago

http://8.129.175.102/lfd2022fall-poster-session/31.html

Prof-Greatfellow commented 1 year ago

G1 Haizhou Liu: Very good work on learning adolescent mental health from speech data! There might be a spelling mistake in the word "utterance". Also, is there any possible reason why the LSTMs outperform GRUs?

MaoShiwei commented 1 year ago

G21 Shiwei Mao: Good job and nice idea! I think it would be better to add a name to each formula, as I am a little confused about the process of standardization.

wyqw commented 1 year ago

@Prof-Greatfellow G1 Haizhou Liu: Very good work on learning adolescent mental health from speech data! There might be a spelling mistake in the word "utterance". Also, is there any possible reason why the LSTMs outperform GRUs?

G31 Yuqi Wang: Thanks for your reminder! Yes, we made a spelling mistake in the word "utterance", and we have corrected the typo on the poster. And great question! Some possible reasons: LSTM has one more gate than GRU, which gives it finer control over the flow of information; and LSTM has more parameters than GRU, so it is more expressive and its fit is slightly better. Thank you sincerely for reading and commenting.
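As a rough illustration of the parameter-count point (not from the poster; the input size 384 is the IS09 dimension mentioned later in this thread, and the hidden size of 128 is a hypothetical choice), an LSTM layer carries four gate weight sets where a GRU carries three:

```python
def rnn_param_count(num_gates, input_size, hidden_size):
    # Per layer: num_gates input weight matrices (input_size x hidden_size),
    # num_gates recurrent matrices (hidden_size x hidden_size),
    # and num_gates bias vectors (hidden_size each).
    return num_gates * (input_size * hidden_size
                        + hidden_size * hidden_size
                        + hidden_size)

lstm = rnn_param_count(4, 384, 128)  # LSTM: input, forget, cell, output gates
gru = rnn_param_count(3, 384, 128)   # GRU: reset, update, candidate gates
print(lstm, gru)  # the LSTM layer is exactly 4/3 the size of the GRU layer
```

With everything else equal, the LSTM layer is always 4/3 the size of the corresponding GRU layer, which matches the "more parameters, more expressive" argument above.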

wyqw commented 1 year ago

@MaoShiwei G21 Shiwei Mao: Good job and nice idea! I think it would be better to add a name to each formula, as I am a little confused about the process of standardization.

G31 Yuqi Wang: Thanks for your advice! Let me explain a bit more. The three formulas in Figure 2 compute the three thresholds required for endpoint detection with the double-threshold method in this task. After many attempts, we settled on the following for each utterance:

- **HE** (higher energy threshold): half of the average short-term energy over all frames.
- **LE** (lower energy threshold): the average short-term energy of the first 10 and last 10 frames, averaged with HE and then multiplied by 1/2.
- **ZC** (short-term zero-crossing-rate threshold): three times the average zero-crossing rate of the first 10 and last 10 frames.

Those are the names and meanings of the abbreviations in the formulas. Regarding standardization: the speech features of each utterance are 384/1582-dimensional high-level statistical features extracted with openSMILE, and the raw value ranges of different features vary greatly. Therefore, for each teenager we first compute the mean and standard deviation of the features over their five speech recordings, and then apply Z-score standardization to each teenager's extracted features. Thank you sincerely for reading and commenting.
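A minimal sketch of those two computations as I read the description above (the 10-frame border windows and the 1/2 factors follow the text; the exact formulas are on the poster, so treat this as an interpretation, not the authors' code):

```python
from statistics import mean, pstdev

def endpoint_thresholds(energy, zcr):
    """Double-threshold endpoint-detection thresholds for one utterance.
    energy: per-frame short-term energy; zcr: per-frame zero-crossing rate."""
    HE = 0.5 * mean(energy)                    # higher energy threshold
    border_e = energy[:10] + energy[-10:]      # first and last 10 frames
    LE = 0.5 * ((mean(border_e) + HE) / 2.0)   # lower energy threshold
    border_z = zcr[:10] + zcr[-10:]
    ZC = 3.0 * mean(border_z)                  # zero-crossing-rate threshold
    return HE, LE, ZC

def zscore(x):
    """Z-score standardization of one feature vector."""
    m, s = mean(x), pstdev(x)
    return [(v - m) / s for v in x]
```

After `zscore`, every feature dimension has zero mean and unit variance, which removes the large differences in raw value ranges mentioned above.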

ErlindaQiao commented 1 year ago

G3 Xizi Qiao: Very impressive and meaningful work! May I ask a few questions: (1) What features are being extracted? (2) Are the features extracted beforehand and used as the input of the deep learning models, or is the feature extractor trained within the model?

aqaaqaaqa commented 1 year ago

@ErlindaQiao G3 Xizi Qiao: Very impressive and meaningful work! May I ask a few questions: (1) What features are being extracted? (2) Are the features extracted beforehand and used as the input of the deep learning models, or is the feature extractor trained within the model?

G31 Shurui Bai: The following should answer both of your questions at once. We extract features in two steps. In the first step, we used openSMILE 3.0 to extract high-level statistical features for each utterance in the adolescent speech dataset, namely the IS09 and IS10 feature sets. The 2009 InterSpeech Challenge feature set (IS09) contains 384 expert-selected HSF features, computed from 16 LLD features such as zero-crossing rate, RMS frame energy, F0, HNR, and MFCC 1-12; similarly, IS10 contains 1582 features. The single-utterance features extracted by openSMILE are then used as the input of the deep learning model. In the second step, the contextual feature extractor is trained within the model: an LSTM/GRU learns the contextual features across each adolescent's five utterances in response-time order, a series of fully connected layers then learns deeper representations, and the final classification result is obtained through softmax to identify the adolescent's mental health status. Thank you sincerely for reading and commenting.
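The final classification step (fully connected layer plus softmax over the contextual representation) can be sketched in plain Python. The LSTM/GRU encoder is replaced here by a mean-pooling stand-in, and all dimensions, weights, and class counts are illustrative assumptions, not the poster's actual model:

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classify(utterance_feats, weights, biases):
    # Stand-in for the LSTM/GRU encoder: mean-pool the five utterance
    # feature vectors into one context vector (illustration only).
    dim = len(utterance_feats[0])
    ctx = [sum(u[i] for u in utterance_feats) / len(utterance_feats)
           for i in range(dim)]
    # One fully connected layer producing one logit per class.
    logits = [sum(w_i * x_i for w_i, x_i in zip(w, ctx)) + b
              for w, b in zip(weights, biases)]
    probs = softmax(logits)
    return probs.index(max(probs)), probs
```

In the real pipeline the context vector would come from the recurrent encoder's final hidden state rather than pooling, but the FC-then-softmax shape of the decision step is the same.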