Closed clintonlau closed 3 years ago
Hi @clintonlau ,
I can't recall how many lines you would get with `grep "Participant"`, but I think it would be more than 8,050. That's because, if I recall correctly, there could be consecutive lines starting with "Participant", which I collapsed into a single line. Is that the case? (I don't have access to the original or processed data to check.)
Hi @talhanai ,
That's correct! In the raw transcripts, "Participant" can appear consecutively under the 'speaker' column for a single response, and simply counting the "Participant" occurrences yields 16,906 lines in the training set.
So then, I blindly grouped consecutive "Participant" occurrences in the transcripts into a single count (based purely on the 'speaker' column, which I convert to a list and process on its own), but this only yielded 6,218 lines in the training set :sweat_smile:. So I thought perhaps I missed some criterion (e.g., what constitutes a response) in the parsing process. Hopefully this rings a bell; I would really appreciate any advice and guidance!
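For reference, the grouping step described above can be sketched with `itertools.groupby` (the speaker values here are illustrative; the real DAIC-WOZ transcripts have a 'speaker' column with values like these):

```python
from itertools import groupby

# Hypothetical speaker sequence read off the transcript's 'speaker' column.
speakers = ["Ellie", "Participant", "Participant", "Ellie", "Participant"]

# Collapse each run of identical consecutive speakers into a single turn.
turns = [speaker for speaker, _ in groupby(speakers)]

# Count the collapsed "Participant" turns (i.e., distinct responses).
participant_turns = sum(1 for s in turns if s == "Participant")
```

Here the two consecutive "Participant" lines collapse into one turn, so `participant_turns` is 2 rather than 3.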
Hi @clintonlau ,
From the paper it says "we generated embeddings of individual responses to all queries and the queries themselves, for a total of 8,050 training examples, ..." so I think 8,050 also counts the lines with "Ellie:".
Hi @talhanai ,
My confusion with including Ellie's 170 unique queries in the "training examples" is that after describing the audio feature selection process in the Audio Features section (4.1.3), the paper says "..., thus resulting in a subset of 279 features and 8,050 examples (responses)". This in a way implies that the 8,050 examples are solely the subject's responses, right? Since the audio features (the COVAREP feature set) do not represent Ellie's queries.
I could see two possible cases to explain the discrepancy, though neither is flawless:
case 1) for doc2vec, embeddings are trained on 8,050 question-response pairs, while the audio features decouple each q-r pair and use only the response part. However, I have only been able to locate ~6,200 q-r pairs.
case 2) doc2vec is trained on questions and responses separately (so 8,050 counts the lines with "Ellie:", as you suggested). But then the total should be at least double the ~6,200 subject responses I have found so far...
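For case 1, the pairing I attempted can be sketched as follows, assuming the transcript has already been reduced to a list of (speaker, text) tuples with consecutive same-speaker lines collapsed (the function name and sample texts are mine, not from the repo):

```python
def pair_turns(turns):
    """Pair each "Ellie" query with the immediately following
    "Participant" response. `turns` is a list of (speaker, text)
    tuples with consecutive same-speaker lines already collapsed."""
    return [(query, response)
            for (s1, query), (s2, response) in zip(turns, turns[1:])
            if s1 == "Ellie" and s2 == "Participant"]

pairs = pair_turns([("Ellie", "how are you?"),
                    ("Participant", "i'm fine"),
                    ("Ellie", "good"),
                    ("Participant", "thanks")])
# pairs == [("how are you?", "i'm fine"), ("good", "thanks")]
```

Each resulting pair would then be one "document" for doc2vec under case 1.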
Hi @clintonlau ,
> This in a way implies that the 8,050 examples are solely the subject's response, right?

Correct.
I follow you. The transcripts needed some manual cleaning. The other thing I can think of is whether there were cases where "Ellie" and "Participant" query-response pairs were wrapped into a single line, thus reducing the number of pairs and leading you to find ~6,200. Alternatively, since the difference between 6,200 and 8,050 is fairly large, I wonder if 'training' was a typo and the figure combines the training + testing numbers. (Note: training was only performed with the training set.)
Hi @talhanai ,
Thank you for your prompt response! I just added the "Participant" response count from the development set and the total came to 8,080 :grin:. So I think I am on the right track now; thanks for the tip!
Could I get some confirmation/clarification on two things before I close this thread?
1. Each query and its corresponding response are paired up as one "document" to train the doc2vec embeddings (unlike the audio features, which use only the subject's response, as you confirmed above). If this statement is correct, then these doc2vec embeddings (query+response level) and audio features (response level) are used to train the LSTM model, as well as the logistic regression models.
2. Since logistic regression outputs a probability, and inferring from the figure below from one of your slides: when evaluating whether a subject is depressed, you use the logistic regression model to get a depression probability per response, then multiply all the response-level probabilities within that subject to get a final probability for classifying the subject's depressive state?
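If I'm reading the figure right, the aggregation step in question could look like this minimal sketch (the function name is mine; summing logs is just a numerically safer way to multiply many probabilities):

```python
import math

def subject_score(response_probs):
    """Multiply per-response depression probabilities into one
    subject-level score. Summing in log space avoids floating-point
    underflow when a subject has many responses."""
    return math.exp(sum(math.log(p) for p in response_probs))

score = subject_score([0.9, 0.8])
# score ~= 0.72 (i.e., 0.9 * 0.8)
```

The subject-level score would then be thresholded to produce the final depressed/not-depressed label.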
Hi @clintonlau ,
That's good that combining the train + dev sets seems to resolve the number of examples.
I hope that was helpful.
thanks for your clarifications!
I really appreciate that you published your code!
I am currently trying to replicate your feature generation process. Could you please elaborate on how you narrowed the training set down to the 8,050 examples? My understanding is that they are only the subject's responses to Ellie's queries, but I am having difficulty arriving at the exact number of training examples that you report.
Thanks in advance!