talhanai / redbud-tree-depression

scripts to model depression in speech and text

How to get 8050 training examples (subject responses)? #8

Closed clintonlau closed 3 years ago

clintonlau commented 3 years ago

I really appreciate that you published your code!

I am currently trying to replicate your feature generation process. Could you please elaborate on how you narrowed the training set down to the 8,050 examples? My understanding is that they are only the subject's responses to Ellie's queries, but I am having difficulty arriving at the exact number of training examples that you have.

Thanks in advance!

talhanai commented 3 years ago

Hi @clintonlau ,

I can't recall how many lines you would get with `grep "Participant"`, but I think it would be more than 8,050. That's because, if I recall correctly, there could be consecutive lines that start with "Participant", which I collapsed into a single line. Is that the case? (I don't have access to the original or processed data to check.)

clintonlau commented 3 years ago

Hi @talhanai ,

That's correct! In the raw transcripts, "Participant" can appear consecutively under the 'speaker' column for a single response, and simply counting "Participant" occurrences yields 16,906 lines in the training set.

So I blindly grouped consecutive "Participant" occurrences in the transcripts into a single count (based purely on the 'speaker' column, which I convert to a list and process on its own), but this only yielded 6,218 lines in the training set :sweat_smile:. So I thought perhaps I missed some criterion (e.g. what constitutes a response) in the parsing process. Hopefully this rings a bell; I would really appreciate any advice and guidance!
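For reference, here is roughly what my grouping step looks like (a minimal sketch; the tab-separated format and the 'speaker' column name reflect my local DAIC-WOZ transcripts, and `train_transcript_paths` is a hypothetical list of file paths):

```python
from itertools import groupby
import pandas as pd

def count_collapsed_responses(transcript_path):
    # DAIC-WOZ transcripts are tab-separated with a 'speaker' column
    # (assumption based on my local copies of the data).
    df = pd.read_csv(transcript_path, sep="\t")
    speakers = df["speaker"].tolist()
    # groupby collapses each run of identical consecutive speakers,
    # so consecutive "Participant" rows count as one response.
    return sum(1 for spk, _ in groupby(speakers) if spk == "Participant")

# Summing over all training transcripts gives my 6,218 figure:
# total = sum(count_collapsed_responses(p) for p in train_transcript_paths)
```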

talhanai commented 3 years ago

Hi @clintonlau ,

The paper says "we generated embeddings of individual responses to all queries and the queries themselves, for a total of 8,050 training examples, ...", so I think 8,050 also counts the lines with "Ellie:".

clintonlau commented 3 years ago

Hi @talhanai ,

My confusion with including Ellie's 170 unique queries in the "training examples" is that, after describing the audio feature selection process in the Audio Features section (4.1.3), the paper says "..., thus resulting in a subset of 279 features and 8,050 examples (responses)". This implies that the 8,050 examples are solely the subject's responses, right? The audio features (the COVAREP feature set) do not represent Ellie's queries.

I can see two possible cases that would explain the discrepancy, though neither is flawless:

case 1) for doc2vec, embeddings are trained on 8,050 question-response pairs, while the audio features decouple each of the 8,050 q-r pairs and use only the response part. However, I have only been able to locate ~6,200 q-r pairs (my pairing logic is sketched below).

case 2) doc2vec is trained on questions and responses separately (8,050 counts the lines with "Ellie:", as you suggested). But then the total should be at least double the ~6,200 subject responses that I have found so far...
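For context, this is roughly how I pair each Ellie query with the participant response that follows it (a sketch only; it assumes a list of (speaker, utterance) tuples taken from the transcript rows):

```python
from itertools import groupby

def extract_qr_pairs(rows):
    # rows: list of (speaker, utterance) tuples from one transcript.
    # First collapse runs of the same speaker into single turns.
    turns = []
    for speaker, grp in groupby(rows, key=lambda r: r[0]):
        turns.append((speaker, " ".join(utt for _, utt in grp)))
    # Then pair each Ellie turn with the Participant turn right after it.
    pairs = []
    for (spk_a, text_a), (spk_b, text_b) in zip(turns, turns[1:]):
        if spk_a == "Ellie" and spk_b == "Participant":
            pairs.append((text_a, text_b))
    return pairs
```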

talhanai commented 3 years ago

Hi @clintonlau ,

> This implies that the 8,050 examples are solely the subject's responses, right?

Correct.

I follow you. The transcripts needed some manual cleaning. The other thing I can think of is whether there were cases where "Ellie" and "Participant" query-response pairs were wrapped into a single line, reducing the number of pairs and leading you to find ~6,200. Alternatively, since the difference between 6,200 and 8,050 is a little large, I wonder if 'training' was a typo and the figure combines the training + testing numbers. (Note: training was only performed with the training set.)

clintonlau commented 3 years ago

Hi @talhanai ,

Thank you for your prompt response! I just added the "Participant" response count from the development set, and the total came to 8,080 :grin: So I think I am on the right track now; thanks for the tip!

Could I get some confirmation/clarification on two things before I close this thread?

  1. Each query and its corresponding response are paired up as one "document" for training the doc2vec embeddings (unlike the audio features, which use only the subject's response, as you confirmed above). If this is correct, then these doc2vec embeddings (query+response level) and audio features (response level) are used to train the LSTM model as well as the logistic regression models. (I sketch my understanding of the doc2vec setup below the figure.)

  2. Inferring from the figure below, from one of your slides: when evaluating whether a subject is depressed or not, you use the logistic regression model to get a depression probability per response, and then multiply all of that subject's response-level probabilities to get a final probability that classifies the subject's depressive state?

(image: slide figure showing how response-level probabilities are combined into a subject-level prediction)
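To make question 1 concrete, here is a minimal sketch of how I understand the doc2vec setup (using the gensim API; the query+response pairing is my reading of the paper, not code from this repo):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# `qr_pairs` as produced by a pairing step like the one sketched
# earlier: a list of (query, response) strings. Tiny stand-in here:
qr_pairs = [("how are you doing today", "i am doing okay i guess")]

# One "document" per query+response pair (my assumption, per case 1).
documents = [
    TaggedDocument(words=(q + " " + r).split(), tags=[i])
    for i, (q, r) in enumerate(qr_pairs)
]
model = Doc2Vec(documents, vector_size=100, window=5, min_count=1, epochs=40)

# Embedding for an unseen response:
vec = model.infer_vector("how have you been feeling lately".split())
```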

talhanai commented 3 years ago

Hi @clintonlau ,

Good to hear that combining the train + dev sets seems to resolve the number of examples.

  1. Yes, I believe so.
  2. You are correct that each probability is at the subject response level, but rather than multiplying them, each response was weighted according to the predictive power of the given query, and the mean of the weighted probabilities is then taken as the subject-level probability (rough sketch below).
    • Note that the weight is the AUC for a given query's set of responses (where each response belongs to a subject in the training set).
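Roughly, the aggregation looks like this (a sketch from memory with illustrative names, not the original code; I read "mean of the weighted probabilities" as a weight-normalized average):

```python
import numpy as np

def subject_probability(response_probs, query_ids, query_auc):
    # response_probs: logistic-regression P(depressed) per response.
    # query_ids: which query elicited each response.
    # query_auc: per-query AUC on the training set, used as the weight.
    probs = np.asarray(response_probs, dtype=float)
    weights = np.array([query_auc[q] for q in query_ids], dtype=float)
    # Weighted mean (not a product) of response-level probabilities.
    return np.sum(weights * probs) / np.sum(weights)
```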

I hope that was helpful.

clintonlau commented 3 years ago

Thanks for your clarifications!