talhanai / redbud-tree-depression

scripts to model depression in speech and text

How to exclude some features? #2

Closed by zhiyongww 4 years ago

zhiyongww commented 5 years ago

"From the initial set of 553 features, we excluded all features without a statistically significant univariate correlation with outcomes on the training set (|ρ| < 1e-01, p > 1e-02) nor a significant L1 regularized logistic regression model coefficient (|β| < 1e-04), thus resulting in a subset of 279 features and 8,050 examples (responses)"

How do I exclude features to get the subset of 279 features?

talhanai commented 4 years ago
  1. Compute the Pearson correlation between each feature and the outcome on the training set. Drop any feature whose correlation coefficient falls below the threshold (|ρ| < 1e-01) and whose p-value exceeds the significance level (p > 1e-02).
  2. Take the features you kept and train an L1-regularized logistic regression model on the training data. Drop any feature whose model coefficient is almost zero (|β| < 1e-04).
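
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual script: it assumes SciPy and scikit-learn, a numeric feature matrix `X` and binary labels `y` taken from the training set, and the function name `select_features`, the regularization strength `C`, and the handling of constant columns are all hypothetical choices.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LogisticRegression


def select_features(X, y, rho_min=1e-1, p_max=1e-2, beta_min=1e-4, C=1.0):
    """Return the column indices of X that survive both filtering steps."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Step 1: univariate Pearson correlation with the outcome.
    # A feature is dropped when |rho| < rho_min and p > p_max.
    keep = []
    for j in range(X.shape[1]):
        if np.std(X[:, j]) == 0:
            # Constant column: correlation is undefined, so drop it (assumption).
            continue
        rho, p = pearsonr(X[:, j], y)
        if not (abs(rho) < rho_min and p > p_max):
            keep.append(j)
    keep = np.array(keep, dtype=int)
    if keep.size == 0:
        return keep

    # Step 2: L1-regularized logistic regression on the surviving features.
    # Features whose coefficient is almost zero (|beta| < beta_min) are dropped.
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    model.fit(X[:, keep], y)
    beta = model.coef_.ravel()
    return keep[np.abs(beta) >= beta_min]
```

Calling `selected = select_features(X_train, y_train)` and then using `X_train[:, selected]` keeps only the surviving columns. The size of L1 coefficients depends on feature scale and on `C`, so the number of features that remain will vary with preprocessing.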

You can adjust the thresholds as you like.

I hope that clarifies it.

clintonlau commented 3 years ago

Can I ask what the parsing process was to get the 8,050 examples? Using the transcripts from the training set, I am counting the number of times 'Participant' appears as the speaker, treating consecutive 'Participant' turns as a single example (since they're essentially one response to a question). This yields only around 6,200 examples. Any advice would be much appreciated!
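
For reference, the counting procedure described in this comment can be sketched as below. This only illustrates the merge-consecutive-turns rule; it assumes the speaker labels have already been read from a transcript into a list in order, and the function name `count_responses` and the `'Interviewer'` label are hypothetical.

```python
from itertools import groupby


def count_responses(speakers, target="Participant"):
    """Count runs of consecutive `target` turns, treating each run as one example."""
    # groupby collapses adjacent identical labels into a single group.
    return sum(1 for speaker, _ in groupby(speakers) if speaker == target)


turns = ["Interviewer", "Participant", "Participant", "Interviewer", "Participant"]
print(count_responses(turns))  # 2
```

Summing this count over all training transcripts gives the total under that rule. Counting every 'Participant' row separately instead would give a larger number.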