inespancorbo opened 3 years ago
Do we want to condition our variables with missing values on the predictors that affect missingness? Note this paragraph from van Buuren:
Note that the label “ignorable” does not mean that we can be entirely careless about the missing data. For inferences to be valid, we need to condition on those factors that influence the missing data rate. For example, in the MAR example of Section 2.2.4 the missingness in Y2 depends on Y1. A valid estimate of the mean of Y2 cannot be made without Y1, so we should include Y1 somehow into the calculations for the mean of Y2.
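To make van Buuren's point concrete, here is a small simulation (synthetic data, nothing to do with our grades set): under MAR the complete-case mean of Y2 is biased, but conditioning on Y1 recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Y1 is fully observed; Y2 is correlated with Y1 and has true mean 0.
y1 = rng.normal(0, 1, n)
y2 = y1 + rng.normal(0, 1, n)

# MAR: Y2 is missing more often when Y1 is large.
miss = rng.random(n) < np.where(y1 > 0, 0.8, 0.2)

# Unconditional complete-case mean of Y2 is biased downward...
cc_mean = y2[~miss].mean()

# ...but conditioning on Y1 (regress Y2 on Y1 among observed cases,
# then predict for everyone) recovers the true mean.
slope, intercept = np.polyfit(y1[~miss], y2[~miss], 1)
adj_mean = (intercept + slope * y1).mean()

print(cc_mean, adj_mean)
```

The complete-case mean comes out clearly negative while the conditioned estimate sits near zero, which is exactly the "include Y1 somehow" point in the quote.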
What if we use `higher_yes` and `absences` for the missing data? That way we have one categorical and one continuous predictor with missing values.
I did a little EDA and tried to find a reasonable OLS model based on theory and a desire to avoid correlated predictors. This seems reasonable:
```
> summary(lm(G3 ~ age + failures + sex + higher + Medu + absences, data = grades))

Call:
lm(formula = G3 ~ age + failures + sex + higher + Medu + absences,
    data = grades)

Residuals:
     Min       1Q   Median       3Q      Max
-12.4297  -2.1098   0.2417   2.9036   9.0457

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.09318    3.32015   3.341 0.000915 ***
age         -0.21844    0.17650  -1.238 0.216618
failures    -1.88391    0.30929  -6.091  2.7e-09 ***
sexM         1.13744    0.43266   2.629 0.008906 **
higheryes    1.72756    1.03997   1.661 0.097490 .
Medu         0.43701    0.20436   2.138 0.033104 *
absences     0.03828    0.02723   1.406 0.160462
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.199 on 388 degrees of freedom
Multiple R-squared:  0.173,  Adjusted R-squared:  0.1602
F-statistic: 13.52 on 6 and 388 DF,  p-value: 6.015e-14
```
Nice! I’m surprised `G1` and `G2` didn’t make it. I’m good with those predictors, and with `higher_yes` and `absences` for missing, if Luke is ok too. What prior would you give `absences` though? The first thing that comes to mind is Poisson, which would still be a discrete distribution.
Oh, I didn't consider `G1` and `G2`. I suppose it depends on the goal of our model!
> Do we want to condition our variables with missing values on the predictors that affect missingness? Note this paragraph from van Buuren:
>
> Note that the label “ignorable” does not mean that we can be entirely careless about the missing data. For inferences to be valid, we need to condition on those factors that influence the missing data rate. For example, in the MAR example of Section 2.2.4 the missingness in Y2 depends on Y1. A valid estimate of the mean of Y2 cannot be made without Y1, so we should include Y1 somehow into the calculations for the mean of Y2.
Yeah, under MAR the ignorability would depend on observable variables (either other predictors or the response). That's how I understood it (though I'm not super familiar). So I conditioned the priors on X_J, all the observed predictors, but I'm not sure if that is what you mean.
**MCAR**

- Randomly drop X% of values for `higher` and `absences`. Estimate the model on the remaining complete cases.

**MAR**

- `higher` is missing with probability conditional on age.
- `absences` is missing with probability conditional on age.

**MNAR**

- `higher` is missing with x% probability if `higher == "yes"` and y% probability if `higher == "no"`.
- `absences` is missing with x% probability if `absences < 6` and y% probability if `absences >= 6`.
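A minimal sketch of the three amputation scenarios in Python, assuming a pandas DataFrame like our `grades` data; the probabilities `x`, `y`, and the age-18 cutoff for MAR are placeholders, not agreed-on values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

def ampute(df, x=0.1, y=0.3, mechanism="MCAR"):
    """Return a copy of df with NaNs introduced in `higher` and `absences`."""
    out = df.copy()
    out["absences"] = out["absences"].astype(float)  # allow NaN
    n = len(out)
    if mechanism == "MCAR":
        # Drop values completely at random (y is unused here).
        m_h = rng.random(n) < x
        m_a = rng.random(n) < x
    elif mechanism == "MAR":
        # Missingness depends only on a fully observed variable (age).
        m_h = rng.random(n) < np.where(out["age"] >= 18, y, x)
        m_a = rng.random(n) < np.where(out["age"] >= 18, y, x)
    elif mechanism == "MNAR":
        # Missingness depends on the (possibly unobserved) value itself.
        m_h = rng.random(n) < np.where(out["higher"] == "yes", x, y)
        m_a = rng.random(n) < np.where(out["absences"] < 6, x, y)
    out.loc[m_h, "higher"] = np.nan
    out.loc[m_a, "absences"] = np.nan
    return out
```

Then the complete-case fit is just the model estimated on `ampute(grades, mechanism="MCAR").dropna()`, and likewise for the other two mechanisms.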
> Oh, I didn't consider `G1` and `G2`. I suppose it depends on the goal of our model!
I'm good with the ones you have, if Luke is ok too.
> **MCAR**
>
> - Randomly drop X% of values for `higher` and `absences`. Estimate the model on the remaining complete cases.
>
> **MAR**
>
> - `higher` is missing with probability conditional on age.
> - `absences` is missing with probability conditional on age.
>
> **MNAR**
>
> - `higher` is missing with x% probability if `higher == "yes"` and y% probability if `higher == "no"`.
> - `absences` is missing with x% probability if `absences < 6` and y% probability if `absences >= 6`.
I like these scenarios. Should the missing mechanism/probability distribution you wrote down only be included in the MNAR scenario? Is that the main difference between MAR and MNAR: we can disregard the missing mechanism in the first case but cannot in the latter?
Those work for me. My only concern is where they came from. I remember Professor Meyer saying that he really doesn't want to see classical statistical analysis.
For

> `absences` is missing with x% probability if `absences < 6` and y% probability if `absences >= 6`

could we change this to a probability proportional to absences?

- `absences` is missing with `p * absences`% probability.

That way we take advantage of the continuous nature of the data as opposed to turning it into a binary.
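A quick sketch of that proportional mechanism; the rate `p` is a placeholder, and the probability is clipped to [0, 1] since `p * absences` can exceed 1 for large counts:

```python
import numpy as np

rng = np.random.default_rng(2)

def ampute_proportional(absences, p=0.02):
    """Drop each absences value with probability p * absences, capped at 1."""
    absences = np.asarray(absences, dtype=float)
    prob = np.clip(p * absences, 0.0, 1.0)  # guard against p * absences > 1
    out = absences.copy()
    out[rng.random(len(out)) < prob] = np.nan
    return out
```

One consequence of this scheme worth noting: a student with 0 absences can never be missing, which is itself a modeling choice.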
I was thinking this could make sense given the nature of the data, until I started deriving the full posterior and finding the conditionals... The full conditionals for Absences_i, Higher_yes_i, beta, and sigma2 are straightforward: they're recognizable (Poisson, Bernoulli, MVN, and inverse gamma respectively). The full conditionals for alpha0, alpha1, gamma0, gamma1 are beasts and not recognizable. There are packages like PyMC that would do this for you, but I don't know how to use them and I'm not sure we can for the project.
> The full conditionals for alpha0, alpha1, gamma0, gamma1 are beasts and not recognizable
I see two ways to approach this:

1. Add Metropolis-Hastings steps inside the Gibbs sampler for alpha0, alpha1, gamma0, and gamma1.
2. Use a package like PyMC that handles the sampling for us.

What are your thoughts?
Yeah, hopefully we can stick with 1, but I guess we can resort to 2 if that's not possible. I attached the derivations. Let me know what you think and if it makes sense or if there are typos. I think the entire sampler (Gibbs, and MH if possible) is going to be computationally intensive because our n is roughly 400; vectorizing stuff might make it quicker, but we'll see.
Also, for the proposals for the alphas and gammas we can use a normal with mean equal to the value from the previous iteration and variance tuned to get a particular acceptance rate. We can use a normal because the alphas and gammas can be any real number in (-infinity, infinity).
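A minimal random-walk Metropolis step along those lines, with a normal proposal centered at the previous value. The target below is a stand-in standard-normal log density, not our actual full conditional for alpha0, and `step_sd` is the tuning parameter:

```python
import numpy as np

rng = np.random.default_rng(3)

def log_target(a):
    # Stand-in for the (unnormalized) log full conditional of alpha0.
    return -0.5 * a**2

def rw_metropolis(log_target, a0=0.0, step_sd=2.4, n_iter=5000):
    a = a0
    draws = np.empty(n_iter)
    accepted = 0
    for t in range(n_iter):
        # Normal proposal centered at the previous iteration's value.
        prop = a + rng.normal(0.0, step_sd)
        # Accept with probability min(1, target(prop) / target(a)).
        if np.log(rng.random()) < log_target(prop) - log_target(a):
            a, accepted = prop, accepted + 1
        draws[t] = a
    return draws, accepted / n_iter

draws, acc_rate = rw_metropolis(log_target)
```

In the full sampler this step would replace the direct draw for each of alpha0, alpha1, gamma0, gamma1 inside the Gibbs loop, with `step_sd` tuned per parameter until `acc_rate` lands in the usual random-walk range of roughly 0.2-0.5.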
Updated notes for the scenario where we choose either G1 or G2 as the other variable with missing data. For you guys to see if there are any typos. I'll code up the scenario with G1/G2 as the other variable with missing data.
Sorry if I am missing this. How did you pick the priors for alpha0, alpha1, gamma0, and gamma1?
You mean why I assumed a normal N(0, 100) in the notes I attached?
Exactly
Yeah, I assumed that because I wanted to give a noninformative prior. I don't think the alphas and gammas are restricted, so they can take any value on the real line (hence a normal with large variance made sense to me). But we can change this, which will lead to a different model, and then compare models. I'm open to this if there is time.
Also, haha, please make sure the model I wrote down makes sense; I made a lot of independence assumptions, and I might have messed up some algebra somewhere too.
Other than feature selection, we need to think about the model itself. I wrote some stuff down (see pic), just a suggestion so we can get started. So far I picked two features for missing data and used all other features. The notebook with that is in my branch; for some reason the LaTeX is not showing up when pushed to GitHub, so I added a pic. I believe the below would be for a MAR/ignorable model.