lbotti319 / bayesian_math_scores


Model #7

Open inespancorbo opened 3 years ago

inespancorbo commented 3 years ago

Other than feature selection we need to think about the model itself. I wrote some stuff down (see pic). Just a suggestion so we can get started. So far picked two features for missing data and used all other features. The notebook with that is in my branch -- for some reason the latex is not showing up when pushed to github so added a pic. I believe the below would be for a MAR/ignorable.

Screen Shot 2021-05-08 at 11 30 25 AM
awunderground commented 3 years ago

Do we want to condition our variables with missing values on the predictors that affect missingness? Note this paragraph from van Buuren:

Note that the label “ignorable” does not mean that we can be entirely careless about the missing data. For inferences to be valid, we need to condition on those factors that influence the missing data rate. For example, in the MAR example of Section 2.2.4 the missingness in Y2 depends on Y1. A valid estimate of the mean of Y2 cannot be made without Y1, so we should include Y1 somehow into the calculations for the mean of Y2.
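As a quick illustration of van Buuren's point, here is a hypothetical simulation (all names and numbers are made up, not from our data): missingness in Y2 depends on Y1, so the complete-case mean of Y2 is biased, but conditioning on Y1 recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical MAR setup mirroring van Buuren's example: Y2 depends on Y1,
# and missingness in Y2 depends only on Y1 (which is fully observed).
y1 = rng.normal(0, 1, n)
y2 = y1 + rng.normal(0, 1, n)          # true mean of Y2 is 0

# Larger Y1 -> higher chance that Y2 is observed.
p_observed = 1 / (1 + np.exp(-2 * y1))
observed = rng.random(n) < p_observed

# Naive complete-case mean is biased upward...
naive_mean = y2[observed].mean()

# ...but conditioning on Y1 fixes it: regress Y2 on Y1 among observed
# cases, predict for everyone, and average the predictions.
b1, b0 = np.polyfit(y1[observed], y2[observed], 1)
adjusted_mean = (b0 + b1 * y1).mean()
```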

awunderground commented 3 years ago

What if we do higher_yes and absences for the missing data? That way we have one categorical and one continuous predictor with missing values.

awunderground commented 3 years ago

I did a little EDA and tried to find a reasonable OLS model based on theory and a desire to avoid correlated predictors. This seems reasonable:

```
> summary(lm(G3 ~ age + failures + sex + higher + Medu + absences, data = grades))

Call:
lm(formula = G3 ~ age + failures + sex + higher + Medu + absences, 
    data = grades)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.4297  -2.1098   0.2417   2.9036   9.0457 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 11.09318    3.32015   3.341 0.000915 ***
age         -0.21844    0.17650  -1.238 0.216618    
failures    -1.88391    0.30929  -6.091  2.7e-09 ***
sexM         1.13744    0.43266   2.629 0.008906 ** 
higheryes    1.72756    1.03997   1.661 0.097490 .  
Medu         0.43701    0.20436   2.138 0.033104 *  
absences     0.03828    0.02723   1.406 0.160462    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.199 on 388 degrees of freedom
Multiple R-squared:  0.173, Adjusted R-squared:  0.1602 
F-statistic: 13.52 on 6 and 388 DF,  p-value: 6.015e-14
```
inespancorbo commented 3 years ago

Nice! I’m surprised G1 and G2 didn’t make it. I’m good with those predictors, and with higher_yes and absences for the missing data, if Luke is ok too. What prior would you give absences, though? The first thing that comes to mind is Poisson, which is still a discrete distribution.

awunderground commented 3 years ago

Oh, I didn't consider G1 and G2. I suppose it depends on the goal of our model!

inespancorbo commented 3 years ago

Do we want to condition our variables with missing values on the predictors that affect missingness? Note this paragraph from van Buuren:

Note that the label “ignorable” does not mean that we can be entirely careless about the missing data. For inferences to be valid, we need to condition on those factors that influence the missing data rate. For example, in the MAR example of Section 2.2.4 the missingness in Y2 depends on Y1. A valid estimate of the mean of Y2 cannot be made without Y1, so we should include Y1 somehow into the calculations for the mean of Y2.

Yeah, MAR/ignorable would depend on observable variables (either other predictors or the response). That’s what I understood (though I’m not super familiar with it). So I conditioned the priors on XJ -- all the observed predictors. But I'm not too sure if that is what you mean.

awunderground commented 3 years ago

MCAR

Randomly drop X% of values for higher and absences. Estimate the model on the remaining complete cases.

MAR

  • higher is missing with probability conditional on age.
  • absences is missing with probability conditional on age.

MNAR

  • higher is missing with x% probability if higher == "yes" and y% probability if higher == "no"
  • absences is missing with x% probability if absences < 6 and y% probability if absences >= 6

inespancorbo commented 3 years ago

Oh, I didn't consider G1 and G2. I suppose it depends on the goal of our model!

Im good with the ones you have. If Luke is ok too

MCAR

Randomly drop X% of values for higher and absences. Estimate the model on the remaining complete cases.

MAR

  • higher is missing with probability conditional on age.
  • absences is missing with probability conditional on age.

MNAR

  • higher is missing with x% probability if higher == "yes" and y% probability if higher == "no"
  • absences is missing with x% probability if absences < 6 and y% probability if absences >= 6
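A minimal sketch of how these three scenarios could be simulated, assuming numpy; the column values and all rates here are toy placeholders, not our actual data:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 395  # roughly the size of the grades data

# Toy stand-ins for the real columns (hypothetical values).
age = rng.integers(15, 23, n)
higher = rng.choice(["yes", "no"], n, p=[0.9, 0.1])
absences = rng.poisson(5, n).astype(float)

def mcar(values, rate):
    """MCAR: randomly drop `rate` of the values, independent of everything."""
    out = values.astype(object)
    out[rng.random(n) < rate] = None
    return out

def mar(values, age, base=0.05, slope=0.03):
    """MAR: missingness probability conditional on age (fully observed)."""
    p = np.clip(base + slope * (age - age.min()), 0, 1)
    out = values.astype(object)
    out[rng.random(n) < p] = None
    return out

def mnar_absences(absences, p_low=0.05, p_high=0.30):
    """MNAR: missingness depends on the (possibly unobserved) value itself."""
    p = np.where(absences < 6, p_low, p_high)
    out = absences.astype(object)
    out[rng.random(n) < p] = None
    return out

higher_mcar = mcar(higher, rate=0.10)
absences_mar = mar(absences, age)
absences_mnar = mnar_absences(absences)
```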

I like these scenarios. Should the missingness mechanism/probability distribution you wrote down only be included in the MNAR scenario? Is that the main difference between MAR and MNAR -- that we can disregard the missingness mechanism in the former but cannot in the latter?

lbotti319 commented 3 years ago

Those work for me. My only concern is where did they come from? I remember Professor Meyer saying that he really doesn't want to see classical statistical analysis.

lbotti319 commented 3 years ago

For

absences is missing with x% probability if absences < 6 and y% probability if absences >= 6

Could we change this to be probability proportional to absences?

absences is missing with probability p * absences

That way we take advantage of the continuous nature of the data as opposed to turning it into a binary.
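A sketch of that proportional mechanism, assuming numpy and a toy absences vector; p is a hypothetical tuning constant, and the product is clipped so it stays a valid probability:

```python
import numpy as np

rng = np.random.default_rng(0)
absences = rng.poisson(5, 395).astype(float)  # toy stand-in for the real column

# Missingness probability proportional to the value itself (MNAR),
# clipped to [0, 1]; p controls the overall missingness rate.
p = 0.02
prob_missing = np.clip(p * absences, 0, 1)
mask = rng.random(absences.size) < prob_missing

absences_observed = absences.copy()
absences_observed[mask] = np.nan
```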

inespancorbo commented 3 years ago
Screen Shot 2021-05-08 at 5 49 35 PM Screen Shot 2021-05-08 at 5 52 30 PM

I was thinking this could make sense given the nature of the data, until I started deriving the full posterior and finding the conditionals ... the full conditionals for Absences_i, Higher_yes_i, beta, and sigma2 are straightforward -- they're recognizable (Poisson, Bernoulli, MVN, and inverse gamma, respectively). The full conditionals for alpha0, alpha1, gamma0, and gamma1 are beasts and not recognizable. There are packages like PyMC that would do this for you, but I don't know how to use them and I'm not sure we can for the project.

lbotti319 commented 3 years ago

The full conditionals for alpha0, alpha1, gamma0, gamma1 are beasts and not recognizable

I see two ways to approach this

  1. Use the M-H algorithm, assuming we can figure out a reasonable proposal distribution.
  2. Do some EDA and try to fit basic models to p(higher|age) and p(absences|age), then say something along the lines of "for the sake of simplicity, assume that alpha0 is x, alpha1 is y, ...". Basically, they could be priors if we want, but they don't have to be.

What are your thoughts?

inespancorbo commented 3 years ago

Note May 8, 2021.pdf

Yeah, hopefully we can stick with 1, but I guess we can resort to 2 if that's not possible. I attached the derivations. Let me know what you think, whether it makes sense, and if there are typos. I think the entire sampler (Gibbs plus M-H, if possible) is going to be computationally intensive because our n is roughly 400 -- vectorizing might make it quicker, but we'll see.

inespancorbo commented 3 years ago

Also, for the proposals for the alphas and gammas, we can use a normal with mean equal to the value from the previous iteration and variance tuned to get a particular acceptance rate. We can use a normal because the alphas and gammas can be any number in (-infinity, infinity).
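A sketch of that random-walk proposal, using a stand-in log density (a standard normal) in place of the actual full conditionals from the notes; the proposal standard deviation here is just an illustrative choice:

```python
import numpy as np

def log_target(x):
    # Stand-in for the (unrecognizable) full conditional of, say, alpha0;
    # here just a standard normal log-density up to an additive constant.
    return -0.5 * x**2

def random_walk_mh(log_target, n_iter=20_000, prop_sd=2.5, seed=0):
    rng = np.random.default_rng(seed)
    x = 0.0
    samples = np.empty(n_iter)
    accepts = 0
    for i in range(n_iter):
        # Proposal: normal centered at the previous iteration's value,
        # with prop_sd tuned for a reasonable acceptance rate.
        proposal = rng.normal(x, prop_sd)
        log_ratio = log_target(proposal) - log_target(x)
        if np.log(rng.random()) < log_ratio:
            x = proposal
            accepts += 1
        samples[i] = x
    return samples, accepts / n_iter

samples, accept_rate = random_walk_mh(log_target)
```

Within the Gibbs sampler, one such M-H step would replace the direct draw for each of the alphas and gammas, with everything else held at its current value.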

inespancorbo commented 3 years ago

Note May 8, 2023.pdf

Updated notes for the scenario where we choose either G1 or G2 as the other variable with missing data. Please check for any typos. I'll code up that scenario.

awunderground commented 3 years ago

Sorry if I am missing this. How did you pick the priors for alpha0, alpha1, gamma0, and gamma1?

inespancorbo commented 3 years ago

You mean why I assumed normal N(0, 100) priors in the notes I attached?

awunderground commented 3 years ago
Screen Shot 2021-05-10 at 7 48 11 PM
awunderground commented 3 years ago

Exactly

inespancorbo commented 3 years ago

Yeah, I assumed that because I wanted to give a noninformative prior -- I don't think the alphas and gammas are restricted, so they can take any value on the real line (so a normal with large variance seemed to make sense). But we can change this, which would lead to a different model, and then compare models. I'm open to that if there is time.

inespancorbo commented 3 years ago

Also, haha, please make sure the model I wrote down makes sense. I made a lot of independence assumptions, and I might have messed up some algebra somewhere too.