Closed NathanielF closed 1 year ago
I think this is ready for review now. It's quite long and covers a number of approaches to imputation.
(i) We discuss the taxonomy of missingness: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). I try to set it up as a prelude to considerations about causal inference.
(ii) FIML and MLE approaches to estimating a multivariate model given missing data.
(iii) Bayesian imputation of missing values using the multivariate Gaussian and the posterior predictive distribution.
(iv) Two examples of imputation using sequential regression equations.
Each of the approaches so far is presented in the Enders book and our estimates match those presented there.
(v) I apply the missing data imputation to a hierarchical model and estimate the values of the missing data informed by the structure of "team" clusters in our employee data set. The model is estimated using the blackjax sampler and shows divergences, but converges nicely with good Rhat numbers. I use the differences in imputation patterns between the hierarchical model and the simpler regression models to argue for why we need to be aware of heterogeneous patterns of imputation, and how this is analogous to concerns about heterogeneous treatment effects in causal inference.
We finish with a wrap-up and a celebration of the flexibility of Bayesian modelling in an enterprise that has to work with confounding and complexity.
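To make the MCAR / MAR / MNAR distinction in (i) concrete, here is a minimal simulation sketch. It is not taken from the notebook: the variable names, the 30% and 60% drop rates, and the data-generating process are all assumptions chosen for illustration. It compares the mean of the values that remain observed under each mechanism with the mean of the full data.

```python
import random
import statistics


def simulate_missingness(n=20_000, seed=0):
    """Mean of the observed y values under three (assumed) missingness mechanisms."""
    rng = random.Random(seed)
    observed = {"MCAR": [], "MAR": [], "MNAR": []}
    full = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)      # covariate that is always observed
        y = x + rng.gauss(0.0, 1.0)  # variable that may go missing
        full.append(y)

        # MCAR: y is dropped with a fixed probability, ignoring x and y.
        if rng.random() >= 0.3:
            observed["MCAR"].append(y)

        # MAR: the chance of dropping y depends only on the observed x.
        if not (x > 0 and rng.random() < 0.6):
            observed["MAR"].append(y)

        # MNAR: the chance of dropping y depends on the value of y itself.
        if not (y > 0 and rng.random() < 0.6):
            observed["MNAR"].append(y)

    means = {name: statistics.fmean(values) for name, values in observed.items()}
    means["full"] = statistics.fmean(full)
    return means


if __name__ == "__main__":
    for name, value in simulate_missingness().items():
        print(f"{name:>5}: {value: .3f}")
```

In this toy setup the complete-case mean stays close to the full-data mean only under MCAR. Under MAR the bias can in principle be corrected by conditioning on `x`, while under MNAR it cannot be corrected from the observed data alone.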
fonnesbeck commented on 2023-01-24T02:41:15Z ----------------------------------------------------------------
The table looks janky. Does it need to be placed in a code block to enforce monospace?
NathanielF commented on 2023-01-24T10:55:40Z ----------------------------------------------------------------
Fair. It was a bit needless. I've taken another approach and now show the patterns of missingness as a pandas DataFrame.
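For readers following the thread, a table of missingness patterns of the kind mentioned in this reply can be sketched as below. This is a plain-Python illustration, not the notebook's code: the column names and toy records are hypothetical, and `None` is assumed to mark a missing value. With pandas, `df.isna().value_counts()` gives a comparable summary.

```python
from collections import Counter

# Hypothetical column names and toy records; None marks a missing value.
columns = ["satisfaction", "empowerment", "leadership"]
rows = [
    {"satisfaction": 3.0, "empowerment": 32.0, "leadership": 11.0},
    {"satisfaction": None, "empowerment": 29.0, "leadership": 13.0},
    {"satisfaction": 4.0, "empowerment": 30.0, "leadership": 9.0},
    {"satisfaction": 5.0, "empowerment": None, "leadership": 10.0},
    {"satisfaction": None, "empowerment": 27.0, "leadership": 12.0},
    {"satisfaction": 2.0, "empowerment": 31.0, "leadership": 14.0},
]


def missingness_patterns(rows, columns):
    """Count each pattern of missing (True) / observed (False) across columns."""
    patterns = Counter(
        tuple(row.get(col) is None for col in columns) for row in rows
    )
    # Most frequent pattern first.
    return patterns.most_common()


if __name__ == "__main__":
    for pattern, count in missingness_patterns(rows, columns):
        label = ", ".join(
            f"{col}={'missing' if is_missing else 'observed'}"
            for col, is_missing in zip(columns, pattern)
        )
        print(f"{count} row(s): {label}")
```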
fonnesbeck commented on 2023-01-24T02:41:16Z ----------------------------------------------------------------
Should add a legend if possible.
NathanielF commented on 2023-01-24T11:02:17Z ----------------------------------------------------------------
Done.
fonnesbeck commented on 2023-01-24T02:41:17Z ----------------------------------------------------------------
Perhaps add a sentence or two interpreting these plots?
NathanielF commented on 2023-01-24T10:56:09Z ----------------------------------------------------------------
Updated and added some more explanatory text
fonnesbeck commented on 2023-01-24T02:41:18Z ----------------------------------------------------------------
Line #15. pm.Potential("x_logp", pm.logp(rv=pm.MvNormal.dist(mus, chol=cov_flat_prior), value=x))
Why are potentials being constructed here rather than just imputing with the MvNormal likelihood? Does that not work anymore? (perhaps I'm missing something obvious)
Yes, I think it's broken or not implemented in the latest version. I was getting the same error discussed here: https://discourse.pymc.io/t/automatic-imputation-of-multivariate-models/11029
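As a rough reading of the line quoted above (a sketch, not a statement of what the notebook does elsewhere): a `pm.Potential` adds an arbitrary term to the model's joint log-density. Assuming `x` stacks rows $x_i$ that combine the observed entries with free parameters for the missing entries, that `chol` is the Cholesky factor $L$ of the covariance, and that the missing entries carry some prior $p(x_{\text{mis}})$ (all assumptions, since the surrounding cell is not shown here), the target being sampled is:

```latex
\log p(\mu, L, x_{\text{mis}} \mid x_{\text{obs}})
  = \log p(\mu, L) + \log p(x_{\text{mis}})
  + \underbrace{\sum_{i=1}^{n} \log \mathcal{N}\!\left(x_i \mid \mu,\; L L^{\top}\right)}_{\text{term added by the potential}}
  + \text{const}
```

Under those assumptions this is the same joint density that automatic imputation through an observed `MvNormal` likelihood would target, which is why the potential can stand in for it as a workaround.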
fonnesbeck commented on 2023-01-24T02:41:19Z ----------------------------------------------------------------
Lower case y in "PyMC"
NathanielF commented on 2023-01-24T10:57:43Z ----------------------------------------------------------------
Adjusted!
fonnesbeck commented on 2023-01-24T02:41:19Z ----------------------------------------------------------------
I'm not sure printing out the entire idata object is helpful, given how large and verbose it is. Maybe pull a few elements that are interesting?
NathanielF commented on 2023-01-24T10:59:00Z ----------------------------------------------------------------
Removed the idata_uniform entirely as it was a bit overkill. I left the idata_normal. I like having the ability to inspect the model output. Makes reproductions easier to check for consistency.
Great tutorial!
Thank you for taking the time to review!! Glad you liked it.
Just giving this a little nudge @drbenvincent. Hope your weekend move went smoothly!?
drbenvincent commented on 2023-02-01T10:53:36Z ----------------------------------------------------------------
Update to February
NathanielF commented on 2023-02-01T19:17:55Z ----------------------------------------------------------------
Done
drbenvincent commented on 2023-02-01T10:53:37Z ----------------------------------------------------------------
There's no content in this intro section. Maybe just delete heading?
NathanielF commented on 2023-02-01T19:39:42Z ----------------------------------------------------------------
Done
drbenvincent commented on 2023-02-01T10:53:38Z ----------------------------------------------------------------
This might be a good opportunity to cross link to the notebook on censoring and truncation as a different kind of missingness https://www.pymc.io/projects/examples/en/latest/generalized_linear_models/GLM-truncated-censored-regression.html
NathanielF commented on 2023-02-01T19:39:24Z ----------------------------------------------------------------
Linked to that notebook too.
drbenvincent commented on 2023-02-03T09:10:17Z ----------------------------------------------------------------
Sorry if I'm missing it, but can't see a reference to the example
NathanielF commented on 2023-02-03T09:19:59Z ----------------------------------------------------------------
Just above introducing the employee data set below the MNAR definition
drbenvincent commented on 2023-02-03T09:23:41Z ----------------------------------------------------------------
👍🏻
drbenvincent commented on 2023-02-01T10:53:39Z ----------------------------------------------------------------
Should the Employee Satisfaction Surveys be a L2 header?
NathanielF commented on 2023-02-01T19:30:45Z ----------------------------------------------------------------
Yes
drbenvincent commented on 2023-02-01T10:53:40Z ----------------------------------------------------------------
The whole notebook uses this double hash for code comments. Isn't that a bit atypical, compared to a single hash? It's messing with my mind, thinking that it was intended to be a L2 markdown heading. I'd also recommend the standard single # so that it's consistent with the other notebooks
NathanielF commented on 2023-02-01T19:32:06Z ----------------------------------------------------------------
Changed this
drbenvincent commented on 2023-02-03T09:13:35Z ----------------------------------------------------------------
Thanks! A quick find shows up some remaining examples which are not actual L2 markdown headings. ## Percentage Missing in this cell. A bunch in cell 31, one in cell 11.
NathanielF commented on 2023-02-03T09:26:53Z ----------------------------------------------------------------
Agh... sorry. I think I've got them all now.
drbenvincent commented on 2023-02-01T10:53:41Z ----------------------------------------------------------------
Could be worth having brief text after the figure to highlight the relevant missingness about these distributions. At a glance, it's not obvious what is going on here in terms of missing data vs bad binning of the histograms.
I see another comment about a legend here, but not seeing one. Did it get committed?
NathanielF commented on 2023-02-01T19:31:47Z ----------------------------------------------------------------
I thought he just meant the legend in the picture i.e. the color labels for Empowerment etc... which were missing at the time he commented but are there now for me.
drbenvincent commented on 2023-02-03T09:14:42Z ----------------------------------------------------------------
Ah yes, Chris meant legend, but I meant figure caption :)
Perfect, thanks @drbenvincent. Will adjust this evening.
That should be good to go now @drbenvincent. I've tidied a few things and added some more explanatory text to signpost what I'm doing a bit more. I think I've also addressed all comments above.
A notebook on Missing Data methods and Bayesian imputation
Related to https://github.com/pymc-devs/pymc-examples/issues/461
This notebook aims to showcase methods for imputing missing data, primarily Bayesian ones. We focus on a dataset of employee satisfaction metrics drawn from the book Applied Missing Data Analysis. We demonstrate how FIML and Bayesian imputation work with the multivariate normal distribution and how they differ, and we show how to approximate the multivariate distribution using sequential chained equations.
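For context on the FIML approach named above, the standard observed-data log-likelihood for a multivariate normal model can be written as follows. The notation is an assumption for illustration and is not taken from the notebook: $y_i$ is the sub-vector of entries actually observed in row $i$, $k_i$ is its length, and $\mu_i$, $\Sigma_i$ are the matching sub-vector of $\mu$ and sub-matrix of $\Sigma$.

```latex
\ell(\mu, \Sigma)
  = \sum_{i=1}^{n} \left[
      -\frac{k_i}{2} \log(2\pi)
      - \frac{1}{2} \log \lvert \Sigma_i \rvert
      - \frac{1}{2} (y_i - \mu_i)^{\top} \Sigma_i^{-1} (y_i - \mu_i)
    \right]
```

Each row contributes only through the variables it actually reports, so no values are imputed. The Bayesian approach instead treats the missing entries as unknowns and samples them alongside $\mu$ and $\Sigma$.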