ld-archer / E_FEM

This is the repository for the English version of the Future Elderly Model, originally developed at the Leonard D. Schaeffer Center for Health Policy and Microsimulation.
MIT License

Investigate BMI #41

Closed ld-archer closed 3 years ago

ld-archer commented 3 years ago

BMI is a bit of a problem in the current version of the model.

Firstly, BMI was only measured and recorded on even waves in ELSA, which means we have NO data for odd waves. On top of that, a considerable proportion of BMI data is also missing for the even waves, hovering between 40% and 60% (ridiculous).

To handle the missing data, we first impute BMI for even waves using hotdecking. Then, we interpolate the odd waves using the even waves. This already introduces some problems. For example, for people who have a low BMI in wave 2 (e.g. 16) and a higher BMI in wave 4 (e.g. 22), the value for wave 1 is impossibly low, because it is extrapolated outside the range of known data.
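To make the wave 1 problem concrete, here is a minimal sketch using the wave 2 and wave 4 numbers from the example above. It assumes the fill is a straight line through the two known waves, which this issue does not state explicitly. The function names `linear_fill` and `clamped_fill` are hypothetical, and the clamped version is only one possible mitigation, not what the model currently does.

```python
def linear_fill(known, wave):
    """Value at `wave` on the straight line through two known (wave, BMI) points."""
    (w0, b0), (w1, b1) = sorted(known.items())
    slope = (b1 - b0) / (w1 - w0)
    return b0 + slope * (wave - w0)


def clamped_fill(known, wave):
    """Same line, but limited to the range of the observed BMI values."""
    lowest, highest = min(known.values()), max(known.values())
    return min(max(linear_fill(known, wave), lowest), highest)


# Hypothetical respondent: BMI 16 in wave 2 and 22 in wave 4.
known = {2: 16.0, 4: 22.0}

wave3_interpolated = linear_fill(known, 3)  # between the known waves
wave1_extrapolated = linear_fill(known, 1)  # outside the known waves
wave1_clamped = clamped_fill(known, 1)      # held at the nearest observed value

print(wave3_interpolated, wave1_extrapolated, wave1_clamped)
```

Wave 3 lands between the two observations, but wave 1 continues the upward trend backwards and falls below anything that was actually measured for this person.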

Also, after interpolating we add some noise. We draw from a normal distribution with limits (-2, 2) and add the draw to the interpolated data. We did this in the first place because the lag of BMI was almost a perfect predictor of current BMI, which shouldn't be the case.
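A minimal sketch of the noise step, assuming "limits (-2, 2)" means the draw itself is truncated to that interval and that the underlying normal has mean 0 and standard deviation 1 (neither is stated in this issue). The function name `truncated_normal`, the seed, and the example BMI values are all hypothetical, and the truncation is done by simple rejection sampling.

```python
import random


def truncated_normal(rng, lower=-2.0, upper=2.0, mu=0.0, sigma=1.0):
    """Draw from N(mu, sigma), redrawing until the value falls inside (lower, upper)."""
    while True:
        draw = rng.gauss(mu, sigma)
        if lower < draw < upper:
            return draw


rng = random.Random(41)  # fixed seed so the sketch is reproducible

# Hypothetical interpolated odd-wave BMI values.
interpolated = [19.0, 24.5, 31.2]
noisy = [bmi + truncated_normal(rng) for bmi in interpolated]

for before, after in zip(interpolated, noisy):
    print(f"{before:.2f} -> {after:.2f}")
```

With these assumptions no interpolated value moves by more than 2 BMI points, which breaks the near-perfect relationship between lagged and current BMI without adding extreme values.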

ToDo

ld-archer commented 3 years ago

Update

Comparing Means

No significant difference in the weighted means of these populations (stock_CV1 & transition):

stock_CV1

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
         bmi |   3,814  13275679.6     28.0638   4.979973    13.8869    56.1539

Transition

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
         bmi |  50,937   49099.014     27.5049   5.577009   .8737087   84.80872

Something of note, however, is the minimum and maximum values. 13.9 is dangerously low, and I think it would be considered anorexic in most people. Is this realistic then? Should we remove it? Obviously the min value of 0.874 is impossible and will need to be handled. This should be handled in reshape_long so we only need to do it once, and I think the culprit is the interpolation step.

RMSE

Removing this term made things worse. The average BMI dropped by approximately half a point for both CV and minimal runs, so I think we should keep it.

Transition Models

Made big changes to the logbmi transition model. Removed l2logbmi as a predictor, and the model improved massively. Still experimenting with the transitions, but we went from ending with a mean BMI of 34 in wave 8 to a mean BMI of 28 over the same period. The p-value for the wave 8 T-test is also above 0.05.

ld-archer commented 3 years ago

Low BMI

Comparing the summary before and after imputation, the problem is there beforehand, but is made worse by the imputation step:

ELSA_long.dta

Before

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
         bmi |  28,663  27758.8844     28.2397   5.262646   2.910022   71.11111

After

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
         bmi |  56,659  54750.5507    27.34451   5.603264   .8737087   84.80872

ld-archer commented 3 years ago

Final Update

Low values (< 14) have been removed AFTER the interpolation step. This step was causing the low values (and some high ones as well, but those are not as troublesome), so now the low values are removed and imputed in kludge.do by hotdecking. Showing logbmi here as this is the variable that is imputed in kludge.do. I should have been showing that every time I compared stock to transition, but the final values are good, so I'm happy.

stock_CV1
    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
      logbmi |   4,405    15428835    3.321269   .1730587   2.699424   4.028096
Transition
    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
      logbmi |  30,398  29412.3988    3.325955    .181115   2.655519   4.264244
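
Because these tables are on the log scale, it helps to convert the bounds back to BMI before comparing them with the cut-off of 14. A quick sketch, assuming logbmi is the natural log of BMI (the stock_CV1 maximum converts back to the BMI maximum shown for stock_CV1 earlier in this issue, which supports that assumption). The dictionary layout is only for illustration.

```python
import math

# Min and max of logbmi copied from the two summary tables above.
logbmi_summary = {
    "stock_CV1": {"min": 2.699424, "max": 4.028096},
    "transition": {"min": 2.655519, "max": 4.264244},
}

# Undo the log transform to get back to BMI units.
bmi_scale = {
    name: {stat: math.exp(value) for stat, value in stats.items()}
    for name, stats in logbmi_summary.items()
}

for name, stats in bmi_scale.items():
    print(f"{name}: min BMI {stats['min']:.2f}, max BMI {stats['max']:.2f}")
```

Both minimums come out above 14 (roughly 14.9 for stock_CV1 and 14.2 for transition), so the impossible sub-1 values are gone from both populations.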

Also, see the comparison of the stock population BEFORE and AFTER imputing in kludge.do:

Before:

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
      logbmi |   7,549  7463.84219    3.316639    .171767   2.699424   4.028096

After:

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
      logbmi |   8,780  8779.96029    3.317211   .1727126   2.699424   4.028096

Change in mean is small, and we gain ~1200 data points.
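
For anyone reading along, the remove-then-hotdeck idea can be sketched as below. This is an illustration only, not the code in kludge.do: it works on BMI rather than logbmi for readability, the donor cells (sex and age band) and the records are made up, and the function name `hotdeck_bmi` is hypothetical. The only value taken from this issue is the cut-off of 14.

```python
import random

BMI_FLOOR = 14.0  # cut-off used in this issue


def hotdeck_bmi(records, rng, floor=BMI_FLOOR):
    """Blank out BMI below `floor`, then fill gaps from a random donor in the same cell."""
    cleaned = []
    for rec in records:
        bmi = rec["bmi"]
        if bmi is not None and bmi < floor:
            bmi = None  # treat implausibly low values as missing
        cleaned.append({**rec, "bmi": bmi})

    # Collect observed values per donor cell.
    donors = {}
    for rec in cleaned:
        if rec["bmi"] is not None:
            donors.setdefault(rec["cell"], []).append(rec["bmi"])

    # Fill each missing value with a randomly chosen donor value from its cell.
    for rec in cleaned:
        if rec["bmi"] is None and donors.get(rec["cell"]):
            rec["bmi"] = rng.choice(donors[rec["cell"]])
    return cleaned


# Made-up records; "cell" stands in for whatever grouping defines a donor pool.
records = [
    {"id": 1, "cell": "F_60-69", "bmi": 27.1},
    {"id": 2, "cell": "F_60-69", "bmi": 0.87},
    {"id": 3, "cell": "F_60-69", "bmi": 31.4},
    {"id": 4, "cell": "M_70-79", "bmi": None},
    {"id": 5, "cell": "M_70-79", "bmi": 24.8},
]

filled = hotdeck_bmi(records, random.Random(0))
for rec in filled:
    print(rec["id"], rec["cell"], rec["bmi"])
```

Because every filled value is copied from a real observation, the imputed data stay inside the observed range, which is consistent with the min and max being unchanged in the Before and After tables above.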

Just to put this to bed, here are the T-tests for BMI in both CV1 and the minimal scenario.

CV1

    Variable   | Wave |  FEM mean | ELSA mean |  p-value
    -----------+------+-----------+-----------+---------
    BMI        |    4 |  28.06897 |  28.36905 |  0.00506
    BMI        |    6 |  27.92054 |  28.31962 |  0.00057
    BMI        |    8 |  27.78603 |  27.98109 |  0.12615
    Smoke now  |    3 |   0.13437 |   0.14308 |  0.15313
    Smoke now  |    4 |   0.12042 |   0.13262 |  0.05444
    Smoke now  |    5 |   0.10865 |   0.11606 |  0.23156
    Smoke now  |    6 |   0.09922 |   0.10176 |  0.67785
    Smoke now  |    7 |      0.09 |   0.09444 |  0.48308
    Smoke now  |    8 |   0.08372 |   0.07076 |  0.03078
    Smoke ever |    3 |    0.6341 |    0.6275 |  0.43436
    Smoke ever |    4 |   0.63349 |   0.63043 |  0.73383
    Smoke ever |    5 |   0.63239 |   0.64774 |  0.09607
    Smoke ever |    6 |   0.63145 |   0.65623 |  0.00996
    Smoke ever |    7 |   0.63137 |   0.66739 |  0.00044
    Smoke ever |    8 |   0.62924 |    0.6583 |  0.00851

Minimal

    Variable   | Wave |  FEM mean | ELSA mean |  p-value
    -----------+------+-----------+-----------+---------
    BMI        |    4 |  28.03767 |  28.30303 |  0.00037
    BMI        |    6 |  27.96525 |  28.28117 |    7E-05
    BMI        |    8 |  27.85072 |  27.97677 |  0.16127
    Smoke now  |    3 |   0.14356 |   0.14211 |  0.73142
    Smoke now  |    4 |   0.13067 |   0.13101 |  0.93744
    Smoke now  |    5 |    0.1194 |   0.11769 |  0.68935
    Smoke now  |    6 |   0.10931 |    0.1036 |  0.17839
    Smoke now  |    7 |   0.09961 |   0.09553 |  0.35365
    Smoke now  |    8 |   0.09227 |   0.07739 |  0.00057
    Smoke ever |    3 |   0.64251 |   0.63611 |  0.26842
    Smoke ever |    4 |   0.64621 |   0.63148 |  0.01722
    Smoke ever |    5 |    0.6495 |   0.65028 |  0.90256
    Smoke ever |    6 |   0.65216 |   0.65828 |  0.35412
    Smoke ever |    7 |   0.65471 |   0.66837 |  0.05229
    Smoke ever |    8 |     0.657 |   0.66535 |   0.2703

Both models are significantly better in terms of BMI than before the changes in this issue as well as #48.