Investigate BMI

ld-archer commented 3 years ago

BMI is a bit of a problem in the current version of the model.

Firstly, BMI was only measured and recorded on even waves in ELSA, which means we have NO data for odd waves. On top of that, a considerable proportion of BMI data is also missing for the even waves, hovering between 40% and 60% (ridiculous).

To handle the missing data, we first impute BMI for even waves using hotdecking. Then, we interpolate the odd waves from the even waves. This already introduces some problems. For example, for people who have a low BMI in wave 2 (e.g. 16) and a higher BMI in wave 4 (e.g. 22), the value for wave 1 is impossibly low, because wave 1 lies outside the range of known data and is extrapolated rather than interpolated.
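To make the wave 1 problem concrete, here is a minimal Python sketch of a straight-line fill through two observed even waves. It assumes the interpolation is linear, which the issue does not state. The function name `linear_fill` and the `clamp` option are hypothetical and only illustrate one possible guard; the model's own code is in Stata.

```python
def linear_fill(wave, w_a, bmi_a, w_b, bmi_b, clamp=False):
    """Straight-line estimate of BMI at `wave` from two observed waves.

    Assumption: linear interpolation. With clamp=True the estimate is
    restricted to the range of the two observed values, so waves that lie
    outside [w_a, w_b] cannot be pushed beyond the known data.
    """
    slope = (bmi_b - bmi_a) / (w_b - w_a)
    estimate = bmi_a + slope * (wave - w_a)
    if clamp:
        lo, hi = min(bmi_a, bmi_b), max(bmi_a, bmi_b)
        estimate = max(lo, min(hi, estimate))
    return estimate


# Example from the text: BMI 16 in wave 2 and BMI 22 in wave 4.
print(linear_fill(3, 2, 16.0, 4, 22.0))              # 19.0, a true interpolation
print(linear_fill(1, 2, 16.0, 4, 22.0))              # 13.0, extrapolated below wave 2
print(linear_fill(1, 2, 16.0, 4, 22.0, clamp=True))  # 16.0, held at the nearest observed value
```

Clamping is only one option. Treating waves outside the observed range as missing and imputing them separately would be another.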

Also, after interpolating we add some noise. We draw a sample from a normal distribution truncated to (-2, 2) and add it to the interpolated data. We did this in the first place because the lag of BMI was almost a perfect predictor of current BMI, which shouldn't be the case.
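The noise step could look roughly like the Python sketch below. The mean (0) and standard deviation (1) of the underlying normal are assumptions, since the issue only gives the limits (-2, 2). The helper names are hypothetical and the input values are illustrative.

```python
import random


def truncated_normal(rng, low=-2.0, high=2.0, mu=0.0, sigma=1.0):
    """Draw from N(mu, sigma) by rejection, keeping only values in (low, high).

    Assumption: mu=0 and sigma=1; only the limits come from the issue.
    """
    while True:
        x = rng.gauss(mu, sigma)
        if low < x < high:
            return x


def add_noise(values, rng):
    """Add an independent truncated-normal draw to each interpolated value."""
    return [v + truncated_normal(rng) for v in values]


rng = random.Random(42)            # fixed seed so the run is repeatable
interpolated = [19.0, 24.5, 31.2]  # illustrative values, not ELSA data
noisy = add_noise(interpolated, rng)
print(noisy)
```

Because every draw lies strictly inside (-2, 2), no value moves by 2 BMI points or more. That is still enough to push a value that is already near the bottom of the plausible range further down.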

ToDo

[x] Check mean of BMI in CV1 stock pop and CV1 transition data
[x] When looking at the minimal models and comparing to ELSA data (see comment in #35), focus on BMI and try to understand the outputs
[x] Remove RMSE term from BMI model and see what impact this has (do this by directly editing the .est file before simulation, not the code)

ld-archer commented 3 years ago

Update

Comparing Means

No significant difference in the weighted means of these populations (stock_CV1 & transition):

stock_CV1


    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
         bmi |   3,814  13275679.6     28.0638   4.979973    13.8869    56.1539

Transition

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
         bmi |  50,937   49099.014     27.5049   5.577009   .8737087   84.80872

Something of note, however, is the minimum and maximum values. 13.9 is dangerously low, and I think it would be considered anorexic in most people. Is this realistic then? Should we remove it? Obviously the min value of 0.874 is impossible and will need to be handled. This should be handled in reshape_long so we only need to do it once, and I think the culprit is the interpolation step.

[x] Check if the interpolation/imputation step causes the weird values
[x] Drop any BMI value less than 15
[x] Impute any missing in kludge.do, and check distribution before and after
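The drop-then-impute plan in the list above can be sketched in Python as follows. This is a simplified random hot-deck, not the logic in kludge.do. The donor cell (`sex`), the helper names and the records are all assumptions made for illustration, and the threshold of 15 is taken from the to-do item.

```python
import random
from collections import defaultdict


def set_low_to_missing(records, key, threshold):
    """Return copies of the records with values below `threshold` set to None."""
    out = []
    for r in records:
        r = dict(r)
        if r[key] is not None and r[key] < threshold:
            r[key] = None
        out.append(r)
    return out


def hotdeck(records, key, cell_vars, rng):
    """Fill missing `key` values with a random donor value from the same cell."""
    donors = defaultdict(list)
    for r in records:
        if r[key] is not None:
            donors[tuple(r[v] for v in cell_vars)].append(r[key])
    filled = []
    for r in records:
        r = dict(r)
        if r[key] is None:
            pool = donors.get(tuple(r[v] for v in cell_vars))
            if pool:  # stays missing if the cell has no donors
                r[key] = rng.choice(pool)
        filled.append(r)
    return filled


rng = random.Random(0)
people = [  # illustrative records, not ELSA data
    {"id": 1, "sex": "f", "bmi": 27.4},
    {"id": 2, "sex": "f", "bmi": 0.87},  # impossible value
    {"id": 3, "sex": "m", "bmi": 31.0},
    {"id": 4, "sex": "m", "bmi": None},  # missing
]
cleaned = hotdeck(set_low_to_missing(people, "bmi", 15.0), "bmi", ["sex"], rng)
for person in cleaned:
    print(person)
```

Each cell in this toy data has a single donor, so the fills are deterministic. With real data, the choice of cell variables decides how well the imputed values preserve the distribution, which is why checking the distribution before and after matters.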

RMSE

Removing this term made things worse. The average BMI dropped by approximately half a point for both CV and minimal runs, so I think we should keep it.

Transition Models

Made big changes to the logbmi transition model. Removing l2logbmi as a predictor improved it massively. Still experimenting with the transitions, but we went from ending with a mean BMI of 34 in wave 8 to a mean BMI of 28 over the same period. The p-value for the wave 8 T-test is also above 0.05.

ld-archer commented 3 years ago

Low BMI

Comparing the summary before and after imputation, the problem is there beforehand, but is made worse by the imputation step:

ELSA_long.dta

Before

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
         bmi |  28,663  27758.8844     28.2397   5.262646   2.910022   71.11111

After

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
         bmi |  56,659  54750.5507    27.34451   5.603264   .8737087   84.80872

ld-archer commented 3 years ago

Final Update

Low values (< 14) have been removed AFTER the interpolation step. That step was causing the low values (and some high ones as well, but those are not as troublesome), so the low values are now removed and then imputed in kludge.do by hotdecking. The tables below show logbmi, as this is the variable that is imputed in kludge.do. I should have been showing that every time I compared stock to transition, but the final values are good, so I'm happy.

stock_CV1

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
      logbmi |   4,405    15428835    3.321269   .1730587   2.699424   4.028096

Transition

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
      logbmi |  30,398  29412.3988    3.325955    .181115   2.655519   4.264244
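Since these summaries are on the log scale, it can help to convert them back to BMI units. The short Python sketch below does that with the figures from the two tables above. Note that the exponential of a mean of logs is the geometric mean, so it will sit slightly below the arithmetic mean BMI reported earlier.

```python
import math

# logbmi summary statistics copied from the two tables above
summary = {
    "stock_CV1": {"mean": 3.321269, "min": 2.699424, "max": 4.028096},
    "Transition": {"mean": 3.325955, "min": 2.655519, "max": 4.264244},
}

# Back-transform each statistic to the BMI scale
bmi_scale = {
    name: {stat: round(math.exp(value), 2) for stat, value in stats.items()}
    for name, stats in summary.items()
}

for name, stats in bmi_scale.items():
    print(name, stats)
```

Both minimums come out a little above 14 on the BMI scale, so the impossible values near zero are gone.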

Also, see the comparison in the stock pop BEFORE and AFTER imputing in kludge.do:

Before:

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
      logbmi |   7,549  7463.84219    3.316639    .171767   2.699424   4.028096

After:

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
      logbmi |   8,780  8779.96029    3.317211   .1727126   2.699424   4.028096

The change in mean is small, and we gain ~1,200 data points.

Just to put this to bed, here are the T-tests for BMI in both CV1 and the minimal scenario.

CV1

    variable    | wave    fem_mean   elsa_mean    p_value
    ------------+-----------------------------------------
    BMI         |    4    28.06897    28.36905    0.00506
    BMI         |    6    27.92054    28.31962    0.00057
    BMI         |    8    27.78603    27.98109    0.12615
    Smoke now   |    3     0.13437     0.14308    0.15313
    Smoke now   |    4     0.12042     0.13262    0.05444
    Smoke now   |    5     0.10865     0.11606    0.23156
    Smoke now   |    6     0.09922     0.10176    0.67785
    Smoke now   |    7     0.09        0.09444    0.48308
    Smoke now   |    8     0.08372     0.07076    0.03078
    Smoke ever  |    3     0.6341      0.6275     0.43436
    Smoke ever  |    4     0.63349     0.63043    0.73383
    Smoke ever  |    5     0.63239     0.64774    0.09607
    Smoke ever  |    6     0.63145     0.65623    0.00996
    Smoke ever  |    7     0.63137     0.66739    0.00044
    Smoke ever  |    8     0.62924     0.6583     0.00851

Minimal

    variable    | wave    fem_mean   elsa_mean    p_value
    ------------+-----------------------------------------
    BMI         |    4    28.03767    28.30303    0.00037
    BMI         |    6    27.96525    28.28117    7E-05
    BMI         |    8    27.85072    27.97677    0.16127
    Smoke now   |    3     0.14356     0.14211    0.73142
    Smoke now   |    4     0.13067     0.13101    0.93744
    Smoke now   |    5     0.1194      0.11769    0.68935
    Smoke now   |    6     0.10931     0.1036     0.17839
    Smoke now   |    7     0.09961     0.09553    0.35365
    Smoke now   |    8     0.09227     0.07739    0.00057
    Smoke ever  |    3     0.64251     0.63611    0.26842
    Smoke ever  |    4     0.64621     0.63148    0.01722
    Smoke ever  |    5     0.6495      0.65028    0.90256
    Smoke ever  |    6     0.65216     0.65828    0.35412
    Smoke ever  |    7     0.65471     0.66837    0.05229
    Smoke ever  |    8     0.657       0.66535    0.2703
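For readers who want to reproduce this kind of comparison, the Python sketch below shows an unweighted Welch two-sample t-test that uses only the standard library. It is an assumption that this resembles the test behind the p-values above, since the issue does not say how they were computed or whether survey weights were used. The p-value uses a normal approximation, which is reasonable only for large samples, and the data are synthetic.

```python
import math
import random
import statistics


def welch_t(sample_a, sample_b):
    """Welch t statistic and a two-sided p-value (normal approximation)."""
    mean_a, mean_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    se = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    t = (mean_a - mean_b) / se
    # erfc(|t| / sqrt(2)) equals 2 * (1 - Phi(|t|)) for the standard normal
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, p


rng = random.Random(0)
# Synthetic BMI-like samples, purely illustrative
simulated = [rng.gauss(27.8, 5.0) for _ in range(4000)]
observed = [rng.gauss(28.0, 5.0) for _ in range(4000)]

t_stat, p_value = welch_t(simulated, observed)
print(f"t = {t_stat:.3f}, p = {p_value:.5f}")
```

With samples of this size, even a difference of a few tenths of a BMI point can give a small p-value, which is worth keeping in mind when reading the wave 4 and wave 6 rows.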

Both models are significantly better in terms of BMI than before the changes in this issue as well as #48.

ld-archer / E_FEM

Investigate BMI #41
