ld-archer closed this issue 3 years ago
No significant difference in the weighted means of these populations (stock_CV1 & transition):
Variable | Obs Weight Mean Std. Dev. Min Max
-------------+-----------------------------------------------------------------
bmi | 3,814 13275679.6 28.0638 4.979973 13.8869 56.1539
Variable | Obs Weight Mean Std. Dev. Min Max
-------------+-----------------------------------------------------------------
bmi | 50,937 49099.014 27.5049 5.577009 .8737087 84.80872
Something of note, however, is the minimum and maximum values. A BMI of 13.8 is dangerously low, and I think it would be considered anorexic in most people. Is this realistic? Should we remove it? The minimum of 0.874 is obviously impossible and will need to be handled. This should be done in reshape_long so that we only need to do it once; I think the culprit is the interpolation step.
Removing this term made things worse: the average BMI dropped by approximately half a point in both the CV and minimal runs. I think this means we should keep it.
Made big changes to the logbmi transition model. Removing l2logbmi as a predictor improved things massively. Still experimenting with the transitions, but we went from a mean BMI of 34 in wave 8 to a mean BMI of 28 over the same period. The p-value for the wave 8 t-test is also above 0.05.
Comparing the summary of ELSA_long.dta before and after imputation, the problem is there beforehand but is made worse by the imputation step:
Before
Variable | Obs Weight Mean Std. Dev. Min Max
-------------+-----------------------------------------------------------------
bmi | 28,663 27758.8844 28.2397 5.262646 2.910022 71.11111
After
Variable | Obs Weight Mean Std. Dev. Min Max
-------------+-----------------------------------------------------------------
bmi | 56,659 54750.5507 27.34451 5.603264 .8737087 84.80872
Final Update
Low values (< 14) have been removed AFTER the interpolation step. That step was causing the low values (and some high ones as well, though those are less troublesome), so the low values are now removed and then imputed in kludge.do by hotdecking. The tables below show logbmi because this is the variable that is imputed in kludge.do. I should have shown logbmi every time I compared stock to transition, but the final values are good, so I'm happy.
Variable | Obs Weight Mean Std. Dev. Min Max
-------------+-----------------------------------------------------------------
logbmi | 4,405 15428835 3.321269 .1730587 2.699424 4.028096
Variable | Obs Weight Mean Std. Dev. Min Max
-------------+-----------------------------------------------------------------
logbmi | 30,398 29412.3988 3.325955 .181115 2.655519 4.264244
Also, see the comparison in the stock population BEFORE and AFTER imputing in kludge.do:
Before:
Variable | Obs Weight Mean Std. Dev. Min Max
-------------+-----------------------------------------------------------------
logbmi | 7,549 7463.84219 3.316639 .171767 2.699424 4.028096
After:
Variable | Obs Weight Mean Std. Dev. Min Max
-------------+-----------------------------------------------------------------
logbmi | 8,780 8779.96029 3.317211 .1727126 2.699424 4.028096
Change in mean is small, and we gain ~1200 data points.
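For anyone reading along, here is a rough Python sketch of the "remove low values, then hotdeck" idea from the final update. It is not the code in kludge.do: the record layout, the `group` donor cells, and the function name `hotdeck_low_bmi` are all assumptions made for illustration. It also works on raw BMI and takes the log afterwards, whereas kludge.do imputes logbmi directly.

```python
import math
import random

BMI_FLOOR = 14.0  # threshold quoted in the update; everything else here is hypothetical


def hotdeck_low_bmi(records, rng, floor=BMI_FLOOR):
    """Blank out BMI values below `floor`, then fill each missing value
    with a randomly chosen observed value from the same donor group."""
    cleaned = []
    for rec in records:
        bmi = rec["bmi"]
        if bmi is not None and bmi < floor:
            bmi = None  # treat implausibly low values as missing
        cleaned.append(dict(rec, bmi=bmi))

    # Pool of observed (donor) values per group.
    donors = {}
    for rec in cleaned:
        if rec["bmi"] is not None:
            donors.setdefault(rec["group"], []).append(rec["bmi"])

    # Fill each missing value from a random donor in the same group.
    for rec in cleaned:
        if rec["bmi"] is None and donors.get(rec["group"]):
            rec["bmi"] = rng.choice(donors[rec["group"]])
    return cleaned


# Toy records, not ELSA data.
records = [
    {"id": 1, "group": "f_60s", "bmi": 0.87},
    {"id": 2, "group": "f_60s", "bmi": 27.5},
    {"id": 3, "group": "f_60s", "bmi": 31.2},
    {"id": 4, "group": "m_70s", "bmi": None},
    {"id": 5, "group": "m_70s", "bmi": 24.9},
]
filled = hotdeck_low_bmi(records, random.Random(0))
logbmi = [math.log(rec["bmi"]) for rec in filled]
```

Because donors are only ever observed values at or above the floor, no imputed value can fall below it, which matches the minimum logbmi staying unchanged in the before/after tables.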
Just to put this to bed, here are the T-tests for BMI in both CV1 and the minimal scenario.
variable | fem_mean_wave3 | elsa_mean_wave3 | p_value_wave3 | fem_mean_wave4 | elsa_mean_wave4 | p_value_wave4 | fem_mean_wave5 | elsa_mean_wave5 | p_value_wave5 | fem_mean_wave6 | elsa_mean_wave6 | p_value_wave6 | fem_mean_wave7 | elsa_mean_wave7 | p_value_wave7 | fem_mean_wave8 | elsa_mean_wave8 | p_value_wave8 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BMI | 28.06897 | 28.36905 | 0.00506 | 27.92054 | 28.31962 | 0.00057 | 27.78603 | 27.98109 | 0.12615 | |||||||||
Smoke now | 0.13437 | 0.14308 | 0.15313 | 0.12042 | 0.13262 | 0.05444 | 0.10865 | 0.11606 | 0.23156 | 0.09922 | 0.10176 | 0.67785 | 0.09 | 0.09444 | 0.48308 | 0.08372 | 0.07076 | 0.03078 |
Smoke ever | 0.6341 | 0.6275 | 0.43436 | 0.63349 | 0.63043 | 0.73383 | 0.63239 | 0.64774 | 0.09607 | 0.63145 | 0.65623 | 0.00996 | 0.63137 | 0.66739 | 0.00044 | 0.62924 | 0.6583 | 0.00851 |
variable | fem_mean_wave3 | elsa_mean_wave3 | p_value_wave3 | fem_mean_wave4 | elsa_mean_wave4 | p_value_wave4 | fem_mean_wave5 | elsa_mean_wave5 | p_value_wave5 | fem_mean_wave6 | elsa_mean_wave6 | p_value_wave6 | fem_mean_wave7 | elsa_mean_wave7 | p_value_wave7 | fem_mean_wave8 | elsa_mean_wave8 | p_value_wave8 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BMI | 28.03767 | 28.30303 | 0.00037 | 27.96525 | 28.28117 | 7E-05 | 27.85072 | 27.97677 | 0.16127 | |||||||||
Smoke now | 0.14356 | 0.14211 | 0.73142 | 0.13067 | 0.13101 | 0.93744 | 0.1194 | 0.11769 | 0.68935 | 0.10931 | 0.1036 | 0.17839 | 0.09961 | 0.09553 | 0.35365 | 0.09227 | 0.07739 | 0.00057 |
Smoke ever | 0.64251 | 0.63611 | 0.26842 | 0.64621 | 0.63148 | 0.01722 | 0.6495 | 0.65028 | 0.90256 | 0.65216 | 0.65828 | 0.35412 | 0.65471 | 0.66837 | 0.05229 | 0.657 | 0.66535 | 0.2703 |
Both models are significantly better in terms of BMI than before the changes in this issue as well as #48.
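As a reminder of what the p-values in these tables measure, below is a minimal sketch of a two-sample comparison of means. It is an assumption that a Welch-style test is appropriate here; the thread does not say which t-test variant or which weights were used. The sketch ignores survey weights and uses a normal approximation for the p-value, which is only reasonable for large samples. The function name and the toy data are made up.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_t_large_sample(a, b):
    """Welch t statistic with a two-sided p-value from the normal
    approximation (unweighted; intended for large samples only)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(a) - mean(b)) / se
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


# Toy values, not model output: the second sample is shifted up by 3 BMI points.
simulated = [27.0 + 0.1 * i for i in range(50)]
observed = [x + 3.0 for x in simulated]

t_same, p_same = welch_t_large_sample(simulated, simulated)
t_shift, p_shift = welch_t_large_sample(simulated, observed)
```

A p-value above 0.05 means the simulated and ELSA means cannot be distinguished at that level, which is the outcome we want for validation.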
BMI is a bit of a problem in the current version of the model.
Firstly, BMI was only measured and recorded on even waves in ELSA, which means we have NO data for odd waves. On top of that, a considerable proportion of BMI data is also missing for the even waves, hovering between 40% and 60% (ridiculous).
To handle the missing data, we first impute BMI for even waves using hotdecking. Then we interpolate the odd waves from the even waves. This already introduces some problems. For example, for people who have a low BMI in wave 2 (e.g. 16) and a higher BMI in wave 4 (e.g. 22), the value for wave 1 comes out impossibly low because it is extrapolated outside the range of known data.
Also, after interpolating we add some noise: we draw from a normal distribution with limits (-2, 2) and add the draw to the interpolated value. We did this in the first place because lagged BMI was almost a perfect predictor of current BMI, which shouldn't be the case.
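To make the wave 1 problem concrete, here is a small Python sketch using the example numbers above. It assumes the fill is a straight line through the two neighbouring even waves, which may not be exactly what the interpolation step does. The function names are hypothetical, and the clamped variant is only one possible guard, not what was implemented.

```python
def linear_fill(wave_a, bmi_a, wave_b, bmi_b, target_wave):
    """Value at target_wave on the straight line through two observed waves."""
    slope = (bmi_b - bmi_a) / (wave_b - wave_a)
    return bmi_a + slope * (target_wave - wave_a)


def clamped_fill(wave_a, bmi_a, wave_b, bmi_b, target_wave):
    """Same line, but never allowed outside the range of the observed values."""
    value = linear_fill(wave_a, bmi_a, wave_b, bmi_b, target_wave)
    low, high = sorted((bmi_a, bmi_b))
    return min(max(value, low), high)


# BMI of 16 in wave 2 and 22 in wave 4, as in the example above.
wave3 = linear_fill(2, 16.0, 4, 22.0, 3)           # between the known waves
wave1 = linear_fill(2, 16.0, 4, 22.0, 1)           # outside the known waves
wave1_clamped = clamped_fill(2, 16.0, 4, 22.0, 1)  # held at the nearest observed value
```

Under these assumptions wave 3 gets 19, but wave 1 gets 13, which is below anything actually observed for that person.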
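The noise step can be sketched roughly as follows, in Python rather than Stata. The thread only gives the limits (-2, 2); the mean of 0, the standard deviation of 1, the rejection-sampling approach, and the function name are assumptions for illustration.

```python
import random


def truncated_normal(rng, mean=0.0, sd=1.0, lower=-2.0, upper=2.0):
    """Draw from a normal distribution, redrawing until the value
    falls strictly inside (lower, upper)."""
    while True:
        draw = rng.gauss(mean, sd)
        if lower < draw < upper:
            return draw


rng = random.Random(2021)         # fixed seed so the sketch is reproducible
interpolated = [19.0, 24.5, 27.3]  # toy interpolated BMI values
noisy = [value + truncated_normal(rng) for value in interpolated]
```

Note that noise of up to 2 BMI points in either direction can push an already low interpolated value even lower, so the bounds check has to come after this step.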
ToDo