Closed by ld-archer 3 years ago
Update: I removed the perturbation, and the outputs were as expected. The prediction was good and the t-test p-value was very high, but logbmi was predicted almost entirely by its own lag (l2logbmi).
regress | logbmi |
---|---|
male | .0014929 |
white | .0020979 |
hsless | .0006197 |
college | -.002541 |
l2age65l | -.0004071 |
l2age6574 | -.0003958 |
l2age75p | -.0009692 |
l2logbmi | .9929164 |
_cons | .0477836 |
_rmse (Root Mean Square Error) | .0354048 |
Adding the perturbation back in improved the prediction model (or at least made it look more realistic), but it has probably now gone a bit too far in the other direction:
regress | logbmi |
---|---|
male | .0103008 |
white | -.0374643 |
hsless | .0263987 |
college | -.0376384 |
l2age65l | -.0017118 |
l2age6574 | -.0002671 |
l2age75p | -.0035212 |
l2logbmi | .0247247 |
_cons | 3.39122 |
_rmse (Root Mean Square Error) | .1762296 |
The RMSE term here is high. To match the US logbmi model, we need an RMSE of ~0.08, so we need to tweak the bounds for the rnormal distribution (currently -2, 2). Also, as some evidence that this change hasn't caused anything drastic further down the line or introduced any systematic error, here is the summary of BMI from the baseline outputs in 2012 and 2036:
. use "/home/luke/Documents/E_FEM_clean/E_FEM/output/ELSA_Baseline/detailed_output/y2012_rep1.dta"
(ELSA_Baseline 2012)
. codebook bmi
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
bmi exp of log BMI
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
type: numeric (float)
range: [14.232368,63.273563] units: 1.000e-07
unique values: 8,267 missing .: 0/10,260
mean: 28.2937
std. dev: 5.31906
percentiles: 10% 25% 50% 75% 90%
22.3992 24.7061 27.534 30.9743 35.0567
. use "/home/luke/Documents/E_FEM_clean/E_FEM/output/ELSA_Baseline/detailed_output/y2036_rep5.dta"
(ELSA_Baseline 2036)
. codebook bmi
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
bmi exp of log BMI
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
type: numeric (float)
range: [14.141178,54.140778] units: 1.000e-06
unique values: 8,453 missing .: 0/8,470
mean: 27.9593
std. dev: 5.11579
percentiles: 10% 25% 50% 75% 90%
21.7853 24.28 27.4627 31.0424 34.7361
The values don't change much over time, but at present the only predictor (aside from sex and age) is the lag of BMI, so this is expected.
According to this website (and others; this is just the first I stumbled across), you can't take the log of a negative number. Going to try to talk to Rob Clay and see if he can help clear this up.
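The same point can be checked directly in Stata. This is a minimal sketch rather than anything from the E_FEM code: `log()` evaluates to missing (`.`) when its argument is not positive, so a perturbation that pushes a value below zero would silently produce missing logbmi values.

```stata
* log() of a non-positive number evaluates to missing (.) in Stata
display log(-1)
display missing(log(-1))

* A positive argument is fine
display log(28)
```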
Rob told me about the lognormal distribution (Wikipedia), and I think that is what we want. It is essentially the exponential of a normally distributed draw, which ensures that every sample is positive (discussed in this Statalist post). The perturbation is now implemented as follows:
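*** Now add noise
* Draw from a lognormal: exp() of an rnormal() draw is always positive
* Only waves 1, 3, 5 and 7 are perturbed; rand is missing elsewhere
gen rand = exp(rnormal(-1, 1)) if (wave==1 | wave==3 | wave==5 | wave==7)
* Add the noise to logbmi wherever a draw was made
replace logbmi = logbmi + rand if !missing(rand)
drop rand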
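As a standalone check of the lognormal idea (a sketch only, separate from the E_FEM code; the seed and sample size are arbitrary assumptions), the following draws from `exp(rnormal(-1, 1))` and confirms that every value is positive:

```stata
* Standalone sketch, not part of the E_FEM code.
* The seed and sample size are arbitrary assumptions.
clear
set seed 12345
set obs 10000

* exp() of a normal draw is lognormal, so every value is strictly positive
gen double rand = exp(rnormal(-1, 1))
summarize rand
assert rand > 0

* The mean of a lognormal draw is exp(mu + sigma^2/2)
display "theoretical mean of rand = " %6.4f exp(-1 + 1^2/2)
```

Because the draw is always positive, its mean is also positive (about 0.61 for these parameters), so adding it to logbmi raises the perturbed values on average rather than scattering them symmetrically.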
The output for this looks good:
regress | logbmi |
---|---|
male | .0076413 |
white | -.016529 |
hsless | .0271319 |
college | -.0373643 |
l2age65l | -.000467 |
l2age6574 | -.0008259 |
l2age75p | -.0047093 |
l2logbmi | .0507376 |
_cons | 3.172113 |
_rmse (Root Mean Square Error) | .1741516 |
The coefficient is perhaps closer to what we would expect than in the previous attempts, and the projected outputs also look pretty good:
. use "/home/luke/Documents/E_FEM_clean/E_FEM/output/ELSA_Baseline/detailed_output/y2012_rep1.dta"
(ELSA_Baseline 2012)
. codebook bmi
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
bmi exp of log BMI
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
type: numeric (float)
range: [14.232368,63.273563] units: 1.000e-07
unique values: 8,260 missing .: 0/10,260
mean: 28.3385
std. dev: 5.36592
percentiles: 10% 25% 50% 75% 90%
22.37 24.6998 27.5808 31.1125 35.1471
. use "/home/luke/Documents/E_FEM_clean/E_FEM/output/ELSA_Baseline/detailed_output/y2040_rep2.dta"
(ELSA_Baseline 2040)
. codebook bmi
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
bmi exp of log BMI
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
type: numeric (float)
range: [12.894082,50.097458] units: 1.000e-07
unique values: 7,947 missing .: 0/7,970
mean: 26.529
std. dev: 4.84926
percentiles: 10% 25% 50% 75% 90%
20.6713 23.0888 26.0815 29.4759 32.9333
The minimum value of ~13 is a bit worrying, but I don't think it is impossible. For reference, this value belongs to a 96-year-old male, so he could just be extremely frail?
Now to play with the bounds for the normal distribution to get the RMSE term to ~0.08. I think we need larger bounds than at present, so trying -2, 2 first.
Nope, I had it completely the wrong way round: we need smaller bounds and LESS noise to reduce the Root Mean Square Error... Clue is in the name! Model:
regress | logbmi |
---|---|
male | .0083428 |
white | -.0161309 |
hsless | .0281868 |
college | -.038732 |
l2age65l | -.0004097 |
l2age6574 | -.0009618 |
l2age75p | -.004859 |
l2logbmi | .0009477 |
_cons | 3.363604 |
_rmse (Root Mean Square Error) | .178632 |
Trying (-0.5, 0.5).
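The link between the amount of noise, the RMSE and the lag coefficient can be sketched with a toy simulation. Everything here is an assumption made for illustration (the variable names, the "true" relationship, the sample size, and the choice to add independent noise to both the outcome and its lag); it is not the E_FEM transition model.

```stata
* Toy simulation, not the E_FEM model. All values below are assumptions.
clear
set seed 12345
set obs 5000

* Hypothetical "true" log BMI at the previous and current time points
gen double logbmi_lag = rnormal(3.3, 0.18)
gen double logbmi_now = 0.5 + 0.85*logbmi_lag + rnormal(0, 0.03)

* Add zero-mean normal noise with different standard deviations,
* then refit the regression of the outcome on its lag
foreach s of numlist 0.5 0.1 0.08 {
    quietly {
        gen double y = logbmi_now + rnormal(0, `s')
        gen double x = logbmi_lag + rnormal(0, `s')
        regress y x
    }
    display "sd = `s'   coefficient = " %6.4f _b[x] "   RMSE = " %6.4f e(rmse)
    drop y x
}
```

In this sketch a larger standard deviation gives a larger RMSE and pulls the coefficient on the lag towards zero, which is the direction of travel described above.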
Starting to realise that I was using the rnormal() function slightly wrong: its arguments are the mean and standard deviation, not lower and upper bounds. This is an unfortunate relic from when we were using runiform(), which does take bounds. Anyway, here are the trial-and-error values that get us to where we want:
rnormal (mean, sd) | BMI Coefficient | RMSE |
---|---|---|
(1, 0.5) | 0.012 | 0.1776 |
(1, 2) | 0.000012 | 0.1787 |
(2, 0.5) | 0.00181 | 0.1786 |
(0, 0.5) | 0.0816 | 0.1714 |
(0, 0.1) | 0.751 | 0.0931 |
(0, 0.09) | 0.788 | 0.0868 |
(0, 0.08) | 0.824 | 0.0801 |
The final values are where we want them to be. Debugging now, so will see what effect this has on the t-tests and BMI projections.
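The difference between the two functions can be seen in a short standalone sketch (the seed and sample size are arbitrary assumptions, and it assumes a Stata version in which runiform() accepts two arguments):

```stata
* Standalone sketch; seed and sample size are arbitrary assumptions.
clear
set seed 12345
set obs 100000

* runiform(a, b): a and b are lower and upper bounds
gen double u = runiform(-2, 2)

* rnormal(m, s): m is the mean and s is the standard deviation
gen double z = rnormal(-2, 2)

summarize u z

* Every uniform draw is inside the bounds...
assert inrange(u, -2, 2)

* ...but many normal draws fall outside (-2, 2)
count if !inrange(z, -2, 2)
```

So `rnormal(-2, 2)` is not "noise between -2 and 2" but noise centred on -2 with a standard deviation of 2, which is why the earlier attempts added far more noise than intended.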
Went back to perturbing the non-logged value before converting to logs. Settled on a normal distribution with mean 0 and standard deviation 1.8, i.e. rnormal(0, 1.8), which has resulted in the following transition model:
regress | logbmi |
---|---|
male | .00218 |
white | -.0041433 |
hsless | .0050747 |
college | -.0068029 |
l2age65l | -.0004866 |
l2age6574 | -.0003002 |
l2age75p | -.001493 |
l2logbmi | .8664084 |
_cons | .4805569 |
_rmse (Root Mean Square Error) | .0709711 |
Predictive power of l2logbmi is strong, but not over the top. Also, the RMSE term is similar to that seen in the American model. Projections look good too:
. use "/home/luke/Documents/E_FEM_clean/E_FEM/output/ELSA_Baseline/detailed_output/y2012_rep1.dta"
(ELSA_Baseline 2012)
. codebook bmi
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
bmi exp of log BMI
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
type: numeric (float)
range: [14.232368,63.273563] units: 1.000e-07
unique values: 8,258 missing .: 0/10,260
mean: 28.3738
std. dev: 5.41119
percentiles: 10% 25% 50% 75% 90%
22.4124 24.714 27.5681 31.1179 35.2483
. use "/home/luke/Documents/E_FEM_clean/E_FEM/output/ELSA_Baseline/detailed_output/y2032_rep1.dta"
(ELSA_Baseline 2032)
. codebook bmi
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
bmi exp of log BMI
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
type: numeric (float)
range: [15.61116,49.203938] units: 1.000e-06
unique values: 9,094 missing .: 0/9,116
mean: 27.8157
std. dev: 4.42034
percentiles: 10% 25% 50% 75% 90%
22.4994 24.7219 27.466 30.5434 33.4366
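The approach settled on here (perturb BMI on its original scale with rnormal(0, 1.8), then take logs) might look something like the sketch below. The simulated BMI values, seed and sample size are assumptions for illustration, and the wave restriction from the earlier snippet is left out; this is not the actual E_FEM code.

```stata
* Standalone sketch, not the E_FEM code. Simulated data are assumptions.
clear
set seed 12345
set obs 10000

* Hypothetical BMI values, lognormal so that they are all positive
gen double bmi = exp(rnormal(3.3, 0.18))

* Perturb on the original (non-logged) scale: mean 0, sd 1.8
replace bmi = bmi + rnormal(0, 1.8)

* log() returns missing for non-positive input, so count any such rows
count if bmi <= 0

* Convert to logs after the perturbation
gen double logbmi = log(bmi)
summarize bmi logbmi
```

Perturbing before taking logs keeps the noise symmetric around zero on the BMI scale, and with a standard deviation of 1.8 a realistic BMI is very unlikely to be pushed below zero.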
The BMI perturbation step has been left alone for a long time, but Bryan has pointed out a potential reason for the poor prediction.
I removed the l2logbmi term from the logbmi transition model, which seemed to make the prediction much better, and the cross validation provided some evidence to support this. However, Bryan explained that this makes the logbmi variable less 'sticky', meaning that if we tried to intervene on a respondent and change BMI, this value would then have no effect on future values. This could mean that interventions on BMI don't work at all, so this needs to be fixed.
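Bryan's point about stickiness can be illustrated with a small sketch. It uses the l2logbmi coefficient and constant from the last transition model in this thread (rounded), ignores all other covariates, and compares a hypothetical respondent with and without an intervention that lowers BMI from 30 to 27. The starting values and the number of steps are assumptions.

```stata
* Illustration only: other covariates are ignored and the starting
* BMI values (30 vs 27) are hypothetical.
foreach b_lag of numlist 0.866 0 {
    scalar base    = log(30)
    scalar treated = log(27)
    forvalues t = 1/3 {
        * Deterministic part of the transition: constant + lag term
        scalar base    = 0.481 + `b_lag'*base
        scalar treated = 0.481 + `b_lag'*treated
        display "lag coefficient `b_lag', step `t': gap in logbmi = " %6.4f (base - treated)
    }
}
```

With the lag term in the model the gap created by the intervention shrinks gradually over successive steps. With the coefficient set to 0 (the lag term removed) the gap is gone after a single step, so the intervention leaves no trace in later predictions.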
There are a few things to do before trying the complicated things:
If the above steps don't stop the crazy over-prediction (or if they make BMI prediction bad in other ways), then we need to try something more complicated. Bryan spoke about doing a 4-year prediction cycle instead, but that would mean anything relying on BMI would also need to be on a 4-year prediction cycle, which is not ideal.