BMI Perturbation - Githubissues

The BMI perturbation step has been left alone for a long time, but Bryan has pointed out a potential reason for the poor prediction.

I removed the l2logbmi term from the logbmi transition model, which seemed to make the prediction much better, and the cross validation provided some evidence to support this. However, Bryan explained that this makes the logbmi variable less 'sticky', meaning that if we tried to intervene on a respondent and change BMI, this value would then have no effect on future values. This could mean that interventions on BMI don't work at all, so this needs to be fixed.
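To make the 'stickiness' point concrete, here is a minimal sketch, not the model code. It assumes the transition behaves like a simple first-order autoregression in logbmi, so a one-off shift shrinks by a factor of the l2logbmi coefficient at each two-year step. The coefficients 0.99 and 0.87 are rounded from fits reported in this issue; the BMI change from 30 to 25 and the local macro names are hypothetical.

```stata
* Hypothetical intervention: lower BMI from 30 to 25 (size of the shift in logs)
local shift = ln(30) - ln(25)

* With lag coefficient b, the shift remaining after k steps is b^k * shift
foreach b in 0.99 0.87 0 {
    forvalues k = 1/3 {
        local rem = `b'^`k' * `shift'
        display "b = `b', step `k': remaining shift = " %6.4f `rem'
    }
}
```

With b = 0 nothing of the change survives the first step, which is the problem Bryan described.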

There are a few things to do before trying the complicated things:

- [x] First, remove the perturbation step and see what effect this has on the T-tests
- [x] Second, add the perturbation step back in, but this time perturb the logbmi values and NOT normal BMI
  - This will be a bit of trial and error for the bounds for the normal distribution to be sampled from
  - If we like the outputs, stick with this.

If the above steps don't stop the crazy overprediction (or they make BMI prediction bad in other ways), then we need to try something more complicated. Bryan spoke about doing a 4 year prediction cycle instead, but that would mean anything relying on BMI would also need to be on a 4 year prediction cycle, which is not very good.

Update: Removed the perturbation, and the outputs were as expected. The prediction was good, and the T-tests p-value was very high, but the model was almost entirely predicted by the lag of logbmi.

regress
logbmi
male    .0014929
white   .0020979
hsless  .0006197
college -.002541
l2age65l    -.0004071
l2age6574   -.0003958
l2age75p    -.0009692
l2logbmi    .9929164
_cons   .0477836
| Root Mean Square Error
_rmse   .0354048

The prediction model improved by adding the perturbation back in (or at least looked more realistic), but now probably has gone a bit too far in the other direction:

regress
logbmi
male    .0103008
white   -.0374643
hsless  .0263987
college -.0376384
l2age65l    -.0017118
l2age6574   -.0002671
l2age75p    -.0035212
l2logbmi    .0247247
_cons   3.39122
| Root Mean Square Error
_rmse   .1762296

The RMSE term here is high. To match the US logbmi model, we need an RMSE of ~0.08, so we need to tweak the bounds for the rnormal distribution (currently -2, 2). Also just to provide some evidence that this change hasn't made some crazy changes down the line, or introduced any systematic error, here is the summary of BMI from the baseline outputs in 2012 and 2036:


. use "/home/luke/Documents/E_FEM_clean/E_FEM/output/ELSA_Baseline/detailed_output/y2012_rep1.dta"
(ELSA_Baseline 2012)

. codebook bmi

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
bmi                                                                                                                                                            exp of log BMI
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                  type:  numeric (float)

                 range:  [14.232368,63.273563]        units:  1.000e-07
         unique values:  8,267                    missing .:  0/10,260

                  mean:   28.2937
              std. dev:   5.31906

           percentiles:        10%       25%       50%       75%       90%
                           22.3992   24.7061    27.534   30.9743   35.0567

. use "/home/luke/Documents/E_FEM_clean/E_FEM/output/ELSA_Baseline/detailed_output/y2036_rep5.dta"
(ELSA_Baseline 2036)

. codebook bmi

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
bmi                                                                                                                                                            exp of log BMI
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [14.141178,54.140778]        units:  1.000e-06
         unique values:  8,453                    missing .:  0/8,470

                  mean:   27.9593
              std. dev:   5.11579

           percentiles:        10%       25%       50%       75%       90%
                           21.7853     24.28   27.4627   31.0424   34.7361

The values don't change much over time, but at present the only predictor (aside from sex and age) is the lag of BMI, so this is expected.

According to this website (the first of several I stumbled across), you can't calculate the log of a negative number. Going to try to talk to Rob Clay and see if he can help clear this up.

Rob told me about the lognormal distribution (Wikipedia), and I think that is what we want. It is essentially just taking the exponential of a normal distribution, which ensures that any sample will be positive (discussed in this Statalist post). The perturbation is now implemented as follows:

*** Now add noise
* Draw lognormal noise, exp(rnormal(mean, sd)), for waves 1, 3, 5 and 7 only,
* then add it to logbmi; exp() keeps every draw positive
gen rand = exp(rnormal(-1, 1)) if inlist(wave, 1, 3, 5, 7)
replace logbmi = logbmi + rand if !missing(rand)
drop rand
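As a quick check of the two properties discussed above (logs of negative numbers are undefined, and lognormal draws are always positive), here is a small standalone sketch. The seed and sample size are arbitrary assumptions and it does not touch the model data. The theoretical mean of exp(rnormal(m, s)) is exp(m + s^2/2), which is printed for comparison with the sample mean.

```stata
clear
set seed 12345
set obs 100000

* ln() of a negative number is missing in Stata
display ln(-1)

* Same draw as in the perturbation step: exponential of a normal(-1, 1)
gen rand = exp(rnormal(-1, 1))
assert rand > 0                 // every draw is strictly positive
summarize rand

* Theoretical mean of this lognormal, for comparison with the summary above
display "theoretical mean = " exp(-1 + 1^2/2)
```

Because every draw is positive, the noise has a positive mean rather than being centred on zero.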

The output for this looks good:

regress
logbmi
male    .0076413
white   -.016529
hsless  .0271319
college -.0373643
l2age65l    -.000467
l2age6574   -.0008259
l2age75p    -.0047093
l2logbmi    .0507376
_cons   3.172113
| Root Mean Square Error
_rmse   .1741516

The coefficient is perhaps closer to what we would expect than the previous attempts, and the outputs in the future also look pretty good:

. use "/home/luke/Documents/E_FEM_clean/E_FEM/output/ELSA_Baseline/detailed_output/y2012_rep1.dta"
(ELSA_Baseline 2012)

. codebook bmi

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
bmi                                                                                                                                                            exp of log BMI
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [14.232368,63.273563]        units:  1.000e-07
         unique values:  8,260                    missing .:  0/10,260

                  mean:   28.3385
              std. dev:   5.36592

           percentiles:        10%       25%       50%       75%       90%
                             22.37   24.6998   27.5808   31.1125   35.1471

. use "/home/luke/Documents/E_FEM_clean/E_FEM/output/ELSA_Baseline/detailed_output/y2040_rep2.dta"
(ELSA_Baseline 2040)

. codebook bmi

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
bmi                                                                                                                                                            exp of log BMI
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [12.894082,50.097458]        units:  1.000e-07
         unique values:  7,947                    missing .:  0/7,970

                  mean:    26.529
              std. dev:   4.84926

           percentiles:        10%       25%       50%       75%       90%
                           20.6713   23.0888   26.0815   29.4759   32.9333

.

The min value of ~13 is a bit worrying, but I don't think it is impossible. For reference, this value is for a 96 year old male, so he could just be extremely frail?

Now to play with the bounds for the normal distribution to get the RMSE term to ~0.08. I think we need larger bounds than at present, so trying -2, 2 first.

Nope, had it completely the wrong way round: we need smaller bounds and LESS noise to reduce the Root Mean Square Error... the clue is in the name! Model:

regress
logbmi
male    .0083428
white   -.0161309
hsless  .0281868
college -.038732
l2age65l    -.0004097
l2age6574   -.0009618
l2age75p    -.004859
l2logbmi    .0009477
_cons   3.363604
| Root Mean Square Error
_rmse   .178632

Trying (-0.5, 0.5).
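The link between noise and RMSE can be seen in a small simulation. This is only a sketch on made-up data: the variable names, the distribution of the 'true' log BMI, and the choice to add noise to both the outcome and its lag are all simplifying assumptions, not the E_FEM code.

```stata
clear
set seed 1
set obs 5000

* Hypothetical 'true' lagged and current log BMI, strongly persistent
gen l2true = rnormal(3.3, 0.18)
gen truey  = l2true + rnormal(0, 0.03)

* Add noise with standard deviation s to both, refit, and report the fit
foreach s in 0.5 0.1 0.08 {
    gen y  = truey  + rnormal(0, `s')
    gen l2 = l2true + rnormal(0, `s')
    quietly regress y l2
    display "sd = `s': coef = " %6.3f _b[l2] "  rmse = " %6.4f e(rmse)
    drop y l2
}
```

More noise both weakens the lag coefficient and inflates e(rmse), so reducing the RMSE means reducing the noise.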

Starting to realise that I was using the rnormal() function slightly wrong, an unfortunate relic from when we were using runiform(), which works differently: runiform(a, b) takes lower and upper bounds, whereas rnormal(m, s) takes a mean and a standard deviation. Anyway, here are the trial-and-error values that get us to where we want:

rnormal(mean, sd)	l2logbmi coefficient	RMSE
(1, 0.5)	0.012	0.1776
(1, 2)	0.000012	0.1787
(2, 0.5)	0.00181	0.1786
(0, 0.5)	0.0816	0.1714
(0, 0.1)	0.751	0.0931
(0, 0.09)	0.788	0.0868
(0, 0.08)	0.824	0.0801

The final values are where we want them to be. Debugging now, so will see what effect this has on the T-tests and BMI projections.
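The rnormal()/runiform() mix-up mentioned above is easy to see side by side. This standalone sketch assumes Stata 14 or later, where runiform() accepts two arguments; the seed, sample size and variable names are arbitrary.

```stata
clear
set seed 2024
set obs 100000

gen u = runiform(-2, 2)   // uniform draw bounded by -2 and 2
gen n = rnormal(-2, 2)    // normal draw with mean -2 and sd 2, unbounded

summarize u n
assert inrange(u, -2, 2)  // only the uniform draws respect the 'bounds'
```

So passing (-2, 2) to rnormal() shifts the noise to a mean of -2 rather than bounding it between -2 and 2.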

Went back to perturbing the non-logged value before converting to logs. Settled on a normal distribution with (mean, sd) of (0, 1.8), which has resulted in the following transition model:

regress
logbmi
male    .00218
white   -.0041433
hsless  .0050747
college -.0068029
l2age65l    -.0004866
l2age6574   -.0003002
l2age75p    -.001493
l2logbmi    .8664084
_cons   .4805569
| Root Mean Square Error
_rmse   .0709711

Predictive power of l2logbmi is strong, but not over the top. Also, the RMSE term is similar to that seen in the American model. Projections look good too:

. use "/home/luke/Documents/E_FEM_clean/E_FEM/output/ELSA_Baseline/detailed_output/y2012_rep1.dta"
(ELSA_Baseline 2012)

. codebook bmi

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
bmi                                                                                                                                                            exp of log BMI
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [14.232368,63.273563]        units:  1.000e-07
         unique values:  8,258                    missing .:  0/10,260

                  mean:   28.3738
              std. dev:   5.41119

           percentiles:        10%       25%       50%       75%       90%
                           22.4124    24.714   27.5681   31.1179   35.2483

. use "/home/luke/Documents/E_FEM_clean/E_FEM/output/ELSA_Baseline/detailed_output/y2032_rep1.dta"
(ELSA_Baseline 2032)

. codebook bmi

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
bmi                                                                                                                                                            exp of log BMI
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [15.61116,49.203938]         units:  1.000e-06
         unique values:  9,094                    missing .:  0/9,116

                  mean:   27.8157
              std. dev:   4.42034

           percentiles:        10%       25%       50%       75%       90%
                           22.4994   24.7219    27.466   30.5434   33.4366
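For reference, the final approach (noise added to BMI itself, then converted to logs) might look something like the sketch below. It runs on made-up data so it can be tried in isolation. The variable name bmi, the simulated BMI distribution, and the reuse of the odd-wave filter from the earlier snippet are assumptions rather than the actual E_FEM code; only rnormal(0, 1.8) comes from the text above.

```stata
clear
set seed 56
set obs 1000

* Hypothetical data: waves 1 to 8 and a plausible BMI distribution
gen wave   = 1 + mod(_n, 8)
gen bmi    = exp(rnormal(ln(28), 0.18))
gen logbmi = ln(bmi)

* Perturb BMI on the raw scale, then recompute the log
gen noise = rnormal(0, 1.8) if inlist(wave, 1, 3, 5, 7)
replace bmi = bmi + noise if !missing(noise)
assert bmi > 0                               // ln() needs a positive value
replace logbmi = ln(bmi) if !missing(noise)
drop noise
```

Perturbing on the raw scale allows symmetric noise centred on zero, and the assert guards against a draw pushing BMI to zero or below before the log is taken.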

ld-archer / E_FEM

BMI Perturbation #56