Add socioeconomic variables for next project

ld-archer commented 3 years ago

Next project is looking at links between socioeconomic status and health, so will need to add back variables related to wealth and income.

Vars to add:

[x] Inflation multiplier: CyyyyCPINDEX
[x] Total family wealth: HwATOTB
[x] Total couple level income: HwITOT

Down the line we might consider disaggregating the combined variables, but this will be a good start.

ld-archer commented 3 years ago

Wealth and income are both back in. Most of the work was already done, as they were still in the input data; I just had to re-add them to the model. The inflation multiplier will need to be included, but I think I can get away with only including it in reshape_long. The calculations to account for inflation can be done in that script, and then all input populations will inherit the adjusted values.
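A minimal sketch of the inflation adjustment idea, written in Python purely for illustration (the real work would live in the Stata script reshape_long). The CPI figures, the `deflate` helper and the row layout are all made up for the example; the only thing taken from the plan is the idea of scaling each nominal value by a CPI ratio.

```python
# Hypothetical CPI index values by year (illustrative numbers, not real CPI data).
CPI = {2010: 90.0, 2012: 100.0, 2014: 104.0}

def deflate(nominal, year, base_year, cpi=CPI):
    """Express a nominal amount from `year` in `base_year` prices."""
    return nominal * cpi[base_year] / cpi[year]

# Long-format records: one row per person per wave (hypothetical layout).
rows = [
    {"id": 1, "year": 2010, "wealth": 180_000.0},
    {"id": 1, "year": 2014, "wealth": 208_000.0},
]

# Add an inflation-adjusted column once, so anything built from these rows inherits it.
for row in rows:
    row["wealth_real"] = deflate(row["wealth"], row["year"], base_year=2012)
```

With these invented numbers both rows come out at the same real value, which is the point: nominal growth that only tracks prices disappears after adjustment.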

ld-archer commented 3 years ago

This has definitely not worked as we wanted; see the table below for the cross-validation t-tests for income and wealth.

Total Family Wealth and Total Family Income, both in thousands, by wave:

wave	wealth_fem_mean	wealth_elsa_mean	wealth_p_value	income_fem_mean	income_elsa_mean	income_p_value
1	248.5051	205.6091	0	19.73524	18.9182	0.00137
2	186.0797	243.5422	0	21.48945	19.49766	0
3	193.403	276.8282	0	21.78878	20.9715	0.01026
4	206.8602	280.1831	0	22.14617	21.40408	0.01726
5	216.302	293.1316	0	22.23113	21.83524	0.21191
6	225.9	314.7851	0	22.31243	23.81972	1E-05
7	232.6775	347.3489	0	22.32754	25.30901	0
8	238.6429	397.9592	0	22.30042	26.44413	0
9	240.8246	441.0225	0	22.22436	27.41521	0

Need to look into this. The model projections of income and wealth stay roughly flat or grow only slowly across waves (wealth drops sharply after wave 1), whereas the actual ELSA data show a fairly significant increase with each wave. The first thing to check is the populations we use for the t-tests: it's possible we are not using exactly the same populations on each side, and if the populations aren't the same, a difference in the age distribution or something similar could cause this.
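A rough sketch of the population check, in Python for illustration only (not the actual cross-validation script). The data, the `restrict_to_common_ids` and `welch_t` helpers, and the id-keyed layout are all hypothetical; Welch's t statistic is used here as an assumption, since the thread does not say which t-test variant is run.

```python
from math import sqrt
from statistics import mean, variance

def restrict_to_common_ids(sim, obs):
    """Keep only people present in both samples (dicts keyed by person id)."""
    common = sim.keys() & obs.keys()
    return {i: sim[i] for i in common}, {i: obs[i] for i in common}

def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

# Hypothetical data for one wave: person id -> (age, income in thousands).
fem = {1: (60, 20.0), 2: (65, 22.0), 3: (70, 21.0), 4: (90, 12.0)}
elsa = {1: (60, 21.0), 2: (65, 23.0), 3: (70, 25.0), 5: (52, 40.0)}

fem_c, elsa_c = restrict_to_common_ids(fem, elsa)

# Compare mean age first: a gap here means the two populations differ.
age_gap_all = mean(a for a, _ in fem.values()) - mean(a for a, _ in elsa.values())
age_gap_common = mean(a for a, _ in fem_c.values()) - mean(a for a, _ in elsa_c.values())

# Only then compare incomes, on the matched population.
t_income = welch_t([y for _, y in fem_c.values()], [y for _, y in elsa_c.values()])
```

If the age gap only vanishes after restricting to common ids, the mismatch in populations is at least part of the story.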

ld-archer commented 3 years ago

After meeting with Bryan, a few potential reasons for this poor performance have been identified:

- Not accounting for inflation
- Taking logs of values when the raw values include negatives and zeros
- Values are at the benefit unit level (i.e. couple-level if in a couple, individual if single)

To fix these problems, we need to:

[x] Include the CPINDEX variable and adjust for inflation
- Should try to set 2012 as base year as this is the first year of our simulation
[x] Don't take log of financial values initially
- Took logs initially for 2 reasons:
  1. The long right tail in the distribution of raw values can mean that traditional regression models lose information from extreme values
  2. The difference between e.g. £1,000 and £10,000 is more impactful in terms of outcomes than the difference between £100,000 and £109,000, despite being the same amount
- Logs are not the right approach here because of the negative and zero values; we will eventually need to find a more sophisticated solution to this problem
[x] Adjust couples to not be at the benefit unit level
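To make the log problem concrete, here is a small Python illustration with made-up values. The `safe_log` helper is hypothetical. The inverse hyperbolic sine at the end is just one commonly used transform that handles zeros and negatives; it is shown as an assumption, not as the solution this project will adopt.

```python
import math

# Made-up financial values, including a debt and a zero.
values = [-5_000.0, 0.0, 1_000.0, 10_000.0, 100_000.0, 109_000.0]

def safe_log(x):
    """Return log(x), or None where the log is undefined (x <= 0)."""
    return math.log(x) if x > 0 else None

# The first two entries come back as None: that information is simply lost.
logged = [safe_log(v) for v in values]

# Reason 2 above: two equal £9,000 gaps look very different on the log scale.
gap_low = math.log(10_000) - math.log(1_000)       # roughly 2.30
gap_high = math.log(109_000) - math.log(100_000)   # roughly 0.086

# asinh is defined for every real number and is close to log(2x) for large x.
ihs = [math.asinh(v) for v in values]
```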

ld-archer commented 3 years ago

Big improvement even just after removing the log values - see commit 1075c6d

ld-archer commented 3 years ago

Both inflation adjustment and benefit unit adjustment are now complete in enhancement/70-socioeconomic-vars (see commit ba28bff in that branch).

The inflation adjustment had a big positive impact on the t-tests; however, the couple benefit unit adjustment seems to have undone some of that good work. The last step here, to see what kind of impact we can make, is to deal with the topcoded data from ELSA.

ELSA uses a special code, .t, for total family income above £900,000 to preserve anonymity (see p. 599 of the Harmonized ELSA codebook G.2). We will replace these values in reshape_long.do and crossvalidation_ELSA_core.do with the topcoded value (£900,000). In the future, if this is not working or we want to be more clever about it, we can look into tobit models for predicting what these values would be above the threshold.
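The replacement itself is simple; a Python sketch of the logic follows (the real change belongs in the two .do files). The `replace_topcoded` helper and the string-based raw values are hypothetical, and other missing codes are deliberately not handled here.

```python
TOPCODE_MARKER = ".t"       # special code for topcoded income, as described above
TOPCODE_VALUE = 900_000.0   # threshold in pounds, as described above

def replace_topcoded(raw):
    """Map the topcode marker to the threshold; parse everything else as a float.

    Other missing codes are not handled in this sketch and would raise ValueError.
    """
    return TOPCODE_VALUE if raw == TOPCODE_MARKER else float(raw)

raw_income = ["18500", ".t", "42000.5"]   # hypothetical raw values
income = [replace_topcoded(v) for v in raw_income]
```

Note that this treats every topcoded household as sitting exactly at the threshold, which understates the true mean; that is the gap a tobit-style model would be trying to close.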

ld-archer / E_FEM

Add socioeconomic variables for next project #70