Calculate units of alcohol from wave-specific data

ld-archer commented 2 years ago

Unit estimation

This paper (Iparraguirre 2015) estimates the units drunk using information from the individual wave data files that is not carried through to the harmonised data. Specifically, three variables are available from wave 4 onwards:

pints of beer consumed in the past week
glasses of wine consumed in the past week
measures of spirit consumed in the past week

The paper then uses three different calculators to convert these measures into units: NHS, General Lifestyle Survey (GLS), and Drinkaware. I think the NHS guidelines would be a good place to start.
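As a rough sketch of the unit calculation, the Python below uses the standard UK definition (units = volume in ml x ABV % / 1000). The serving sizes and strengths are illustrative assumptions, not values taken from the paper or from ELSA: a 568 ml pint at 3.6% ABV, a 175 ml glass of wine at 12%, and a 25 ml measure of spirits at 40%. The function and constant names are hypothetical.

```python
# Assumed serving sizes (ml) and strengths (% ABV). Illustrative only:
# the calculators compared in the paper may use different values.
ASSUMED_SERVINGS = {
    "beer_pints": (568, 3.6),
    "wine_glasses": (175, 12.0),
    "spirit_measures": (25, 40.0),
}


def units_per_serving(volume_ml, abv_percent):
    """UK alcohol units: volume (ml) x ABV (%) / 1000."""
    return volume_ml * abv_percent / 1000


def weekly_units(beer_pints, wine_glasses, spirit_measures):
    """Total weekly units, or None if any of the three counts is missing."""
    counts = {
        "beer_pints": beer_pints,
        "wine_glasses": wine_glasses,
        "spirit_measures": spirit_measures,
    }
    if any(n is None for n in counts.values()):
        return None
    return sum(
        n * units_per_serving(*ASSUMED_SERVINGS[name])
        for name, n in counts.items()
    )
```

Returning None when any count is missing keeps partial responses out of the total; whether to treat a missing count as zero instead is a separate decision.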

In practical terms, using this information means adding another script to run during reshape_long.do, probably before the data is reshaped, as I think that makes more sense. We would need to load the wave data files one by one (which may mean copying the wave 4-9 wave_x_elsa_data_v3.dta files into input_data/; I haven't decided on this yet), use the information for each idauniq to calculate the number of units, and add this to the harmonised wide-format dataset. It makes sense to open a new issue for this problem.

_Originally posted by @ld-archer in https://github.com/ld-archer/E_FEM/issues/77#issuecomment-959716849_

ld-archer commented 2 years ago

[x] Write script to be run in reshape_long.do and generate a separate value for each of pints of beer, glasses of wine, and measures of spirits
- Variables are:
  - scdrspi: Number of measures of spirit the respondent had in the last 7 days
  - scdrwin: Number of glasses of wine the respondent had in the last 7 days
  - scdrpin: Number of pints of beer the respondent had in the last 7 days
[x] Add the new units variable to the data reshape and check it has worked properly
[x] Generate an ordinal variable for the type of drinker:
- Abstainer
- Moderate
- Increasing-risk
- High-risk
- [x] Create dummy variables for the ordinal variable for use in transition models
- [x] Go through the checklist and add the vars to all the scripts where they are needed (and remove problem_drinker)
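A minimal Python sketch of the ordinal drinker type and its dummy variables follows. The weekly-unit cut-offs are assumptions for illustration (they are not stated in this issue), and treating zero weekly units as 'abstainer' is a simplification. All names are hypothetical.

```python
# Assumed gender-specific cut-offs in weekly units:
# (upper bound for moderate, upper bound for increasing-risk).
ASSUMED_THRESHOLDS = {"male": (21, 50), "female": (14, 35)}
CATEGORIES = ["abstainer", "moderate", "increasing_risk", "high_risk"]


def drinker_type(units, sex):
    """Map weekly units to an ordinal drinker category (None if missing)."""
    if units is None:
        return None
    moderate_max, increasing_max = ASSUMED_THRESHOLDS[sex]
    if units == 0:
        return "abstainer"
    if units <= moderate_max:
        return "moderate"
    if units <= increasing_max:
        return "increasing_risk"
    return "high_risk"


def dummies(category):
    """One 0/1 indicator per category (None if the category is missing)."""
    if category is None:
        return None
    return {name: int(name == category) for name in CATEGORIES}
```

For example, `dummies(drinker_type(20, "female"))` sets only the `increasing_risk` indicator to 1 under these assumed cut-offs.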

ld-archer commented 2 years ago

Another thing to think about: this information is collected for the week preceding the survey from wave 4 onwards, whereas waves 2 and 3 record it for the heaviest day in the past week. I think the best way to deal with this is to calculate the units over the past week for waves 4+, then impute this variable for waves 1-3, using either hot-decking or multiple imputation in Stata.

[x] Impute weekly units for waves 1-3
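To illustrate the hot-deck option, here is a small Python sketch rather than the Stata implementation. It assumes each record is a dict and that donors are matched within cells defined by a few covariates. The cell variables, field names and the function itself are hypothetical.

```python
import random
from collections import defaultdict


def hot_deck(records, cell_keys, target, seed=0):
    """Fill missing `target` values with a random donor value from the same cell.

    Records in a cell with no donors are left missing. Returns new dicts and
    does not modify the input records.
    """
    rng = random.Random(seed)
    donors = defaultdict(list)
    for r in records:
        if r[target] is not None:
            donors[tuple(r[k] for k in cell_keys)].append(r[target])

    imputed = []
    for r in records:
        new = dict(r)
        if new[target] is None:
            pool = donors.get(tuple(r[k] for k in cell_keys))
            if pool:
                new[target] = rng.choice(pool)
        imputed.append(new)
    return imputed
```

Here the donor pool is every observed value in the same cell, so in practice it would need restricting to wave 4+ respondents when filling waves 1-3.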

ld-archer commented 2 years ago

Produce some summary stats from the alcbase variable:

Mean units by drinking group (abstainer, moderate etc.)
% missing in each wave
% missing by:
- gender
- age group
- education level
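The summary statistics listed above could be produced along these lines. This is a plain-Python sketch assuming one dict per person-wave; the field names used in the example call are hypothetical.

```python
from collections import defaultdict


def mean_by_group(records, group_key, value_key):
    """Mean of value_key within each group, ignoring missing values."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum, count]
    for r in records:
        if r[value_key] is not None and r[group_key] is not None:
            totals[r[group_key]][0] += r[value_key]
            totals[r[group_key]][1] += 1
    return {g: s / n for g, (s, n) in totals.items()}


def percent_missing_by(records, group_key, value_key):
    """Percentage of records with a missing value_key within each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [missing, total]
    for r in records:
        counts[r[group_key]][1] += 1
        if r[value_key] is None:
            counts[r[group_key]][0] += 1
    return {g: 100.0 * m / t for g, (m, t) in counts.items()}
```

Calling `percent_missing_by(records, "wave", "alcbase")` gives the per-wave missingness, and swapping the group key covers gender, age group and education level.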

Produce regression estimates both WITH and WITHOUT l2alcbase as a predictor of alcbase. Look at the coefficients here; I am assuming the coefficient of l2alcbase will be close to 1, but it might be interesting to check.

Send this stuff to Bryan.

ld-archer commented 2 years ago

Need to do accounting of the alcbase variables in HealthModule.cpp

This means things like making sure drink == 1 if alcbase > 0. Check out the old code for smkint (and remove that at the same time) as well as exstat and adlstat.

ld-archer commented 2 years ago

Some progress since the last update, but also a lot of new questions raised. Additional wave-specific variables with information about alcohol consumption have been included in the process to better categorise those with missing data on drinks. However, this is only truly helpful for identifying abstainers, and it has raised more issues:

How do we really know if someone abstains from all alcohol use or just had a week off drinking?
Those who are not true abstainers but drink infrequently differ from true abstainers in terms of health and social life. How do we identify those people and represent that in the model?
The alcohol consumption information is asked in the self-completion questionnaire and not the core ELSA survey. Does this introduce any bias in terms of people (or groups) that did not take part in the self-completion part of the survey?
How can we improve the prediction of alcbase, given that the current iteration shows some regression to the mean?

Some answers and thoughts:

Abstainers

True abstainers are hard to identify in ELSA, but other variables can help to solve this problem. The scako variable reports how often a respondent has consumed alcohol over the past 12 months, ranging from 'not at all' to 'every day'. Those who answered 'not at all' can be considered true abstainers, whilst decisions have to be made regarding some groups (such as those who drink once or twice a year, up to once or twice a month).

Infrequent drinkers

Using the same variable (scako), we can identify people who report drinking 'once or twice a week' or more but reported zero units in the week leading up to the survey. In the current version of alcbase, we treat these people the same as abstainers in terms of alcohol consumption, which is obviously incorrect. We therefore need a cleverer way of handling infrequent drinkers. One solution is to group people by their response to scako and impute their weekly units based on values from others within the group. We could also do this on an annual basis (multiplying the weekly value by 52), which would help with the really infrequent drinkers (i.e. less than once or twice a month), although we would then have to convert back to weekly consumption. This is difficult, and would probably introduce bias or error where someone has a heavy day once per month that just happens to fall in the week before the survey, but it could be an improvement as long as we check it thoroughly (potentially using data from Understanding Society or another survey that reports alcohol consumption).
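One possible reading of the group-based idea is sketched below in Python. Respondents who are not abstainers (scako != 8, the 'not at all' code used elsewhere in this issue) but reported zero units have that zero replaced by the mean weekly units of everyone in their scako group. The field names are hypothetical, and using the group mean rather than a random donor is an assumption.

```python
from collections import defaultdict

ABSTAINER_CODE = 8  # scako == 8: 'not at all' in the last 12 months


def impute_zero_weeks(records):
    """Replace zero weekly units for non-abstainers with their scako group mean.

    The group mean includes the zero weeks, so it reflects average weekly
    consumption across the whole frequency band. Returns new dicts.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in records:
        if r["weekly_units"] is not None:
            sums[r["scako"]] += r["weekly_units"]
            counts[r["scako"]] += 1

    out = []
    for r in records:
        new = dict(r)
        if r["scako"] != ABSTAINER_CODE and r["weekly_units"] == 0:
            new["weekly_units"] = sums[r["scako"]] / counts[r["scako"]]
        out.append(new)
    return out
```

This is deliberately crude: replacing zeros with the mean raises the group's overall average, so any real version would need checking against an external source, as noted above.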

Self-completion vs Core

Alcohol consumption is only asked in the self-completion questionnaire from wave 4 onwards (where the good data on this is). We need to check that this is not introducing too much bias, or at least be aware of what bias it does introduce. From very quick checks, response to the self-completion questionnaire is higher for younger age groups, higher education levels, married people, healthier people (srh of Good to Excellent), and non-disabled people (anyadl == 0). We need a good idea of the bias involved, so it would probably be worth making an R Notebook detailing the differences between the core and self-completion samples.

ld-archer commented 2 years ago

Improve Prediction of Alcbase

Prediction of alcbase currently shows some regression to the mean: high-risk drinkers become less common over time until they disappear completely. Including l2alcbase in the transition model is a good start, but there is more to do. The first step would be to turn the prediction of consumption into a two-stage process, starting with whether the simulant drinks alcohol at all, and then how much. Another way to keep hold of the long right tail is to add 'knots' in the prediction of alcbase, similar to the work we did with BMI (BMI has a 'knot' at BMI == 30 to maintain those who are above this point). A good place to start for knot points is the category boundaries of abstainer, moderate, increasingRisk and highRisk, but these are gender-specific values, so we will need to account for that. We could either make each term in the model an interaction between male and the term (i.e. male * hsless) or, better, fit a separate model for each gender (more complicated but cleaner).

[x] Convert prediction of alcohol into 2 stage process
- [x] Use current drink variable as it is based on scako == 8 (not at all in last 12 months)
- [x] If simulant does drink, predict how much
  - [x] This will require some accounting in HealthModule (i.e. going from drink == 1 to drink == 0 means alcbase goes to 0)
[x] Add knots to alcbase values in the alcbase transition model
- [x] Turn alcbase prediction into a gender specific model using different values for knot points

ld-archer commented 2 years ago

This idea has now been superseded slightly. Instead of calculating alcbase for each wave and predicting this variable, I am going to try including the variables for consumption of each individual drink type (beer, wine, spirits), and predict a value for each of these independently using Poisson regression. The combined units drunk will then be calculated after prediction. Therefore I'm closing this issue.
