opensafely / codelist-development

Repository for discussion of OpenSAFELY codelists

*CLINICAL CONDITION*: Smokers #1

Open sebbacon opened 4 years ago

sebbacon commented 4 years ago

Code for identifying Smokers into categorical data:

@CarolineMorton

@alexwalkercebm - please review

https://github.com/ebmdatalab/tpp-sql-notebook/issues/4

CarolineMorton commented 4 years ago

TPP Chris says this can be created using SQL directly from the db using this algorithm: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5217540/

CarolineMorton commented 4 years ago

Discussed on call: plan for first paper is to use QoF cluster codes in TPP; then second paper ask algorithm to be implemented in TPP or for us to implement algorithm in data

CarolineMorton commented 4 years ago

@sebbacon I have some code from LSHTM in stata on a way to identify smokers but I think that @chris-tpp were possibly going to add this into the data tables. Is that right?

Another option is to use the QoF clusters in the first instance?

sebbacon commented 4 years ago

@chris-tpp has already prepared a cleaned-up version of the QoF codes for us to use in q1.

They will be building the full algorithm at some point (maybe today) and will also supply the SQL so we could reconstruct it ourselves if we wanted.

CarolineMorton commented 4 years ago

Can we have that file @chris-tpp @sebbacon to add here so we can firm up first definition?

CarolineMorton commented 4 years ago

DRAFT

DEFINITION: Latest record of: Patients categorised into Never Smoker, Ex-smoker and Current Smoker and Missing as per cleaned QOF code list provided by TPP (may need to think about status at specific dates later on, but not now).

patient_id | smoking_status | date_of_smoking_status
1 | Current Smoker | 01-02-2020
34 | Ex-Smoker | 8-09-2019
144 | Never Smoker | 2-01-2014

POTENTIAL BIASES: Worth checking what completeness of the data is using QoF coding - in CPRD there is smoking specific data. If a large amount of missingness then think about additional sources. Current smoker may be more likely to be recorded than never or ex.

CLINICAL SIGN OFF & DATE:

EPIDEMIOLOGY SIGN OFF & DATE:

SHARED WITH WIDER TEAM: Yes/No

FINAL SIGN OFF DATE (and apply label)

alexwalkerepi commented 4 years ago

Worth checking that if patient recorded as non-smoker, but has a smoking code earlier, they are coded as ex-smoker.

CarolineMorton commented 4 years ago

@chris-tpp We have just had a call about this, and we had a few questions about how the smoking algorithm was applied. My understanding is that we are using QOF codes developed by TPP.

chris-tpp commented 4 years ago

Will find out the prevalence question for smoking coding tomorrow morning. We haven’t applied any algorithm yet - we’ve just pulled a list of codes from QOF and labelled them up as non, ex, smoker with a view to adding the algorithm at some point. Will share in the morning too, when back in the office.

CarolineMorton commented 4 years ago

Thank you. Look forward to it.

CarolineMorton commented 4 years ago

Hi @chris-tpp just to come back to this, do you have the codes that you are applying to the smokers/non-smokers/ex-smokers? It would be good to add them to this issue for clarity. Thank you

alexwalkerepi commented 4 years ago

from @chris-tpp by email:

Attached the smoking codes, along with code derivation, category for classification, and numeric status. Some of these will have very low counts but it’s a manageable list so think we classify them all. In brief:

  • The codes have been derived from QOF smoking clusters, a high-level SNOMED code, and term text searches on ‘%tobac%’, ‘%smok%’, ‘%ciga%’ and ‘%pipe%’. The QOF cluster ids / names are attached. The high-level SNOMED code used was ‘365980008’ for ‘Tobacco use and exposure – finding’. We have then examined all the results from the text searches and included any appropriate (for example, there were employment codes for pipe-fitter etc… to take out).
  • The categories required for the algorithm are E – Ex-smoker, N – non-smoker, S – smoker. We’ve also included a category for D for Delete – these look pretty unhelpful and P for Passive, just for completeness. We do not need these yet.
  • Finally, we’ve added a column to indicate if the code can have a numeric value associated with it (for example, for a code with a description “number of cigarettes / day” we need to be careful not to include them as a positive smoker if the recorded value is zero).
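
To illustrate the derivation described in the first bullet, here is a minimal Python sketch of a term-text search followed by manual exclusions. The codes and descriptions in the usage note are hypothetical placeholders, not entries from the attached spreadsheet, and matching is lower-cased here as an assumption about how the original `%...%` searches behaved.

```python
# Fragments taken from the text searches described above.
SEARCH_FRAGMENTS = ("tobac", "smok", "ciga", "pipe")


def find_candidates(code_terms, excluded_codes=frozenset()):
    """Return codes whose description contains any search fragment.

    code_terms: dict mapping code -> description (hypothetical input shape).
    excluded_codes: codes removed after manual review (e.g. occupations).
    """
    return {
        code: term
        for code, term in code_terms.items()
        if code not in excluded_codes
        and any(fragment in term.lower() for fragment in SEARCH_FRAGMENTS)
    }
```

For example, with made-up entries `{"A1": "Cigarette smoker", "B2": "Pipe fitter", "C3": "Asthma"}`, the search would pick up `A1` and `B2`, and passing `excluded_codes={"B2"}` would drop the employment code after review.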

Smoking_Codes_With_Categories_And_Numerics.xlsx Smoking QOF Clusters.txt

alexwalkerepi commented 4 years ago

FINAL

DEFINITION: Latest record of: Patients categorised into Never Smoker, Ex-smoker and Current Smoker and Missing as per cleaned QOF code list provided by TPP (may need to think about status at specific dates later on, but not now).

If feasible: Where a patient's most recent record has a numeric value relating to the number of cigarettes smoked, and this value is 0, change the status code from S to N before running the below algorithm.

patient_id | smoking_status | date_of_smoking_status
1 | Current Smoker | 01-02-2020
34 | Ex-Smoker | 8-09-2019
144 | Never Smoker | 2-01-2014

POTENTIAL BIASES: Worth checking what completeness of the data is using QoF coding - in CPRD there is smoking specific data. If a large amount of missingness then think about additional sources. Current smoker may be more likely to be recorded than never or ex.

CLINICAL SIGN OFF & DATE: Caroline Morton @CarolineMorton 7/4/2020 16:05

EPIDEMIOLOGY SIGN OFF & DATE: Alex Walker @alexwalkercebm 7/4/2020 15:41

SHARED WITH WIDER TEAM: Yes

FINAL SIGN OFF DATE: 7/4/2020

ianjdouglas commented 4 years ago

Agree with Final algorithm posted by Alex. But a couple of questions on the Excel list. Several codes are for "Smoking status at 4 weeks" or 52 weeks. Are these codes specific to people who are trying to quit? Either way, should the status for lines 144, 145 and 146 be D rather than S, as the description is uninformative? Same for lines 200 and 211. Line 225 has a D and I think is effectively the same as the others I've mentioned. If it is about quitting, should line 229 be E rather than N? Sorry if this has all been gone over!

StatsFizz commented 4 years ago

I'm sure this is fine, but I don't really follow what is meant here: "If feasible, convert patients most recent record where there's a number of cigarettes and it's 0, from S to N"

alexwalkerepi commented 4 years ago

@StatsFizz sorry this was badly worded. It's in relation to Chris' comment:

Finally, we’ve added a column to indicate if the code can have a numeric value associated with it (for example, for a code with a description “number of cigarettes / day” we need to be careful not to include them as a positive smoker if the recorded value is zero).

Perhaps: "If feasible: Where a patient's most recent record has a numeric value relating to the number of cigarettes smoked, and this value is 0, change the status code from S to N."

alexwalkerepi commented 4 years ago

@ianjdouglas I agree that the examples you give are probably related to quitting and therefore ambiguous between S and E. Trouble is changing them to D would mean that we lose their status entirely if there isn't another smoking code.

StatsFizz commented 4 years ago

Thanks that’s much clearer.

Maybe that should come prior to (& be implemented prior to) the other rules (e.g. current smoker – most recent…); i.e. you don’t want someone whose most recent entry was “smokes 0 cigarettes” to be classed as a non-smoker if their last few entries were about heavy smoking (you’d want them to be ex). Does that make sense?

alexwalkerepi commented 4 years ago

@StatsFizz yes I agree, have edited in the definition to reflect that.

ianjdouglas commented 4 years ago

Thanks @alexwalkercebm, that makes sense. Should line 225 also be S in that case? Not a big deal as likely to be a tiny number of entries

ianjdouglas commented 4 years ago

Would be good to get opinion from @LiamSmeeth, @CarolineMorton and @amirmehrkar about how the codes below are used in practice. Are they agnostic to status e.g. could be qualified by a "no" or "none" or are they definite smokers? Currently all classed as current smokers.

137.. | [Tobacco consumption] or [smoker - amount smoked]
137Z. | Tobacco consumption NOS
Ub0oo | Tobacco smoking behaviour
Ub1nZ | Tobacco use and exposure
Ub1tI | Cigarette consumption
XE0og | Tobacco smoking consumption
XE0sl | Tobacco consumptn: [non-smoker] or [smoker - amount smoked
ZV4K0 | [V]Tobacco use

alexwalkerepi commented 4 years ago

@ianjdouglas This was flagged earlier in the issue by Chris, and it is noted in the definition:

If feasible: Where a patient's most recent record has a numeric value relating to the number of cigarettes smoked, and this value is 0, change the status code from S to N before running the below algorithm.

However this bit isn't currently implemented, due to trying to do things quickly. You're probably right that there are some non-smokers mixed into the smoker category. I'll have a look at whether it's currently possible to implement this. If not then in the short term we can exclude these codes, and in the medium term we can ask Dave to help with making it doable.

alexwalkerepi commented 4 years ago

I've added a new category to the codelist: X where it's a code with an associated numeric value that is likely to correspond to "number of cigarettes smoked" or similar. For the codes that @ianjdouglas identified that don't have a numeric value associated with them, I've just excluded those codes: Smoking_Codes_added_X_category.xlsx

The logic to be used now is:

{
    "S": """
        most_recent_smoking_code = 'S' OR (
          most_recent_smoking_code = 'X' AND most_recent_smoking_numeric > 0
        )
    """,
    "E": """
         most_recent_smoking_code = 'E' OR (
           most_recent_smoking_code = 'N' AND ever_smoked
         )
    """,
    "N": """
        (
          most_recent_smoking_code = 'N' OR (
            most_recent_smoking_code = 'X' AND most_recent_smoking_numeric = 0
          )
        ) AND NOT ever_smoked
    """,
    "M": "DEFAULT"
},

This should have multiple effects:

  1. Remove non-smokers classified as smokers (both due to excluding codes and using the extra numeric value logic)
  2. Remove non-smokers classified as ex-smokers, due to the uncertain codes no longer being taken into account
  3. Correctly classify non-smokers who are coded with a numeric value, but that value is 0
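
The expressions above can be read as the following minimal Python sketch for a single patient. The `SmokingRecord` type and the `smoking_status` function are hypothetical names for illustration only; `ever_smoked` is assumed here to mean "any S or E code anywhere in the record", and ties between records on the same date are not resolved.

```python
from datetime import date
from typing import NamedTuple, Optional


class SmokingRecord(NamedTuple):
    recorded_on: date
    category: str                    # 'S', 'E', 'N' or 'X' from the codelist
    numeric: Optional[float] = None  # value attached to an 'X' code, if any


def smoking_status(records):
    """Return 'S', 'E', 'N' or 'M' for one patient's smoking records."""
    if not records:
        return "M"
    # Most recent record; same-day ties are picked arbitrarily in this sketch.
    latest = max(records, key=lambda r: r.recorded_on)
    # Assumption: ever_smoked means any smoker or ex-smoker code in the record.
    ever_smoked = any(r.category in ("S", "E") for r in records)
    x_value = latest.numeric if latest.category == "X" else None

    if latest.category == "S" or (x_value is not None and x_value > 0):
        return "S"
    if latest.category == "E" or (latest.category == "N" and ever_smoked):
        return "E"
    is_zero_x = x_value is not None and x_value == 0
    if (latest.category == "N" or is_zero_x) and not ever_smoked:
        return "N"
    return "M"  # DEFAULT
```

One consequence worth noting: under the expressions as written, a patient whose latest record is an X code with value 0 but who has an earlier S or E code matches none of the three rules and falls through to M rather than E.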

ianjdouglas commented 4 years ago

Thanks @alexwalkercebm. Just to check, should the bottom part of the code read:

"N": """
    most_recent_smoking_code = 'N' OR (
      most_recent_smoking_code = 'X' AND most_recent_smoking_numeric = 0
    ) AND NOT ever_smoked
""",
"M": "DEFAULT"

},

ianjdouglas commented 4 years ago

Sorry tried to highlight in bold italics but didn't work. Basically, should most_recent_smoking_numeric be changed to most_recent_smoking_code in the first half of the line?

alexwalkerepi commented 4 years ago

Yes it should @ianjdouglas I also got the logic slightly wrong and needed an extra set of brackets, have updated above.

hjforbes commented 4 years ago

As discussed, I've added a variable called "clear" to the spreadsheet, where 1 = clear smoking code and 0 = ambiguous smoking code: Smoking_Codes_added_X_category_HF.xlsx

Objective is to prioritise clear==1 codes, then use clear==0 if needed.

The ambiguous codes essentially relate to smoking cessation advice, treatment, referral or monitoring; these patients may be currently smoking, or may be an ex-smoker:

I have also flagged in the HF notes column one code which may need reclassifying: "Smoking free weeks" is currently "Smoker", perhaps change to "Ex-smoker"?

Looking at CPRD GOLD data, smoking cessation codes largely appear in records of those looking like "Current smokers", but also appear in records of "Never smokers" and "Ex-smokers". Will try to work out how to summarise this better for you tomorrow.

krishnanbhaskaran commented 4 years ago

Brilliant - thanks Harriet. I think that is a fair breakdown of the more/less clear codes.

Building this into a possible algorithm to resolve conflicts... what about:

If most recent smoking info presents multiple codes on the same date:

1) prioritise codes that are "clear=1" over "clear=0"
2) if after this, there are multiple conflicting codes that have "clear=1" (or no clear codes, but multiple codes with "clear=0"), I would suggest resolve as:

S+E = S (i.e. assume current if current+ex records)
S+N = M (i.e. do not try to infer anything if current+never records)
E+N = E (i.e. assume ex if ex+never records)
S+E+N = M (i.e. do not try to infer anything if all classifications present)
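
As a rough illustration of the two steps above, here is a minimal Python sketch. The function name `resolve_same_day` and the `(category, clear)` input shape are hypothetical, and it assumes only S, E and N categories reach this step; any other combination falls back to M.

```python
# Conflict rules from the list above, keyed by the set of categories present.
CONFLICT_RULES = {
    frozenset("SE"): "S",
    frozenset("SN"): "M",
    frozenset("EN"): "E",
    frozenset("SEN"): "M",
}


def resolve_same_day(codes):
    """Resolve conflicting codes recorded on the most recent date.

    codes: list of (category, clear) pairs, with clear being 1 or 0.
    Returns (category, based_on_clear), where the second item is the
    flag suggested for later sensitivity analyses.
    """
    if not codes:
        return "M", False
    clear_categories = {category for category, clear in codes if clear == 1}
    based_on_clear = bool(clear_categories)
    # Step 1: use clear codes if there are any, otherwise all (unclear) codes.
    categories = clear_categories or {category for category, _ in codes}
    if len(categories) == 1:
        return next(iter(categories)), based_on_clear
    # Step 2: apply the conflict rules; unknown combinations default to M.
    return CONFLICT_RULES.get(frozenset(categories), "M"), based_on_clear
```

For instance, `resolve_same_day([("N", 1), ("S", 0)])` keeps only the clear code and returns `("N", True)`.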

Also, could we pull a flag into the final dataset that shows whether the final classification was based on a "clear=1" or "clear=0" code, as this would allow us to look at smoking distributions among those with reliable vs less reliable codes, and potentially do sensitivity analyses among those with reliable codes only?

Thoughts @alexwalkercebm @CarolineMorton @hjforbes @ianjdouglas ?

Could it be implemented @evansd ?

ianjdouglas commented 4 years ago

Thanks v much @hjforbes and @krishnanbhaskaran - agree with the algorithm and your thoughts on codes Harriet. Have additionally flagged 2 further codes "Smoking status between 4 and 52 weeks - Current non-smoker" and "Smoking status at 52 weeks - Current non-smoker" as more likely Ex than Never here: Copy of Smoking_Codes_added_X_category_HF_ID.xlsx

hjforbes commented 4 years ago

@krishnanbhaskaran sounds sensible and I agree that a sense analysis on reliable codes only is good. I'm going to apply your algorithm in CPRD data now, to see what the distribution looks like.

Three comments:

  1. I'm not clear about the timing of the codes: i.e. "most recent" versus "ever". Are you envisaging only using "most recent"?

  2. what about a further sense analysis restricting to more recently recorded smoking codes: i.e. "clear=1" codes recorded in last 2 years (say)?

  3. Dose-response among smokers - has a low, medium and high use flag (among smokers only) been considered? If nicotine is protective against developing symptoms when exposed to the virus, we'd maybe expect a dose-response effect.

krishnanbhaskaran commented 4 years ago

Re Harriet's comment 1 above, I think currently the most recent code is being used. This would mean that a most recent code, even if not "clear=1", would trump an earlier "clear" code even if the time difference was small. This may not be what we'd choose ideally.

One solution would be to split the code list into two separate code lists for your "clear" and "unclear" codes. We could then extract the most recent "clear" classification, plus date (using the above prioritisation algorithm to resolve any conflicts on the same day); AND the most recent "unclear" classification (with same-day conflicts similarly resolved) plus date. This would enable us to put more complex prioritisation into Stata (e.g. prioritise clear codes within the last year, even if there's a more recent unclear code, etc) and run sensitivity analyses.
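
A minimal sketch of the kind of downstream prioritisation this would allow, written in Python rather than Stata purely for illustration. The function `choose_status`, its argument shapes and the 365-day default window are assumptions, not an agreed rule.

```python
from datetime import date, timedelta


def choose_status(latest_clear, latest_unclear, index_date, window_days=365):
    """Pick a status from the latest clear and latest unclear records.

    Each record is a (category, recorded_on) pair or None.
    Returns (category, source), where source is 'clear', 'unclear' or None.
    """
    # Prefer a clear code recorded within the window, even if an unclear
    # code is more recent.
    if latest_clear is not None:
        if index_date - latest_clear[1] <= timedelta(days=window_days):
            return latest_clear[0], "clear"
    # Otherwise fall back to whichever record is most recent
    # (the clear record wins if both share a date).
    candidates = [
        (record, source)
        for record, source in ((latest_clear, "clear"), (latest_unclear, "unclear"))
        if record is not None
    ]
    if not candidates:
        return "M", None
    record, source = max(candidates, key=lambda item: item[0][1])
    return record[0], source
```

Changing `window_days` gives a simple way to run the sensitivity analyses mentioned above under different recency thresholds.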

Dose: I doubt we'd get good enough info, based on CPRD...

LiamSmeeth commented 4 years ago

This is great to see - many thanks! Am in surgery today but can try and catch up later BW


hjforbes commented 4 years ago

Your approach sounds good Krishnan - can this be implemented @alexwalkercebm? I'm aware the programmers are stretched right now - we could do most recent smoking code and clear code ONLY, and see what distribution of smoking by age/amount missingness we get? Then assess if we need to apply the next level of complexity to the definition?

FYI, findings from CPRD GOLD (population of N~190K, all over 40 years).

evansd commented 4 years ago

Just a thought on this: I've been coming up with a variety of exotic ways we could implement the various bits of logic above but it strikes me that the problem we're facing here is one of too much coding, rather than too little (presumably due to the qof templates Chris was referring to in the call).

So I wonder if a pragmatic solution would be to encode smoking status using the simple algorithm we first used and the smaller set of unambiguous, frequently used codes. Then we can have a separate column which records whether any of the other codes appear anywhere in the patient's record. That would give us an idea of how many values we might be missing by just using the restricted set.

krishnanbhaskaran commented 4 years ago

So Dave to be clear, is your suggestion that we generate two columns:

1) as per previous extracts: based on most recent record matching the full codelist
2) similar but with the codelist restricted to the "clear==1" codes - i.e. based on the most recent "clear" code.

?

I think that would be fine as a pragmatic first step for something to implement quickly.

The only additional thing would be - is it worth still resolving conflicting codes on the same day, at least in (1) where we know this is happening quite a lot and we're currently effectively picking randomly - I wonder if we could still prioritise "clear" codes where there is a conflict between clear and unclear in (1)?

I think in (2) we'll get far fewer conflicts as most seem to be between a clear and unclear code.

evansd commented 4 years ago

For 2, yes that's exactly what I was thinking. For 1 I was actually thinking of just looking for the unclear codes and recording the latest date. That way we can get a sense of how many of the patients marked as "Missing" in 2 have some sort of smoking code and when that was i.e. how much are we actually losing by ignoring the unclear codes.

krishnanbhaskaran commented 4 years ago

I wonder if we need the date for both (1) and (2) then, in case the latest clear code is very old, so we'd get 4 variables:

1) latest unclear code with date
2) latest clear code with date

?

I think that would then be very flexible in terms of definitions.
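
For illustration, the four variables could be derived from a patient's coded events along these lines. This is a minimal Python sketch: the function name `latest_by_clarity`, the `(recorded_on, category, clear)` event shape and the output keys are all hypothetical, and same-day conflicts are not resolved here.

```python
from datetime import date


def latest_by_clarity(events):
    """Return the latest clear and latest unclear category, each with its date.

    events: iterable of (recorded_on, category, clear) tuples, clear being 1 or 0.
    """
    result = {
        "clear_category": None,
        "clear_date": None,
        "unclear_category": None,
        "unclear_date": None,
    }
    for recorded_on, category, clear in events:
        prefix = "clear" if clear == 1 else "unclear"
        current_date = result[f"{prefix}_date"]
        # Keep the most recent event seen so far for this group.
        if current_date is None or recorded_on > current_date:
            result[f"{prefix}_date"] = recorded_on
            result[f"{prefix}_category"] = category
    return result
```

A patient with no clear codes keeps `None` in both clear fields, which would correspond to "Missing" on the clear-only definition while still showing when an unclear code was last recorded.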

evansd commented 4 years ago

That sounds sensible. Will have a look at doing that.

alexwalkerepi commented 4 years ago

The current logic also uses ever smoked in order to code people who have a non-smoker code most recently, but a smoking code in the past, as ex. Do we not still need to incorporate that?

hjforbes commented 4 years ago

I think that's important Alex, it certainly was in CPRD GOLD...

How is "ever smoked" defined? Ever having a CLEAR code for being a current smoker?

alexwalkerepi commented 4 years ago

Currently it's ever having any code for smoking or ex-smoking, but sounds like that should change to CLEAR codes.

evansd commented 4 years ago

Sorry all, there are so many different things going on at the moment and so I think we need to find something we can work with right now that doesn't involve any additional features in the library (i.e. work on my part :smile: )

I think we can still get the four variables which Krishnan outlined, by doing:

  1. The current categorised_as logical expression for smoking_status but with the codelists filtered to include only CLEAR codes.
  2. A separate query to get the date of the latest CLEAR code.
  3. A separate query to get the category of the latest UNCLEAR code, with include_date to get the date also.

@alexwalkercebm Do you think you'd be able to implement that?

It's also worth bearing in mind that the smoking queries are particularly slow and that the above adds two more of them, so it will probably have a noticeable effect on build time. But if we're having to leave it overnight anyway then maybe that doesn't matter.

alexwalkerepi commented 4 years ago

Yes that should be easy enough to do today @evansd

ianjdouglas commented 4 years ago

Maybe over simplistic, but I wonder if clear/unclear is best used only for resolving same day code conflicts.

Almost all unclear codes are currently categorised as S but many could actually mean E (hence they're unclear). If the most recent code is unclear and is the only code on that day and is preceded by a clear code, then pretty much whatever that clear code is won't help determine a more accurate categorisation. If the clear code was E, they could now be smoking again. If it was S, then it's still S. If it was N, then it's likely N was wrong. e.g. clear "current non-smoker" followed 2 years later by unclear "wants to stop smoking"

Agree prior ever smoker still needs to be considered.

@evansd will your suggestion return all entries from the most recent date if >1 entered?