Open sebbacon opened 4 years ago
TPP Chris says this can be created using SQL directly from the db using this algorithm: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5217540/
Discussed on call: the plan for the first paper is to use the QoF cluster codes in TPP; for the second paper, either ask for the algorithm to be implemented in TPP or implement the algorithm ourselves in the data.
@sebbacon I have some code from LSHTM in stata on a way to identify smokers but I think that @chris-tpp were possibly going to add this into the data tables. Is that right?
Another option is to use the QoF clusters in the first instance?
@chris-tpp has already prepared a cleaned-up version of the QoF codes for us to use in q1.
They will be building the full algorithm at some point (maybe today) and will also supply the SQL so we could reconstruct it ourselves if we wanted.
Can we have that file @chris-tpp @sebbacon to add here so we can firm up first definition?
DRAFT
DEFINITION: Latest record of: Patients categorised into Never Smoker, Ex-smoker and Current Smoker and Missing as per cleaned QOF code list provided by TPP (may need to think about status at specific dates later on, but not now).
patient_id | smoking_status | date_of_smoking_status |
---|---|---|
1 | Current Smoker | 01-02-2020 |
34 | Ex-Smoker | 8-09-2019 |
144 | Never Smoker | 2-01-2014 |
POTENTIAL BIASES: Worth checking what completeness of the data is using QoF coding - in CPRD there is smoking specific data. If a large amount of missingness then think about additional sources. Current smoker may be more likely to be recorded than never or ex.
CLINICAL SIGN OFF & DATE:
EPIDEMIOLOGY SIGN OFF & DATE:
SHARED WITH WIDER TEAM: Yes/No
FINAL SIGN OFF DATE (and apply label)
Worth checking that if patient recorded as non-smoker, but has a smoking code earlier, they are coded as ex-smoker.
@chris-tpp We have just had a call about this, and we had a few questions about how the smoking algorithm was applied. My understanding is that we are using QOF codes developed by TPP.
Will find out the answer to the prevalence question for smoking coding tomorrow morning. We haven’t applied any algorithm yet - we’ve just pulled a list of codes from QOF and labelled them up as non, ex, smoker with a view to adding the algorithm at some point. Will share in the morning too, when back in the office.
Thank you. Look forward to it.
Hi @chris-tpp just to come back to this, do you have the codes that you are applying to the smokers/non-smokers/ex-smokers? It would be good to add them to this issue for clarity. Thank you
from @chris-tpp by email:
Attached the smoking codes, along with code derivation, category for classification, and numeric status. Some of these will have very low counts but it’s a manageable list so think we classify them all. In brief:
- The codes have been derived from QOF smoking clusters, a high-level SNOMED code, and term text searches on ‘%tobac%’, ‘%smok%’, ‘%ciga%’ and ‘%pipe%’. The QOF cluster ids / names are attached. The high-level SNOMED code used was ‘365980008’ for ‘Tobacco use and exposure – finding’. We have then examined all the results from the text searches and included any appropriate (for example, there were employment codes for pipe-fitter etc… to take out).
- The categories required for the algorithm are E – Ex-smoker, N – non-smoker, S – smoker. We’ve also included a category for D for Delete – these look pretty unhelpful and P for Passive, just for completeness. We do not need these yet.
- Finally, we’ve added a column to indicate if the code can have a numeric value associated with it (for example, for a code with a description “number of cigarettes / day” we need to be careful not to include them as a positive smoker if the recorded value is zero).
Smoking_Codes_With_Categories_And_Numerics.xlsx Smoking QOF Clusters.txt
FINAL
DEFINITION: Latest record of: Patients categorised into Never Smoker, Ex-smoker and Current Smoker and Missing as per cleaned QOF code list provided by TPP (may need to think about status at specific dates later on, but not now).
If feasible: Where a patient's most recent record has a numeric value relating to the number of cigarettes smoked, and this value is 0, change status code from `S` to `N` before running the below algorithm.
- Current Smoker: most recent code is `S`
- Ex-smoker: (most recent code is `E`) OR (most recent code is `N` AND an `S` or `E` code at any point)
- Never Smoker: most recent code is `N` AND doesn't have any `S` or `E` codes at any point
- Missing: no `S`, `E` or `N` smoking codes on record

patient_id | smoking_status | date_of_smoking_status |
---|---|---|
1 | Current Smoker | 01-02-2020 |
34 | Ex-Smoker | 8-09-2019 |
144 | Never Smoker | 2-01-2014 |
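
As a rough illustration only, the definition above could be sketched in Python as follows. The `SmokingRecord` class and `smoking_status` function are hypothetical names introduced here, not part of the TPP data or any existing library, and the sketch assumes each record already carries its `S`/`E`/`N` category from the codelist. Same-day conflicts are not handled; ties are broken by input order.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class SmokingRecord:
    # Hypothetical record shape: one coded smoking event for a patient.
    recorded: date
    category: str                          # "S", "E" or "N" from the codelist
    numeric_value: Optional[float] = None  # e.g. cigarettes per day, if recorded


def smoking_status(records: list[SmokingRecord]) -> str:
    """Return "S", "E", "N" or "M" (missing) for one patient's records."""
    usable = [r for r in records if r.category in {"S", "E", "N"}]
    if not usable:
        return "M"
    usable.sort(key=lambda r: r.recorded)
    latest = usable[-1]
    latest_category = latest.category
    # Pre-step: a most recent "S" record with a numeric value of 0 becomes "N".
    if latest_category == "S" and latest.numeric_value == 0:
        latest_category = "N"
    # "Ever smoked" is any S or E code at any point, after the pre-step.
    categories = [r.category for r in usable[:-1]] + [latest_category]
    ever_smoked = any(c in {"S", "E"} for c in categories)
    if latest_category == "S":
        return "S"
    if latest_category == "E" or ever_smoked:
        return "E"
    return "N"
```

Because the zero-value recode happens before the other rules, a patient whose latest entry is "0 cigarettes" but who has earlier smoking codes comes out as `E` rather than `N`.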
POTENTIAL BIASES: Worth checking what completeness of the data is using QoF coding - in CPRD there is smoking specific data. If a large amount of missingness then think about additional sources. Current smoker may be more likely to be recorded than never or ex.
CLINICAL SIGN OFF & DATE: Caroline Morton @CarolineMorton 7/4/2020 16:05
EPIDEMIOLOGY SIGN OFF & DATE: Alex Walker @alexwalkercebm 7/4/2020 15:41
SHARED WITH WIDER TEAM: Yes
FINAL SIGN OFF DATE: 7/4/2020
Agree with Final algorithm posted by Alex. But a couple of questions on the Excel list. Several codes are for "Smoking status at 4 weeks" or 52 weeks. Are these codes specific to people who are trying to quit? Either way, should the status for lines 144, 145 and 146 be D rather than S, as the description is uninformative? Same for lines 200 and 211. Line 225 has a D and I think is effectively the same as the others I've mentioned. If it is about quitting, should line 229 be E rather than N? Sorry if this has all been gone over!
I'm sure this is fine, but I don't really follow what is meant here: "If feasible, convert patients most recent record where there's a number of cigarettes and it's 0, from S to N"
@StatsFizz sorry this was badly worded. It's in relation to Chris' comment:
Finally, we’ve added a column to indicate if the code can have a numeric value associated with it (for example, for a code with a description “number of cigarettes / day” we need to be careful not to include them as a positive smoker if the recorded value is zero).
Perhaps: "If feasible: Where a patient's most recent record has a numeric value relating to the number of cigarettes smoked, and this value is 0, change status code from S to N."
@ianjdouglas I agree that the examples you give are probably related to quitting and therefore ambiguous between `S` and `E`. Trouble is changing them to `D` would mean that we lose their status entirely if there isn't another smoking code.
Thanks that’s much clearer.
Maybe that should come prior to (& be implemented prior to) the other rules (e.g. current smoker – most recent…); i.e. you don’t want someone whose most recent entry was “smokes 0 cigarettes” to be classed as a non-smoker if their last few entries were about heavy smoking (you’d want them to be ex). Does that make sense?
@StatsFizz yes I agree, have edited in the definition to reflect that.
Thanks @alexwalkercebm, that makes sense. Should line 225 also be S in that case? Not a big deal as likely to be a tiny number of entries
Would be good to get opinion from @LiamSmeeth, @CarolineMorton and @amirmehrkar about how the codes below are used in practice. Are they agnostic to status e.g. could be qualified by a "no" or "none" or are they definite smokers? Currently all classed as current smokers.

Code | Description |
---|---|
137.. | [Tobacco consumption] or [smoker - amount smoked] |
137Z. | Tobacco consumption NOS |
Ub0oo | Tobacco smoking behaviour |
Ub1nZ | Tobacco use and exposure |
Ub1tI | Cigarette consumption |
XE0og | Tobacco smoking consumption |
XE0sl | Tobacco consumptn: [non-smoker] or [smoker - amount smoked |
ZV4K0 | [V]Tobacco use |

@ianjdouglas This was flagged earlier in the issue by Chris, and it is noted in the definition:
If feasible: Where a patient's most recent record has a numeric value relating to the number of cigarettes smoked, and this value is 0, change status code from S to N before running the below algorithm.
However this bit isn't currently implemented, due to trying to do things quickly. You're probably right that there are some non-smokers mixed into the smoker category. I'll have a look at whether it's currently possible to implement this. If not then in the short term we can exclude these codes, and in the medium term we can ask Dave to help with making it doable.
I've added a new category to the codelist: `X`, where it's a code with an associated numeric value that is likely to correspond to "number of cigarettes smoked" or similar. For the codes that @ianjdouglas identified that don't have a numeric value associated with them, I've just excluded those codes:
Smoking_Codes_added_X_category.xlsx
The logic to be used now is:
{
"S": """
most_recent_smoking_code = 'S' OR (
most_recent_smoking_code = 'X' AND most_recent_smoking_numeric > 0
)
""",
"E": """
most_recent_smoking_code = 'E' OR (
most_recent_smoking_code = 'N' AND ever_smoked
)
""",
"N": """
(
most_recent_smoking_code = 'N' OR (
most_recent_smoking_code = 'X' AND most_recent_smoking_numeric = 0
)
) AND NOT ever_smoked
""",
"M": "DEFAULT"
},
This should have multiple effects:
0
Thanks @alexwalkercebm. Just to check, should the bottom part of the code read:
"N": """
most_recent_smoking_code = 'N' OR (
most_recent_smoking_**_code_** = 'X' AND most_recent_smoking_numeric = 0
) AND NOT ever_smoked
""",
"M": "DEFAULT"
},
Sorry, tried to highlight in bold italics but it didn't work. Basically, should most_recent_smoking_numeric be changed to most_recent_smoking_code in the first half of the line?
Yes it should, @ianjdouglas. I also got the logic slightly wrong and needed an extra set of brackets; I have updated it above.
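
For anyone wondering why the extra brackets matter: in SQL-style expressions AND binds more tightly than OR, and Python's `and`/`or` follow the same rule, so a small Python sketch can show the difference. The two function names below are made up for illustration; only the variable names mirror the logic above.

```python
def never_smoker_unbracketed(code: str, numeric: float, ever_smoked: bool) -> bool:
    # Without the outer brackets this parses as:
    #   code == "N" or ((code == "X" and numeric == 0) and not ever_smoked)
    return code == "N" or (code == "X" and numeric == 0) and not ever_smoked


def never_smoker_bracketed(code: str, numeric: float, ever_smoked: bool) -> bool:
    # With the outer brackets, "not ever_smoked" applies to both branches.
    return (code == "N" or (code == "X" and numeric == 0)) and not ever_smoked


# A patient whose most recent code is N but who has an earlier S or E code.
print(never_smoker_unbracketed("N", 0, True))  # True
print(never_smoker_bracketed("N", 0, True))    # False
```

Without the brackets, such a patient would satisfy the `N` expression as well as the `E` expression.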
As discussed, I've added a variable called "clear" to the spreadsheet, where 1 = clear smoking code and 0 = ambiguous smoking code:
Smoking_Codes_added_X_category_HF.xlsx
Objective is to prioritise clear==1 codes, then use clear==0 if needed.
The ambiguous codes essentially relate to smoking cessation advice, treatment, referral or monitoring; these patients may be currently smoking, or may be an ex-smoker:
I have also flagged in the HF notes column one code which may need reclassifying: "Smoking free weeks" is currently "Smoker", perhaps change to "Ex-smoker"?
Looking at CPRD GOLD data, smoking cessation codes largely appear in records of those looking like "Current smokers", but also appear in records of "Never smokers" and "Ex-smokers". Will try to work out how to summarise this better for you tomorrow.
Brilliant - thanks Harriet. I think that is a fair breakdown of the more/less clear codes.
Building this into a possible algorithm to resolve conflicts... what about:
If most recent smoking info presents multiple codes on the same date:
1) prioritise codes that are "clear=1" over "clear=0"
2) if after this, there are multiple conflicting codes that have "clear=1" (or no clear codes, but multiple codes with "clear=0"), I would suggest resolve as:
- S+E = S (i.e. assume current if current+ex records)
- S+N = M (i.e. do not try to infer anything if current+never records)
- E+N = E (i.e. assume ex if ex+never records)
- S+E+N = M (i.e. do not try to infer anything if all classifications present)
Also, could we pull a flag into the final dataset that shows whether the final classification was based on a "clear=1" or "clear=0" code, as this would allow us to look at smoking distributions among those with reliable vs less reliable codes, and potentially do sensitivity analyses among those with reliable codes only?
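
A minimal Python sketch of the same-day resolution proposed above, including the clear/unclear flag, might look like this. The `resolve_same_day` function and the `(category, clear)` pair format are assumptions made for illustration and do not correspond to anything in the existing extraction code.

```python
from typing import Optional

# Resolution of conflicting categories on the same date, as proposed above.
RESOLUTION = {
    frozenset("S"): "S",
    frozenset("E"): "E",
    frozenset("N"): "N",
    frozenset("SE"): "S",   # current + ex: assume current
    frozenset("SN"): "M",   # current + never: do not infer anything
    frozenset("EN"): "E",   # ex + never: assume ex
    frozenset("SEN"): "M",  # all three present: do not infer anything
}


def resolve_same_day(codes: list[tuple[str, int]]) -> tuple[str, Optional[int]]:
    """Resolve (category, clear) pairs recorded on the most recent date.

    Returns the resolved category and a flag saying whether it was based on
    clear (1) or unclear (0) codes; the flag is None when nothing is usable.
    """
    usable = [(cat, clear) for cat, clear in codes if cat in {"S", "E", "N"}]
    if not usable:
        return "M", None
    # Step 1: prioritise clear codes over unclear ones.
    clear_categories = [cat for cat, clear in usable if clear == 1]
    used_clear = 1 if clear_categories else 0
    candidates = clear_categories or [cat for cat, _ in usable]
    # Step 2: resolve any remaining conflict with the table above.
    return RESOLUTION[frozenset(candidates)], used_clear
```

For example, `resolve_same_day([("S", 0), ("E", 1)])` gives `("E", 1)`, because the clear ex-smoker code takes priority over the unclear smoker code.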
Thoughts @alexwalkercebm @CarolineMorton @hjforbes @ianjdouglas ?
Could it be implemented @evansd ?
Thanks v much @hjforbes and @krishnanbhaskaran - agree with the algorithm and your thoughts on codes Harriet. Have additionally flagged 2 further codes "Smoking status between 4 and 52 weeks - Current non-smoker" and "Smoking status at 52 weeks - Current non-smoker" as more likely Ex than Never here: Copy of Smoking_Codes_added_X_category_HF_ID.xlsx
@krishnanbhaskaran sounds sensible and I agree that a sensitivity analysis on reliable codes only is good. I'm going to apply your algorithm in CPRD data now, to see what the distribution looks like.
Three comments:
1) I'm not clear about the timing of the codes: i.e. "most recent" versus "ever". Are you envisaging only using "most recent"?
2) What about a further sensitivity analysis restricting to more recently recorded smoking codes: i.e. "clear=1" codes recorded in last 2 years (say)?
3) Dose-response among smokers - has a low, medium and high use flag (among smokers only) been considered? If nicotine is protective against developing symptoms when exposed to the virus, we'd maybe expect a dose-response effect.
Re Harriet's comment 1 above, I think currently the most recent code is being used. This would mean that a most recent code, even if not "clear=1", would trump an earlier "clear" code even if the time difference was small. This may not be what we'd choose ideally.
One solution would be to split the code list into two separate code lists for your "clear" and "unclear" codes. We could then extract the most recent "clear" classification, plus date (using the above prioritisation algorithm to resolve any conflicts on the same day); AND the most recent "unclear" classification (with same-day conflicts similarly resolved) plus date. This would enable us to put more complex prioritisation into Stata (e.g. prioritise clear codes within the last year, even if there's a more recent unclear code, etc) and run sensitivity analyses.
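
To make the idea concrete, here is a hedged Python sketch of the downstream prioritisation (the thread envisages doing this step in Stata; Python is used here only for illustration). The `Latest` tuple, the `choose_status` function, the `index_date` argument and the 365-day window are all assumptions, not agreed definitions.

```python
from datetime import date, timedelta
from typing import NamedTuple, Optional


class Latest(NamedTuple):
    # Hypothetical shape of one extracted variable pair: category plus date.
    category: str
    recorded: date


def choose_status(
    clear: Optional[Latest],
    unclear: Optional[Latest],
    index_date: date,
    window_days: int = 365,
) -> tuple[str, Optional[int]]:
    """Pick a status from the latest clear and latest unclear classifications.

    Returns (category, flag), where flag is 1 if a clear code was used,
    0 if an unclear code was used, and None if the patient has neither.
    """
    if clear is None and unclear is None:
        return "M", None
    if unclear is None:
        return clear.category, 1
    if clear is None:
        return unclear.category, 0
    cutoff = index_date - timedelta(days=window_days)
    # A clear code inside the window wins even if an unclear code is newer.
    if clear.recorded >= cutoff or clear.recorded >= unclear.recorded:
        return clear.category, 1
    return unclear.category, 0
```

Keeping both classifications and both dates in the extract means rules like this one can be changed for sensitivity analyses without re-running the extraction.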
Dose: I doubt we'd get good enough info, based on CPRD...
This is great to see - many thanks! Am in surgery today but can try and catch up later BW
Your approach sounds good Krishnan - can this be implemented @alexwalkercebm? I'm aware the programmers are stretched right now - we could do most recent smoking code and clear code ONLY, and see what distribution of smoking by age/amount missingness we get? Then assess if we need to apply the next level of complexity to the definition?
FYI, findings from CPRD GOLD (population of N~190K, all over 40 years).
Just a thought on this: I've been coming up with a variety of exotic ways we could implement the various bits of logic above but it strikes me that the problem we're facing here is one of too much coding, rather than too little (presumably due to the qof templates Chris was referring to in the call).
So I wonder if a pragmatic solution would be to encode smoking status using the simple algorithm we first used and the smaller set of unambiguous, frequently used codes. Then we can have a separate column which records whether any of the other codes appear anywhere in the patient's record. That would give us an idea of how many values we might be missing by just using the restricted set.
So Dave, to be clear, is your suggestion that we generate two columns?
1) as per previous extracts: based on most recent record matching the full codelist
2) similar but with the codelist restricted to the "clear==1" codes - i.e. based on the most recent "clear" code.
I think that would be fine as a pragmatic first step for something to implement quickly.
The only additional thing would be - is it worth still resolving conflicting codes on the same day, at least in (1) where we know this is happening quite a lot and we're currently effectively picking randomly - I wonder if we could still prioritise "clear" codes where there is a conflict between clear and unclear in (1)?
I think in (2) we'll get far fewer conflicts as most seem to be between a clear and unclear code.
For 2, yes that's exactly what I was thinking. For 1 I was actually thinking of just looking for the unclear codes and recording the latest date. That way we can get a sense of how many of the patients marked as "Missing" in 2 have some sort of smoking code and when that was i.e. how much are we actually losing by ignoring the unclear codes.
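
As a rough sketch of the check described here, assuming each patient row holds the status derived from clear codes only plus the date of their latest unclear code (or nothing), the count could be computed like this. The function name and the row format are hypothetical.

```python
from datetime import date
from typing import Optional


def missing_with_unclear(
    patients: list[tuple[str, Optional[date]]],
) -> tuple[int, int]:
    """Count patients marked Missing, and how many of those have an unclear code.

    Each row is (status from clear codes only, date of latest unclear code or None).
    """
    unclear_dates_of_missing = [d for status, d in patients if status == "M"]
    n_missing = len(unclear_dates_of_missing)
    n_with_unclear = sum(d is not None for d in unclear_dates_of_missing)
    return n_missing, n_with_unclear
```

The second number, relative to the first, gives a sense of how much is lost by ignoring the unclear codes; the dates themselves show how old that information is.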
I wonder if we need the date for both (1) and (2) then, in case the latest clear code is very old, so we'd get 4 variables?
1) latest unclear code with date
2) latest clear code with date
I think that would then be very flexible in terms of definitions.
That sounds sensible. Will have a look at doing that.
The current logic also uses `ever smoked` in order to code people who have a non-smoker code most recently, but a smoking code in the past, as ex. Do we not still need to incorporate that?
I think that's important Alex, it certainly was in CPRD GOLD...
How is "ever smoked" defined? Ever having a CLEAR code for being a current smoker?
Currently it's ever having any code for smoking or ex-smoking, but sounds like that should change to CLEAR codes.
Sorry all, there are so many different things going on at the moment and so I think we need to find something we can work with right now that doesn't involve any additional features in the library (i.e. work on my part :smile: )
I think we can still get the four variables which Krishnan outlined, by doing:
- `categorised_as` logical expression for smoking_status but with the codelists filtered to include only CLEAR codes.
- `include_date` to get the date also.

@alexwalkercebm Do you think you'd be able to implement that?
It's also worth bearing in mind that the smoking queries are particularly slow and that above adds two more of them so it will probably have a noticeable effect on build time. But if we're having to leave it overnight anyway then maybe that doesn't matter.
Yes that should be easy enough to do today @evansd
Maybe over simplistic, but I wonder if clear/unclear is best used only for resolving same day code conflicts.
Almost all unclear codes are currently categorised as S but many could actually mean E (hence they're unclear). If the most recent code is unclear and is the only code on that day and is preceded by a clear code, then pretty much whatever that clear code is won't help determine a more accurate categorisation. If the clear code was E, they could now be smoking again. If it was S, then it's still S. If it was N, then it's likely N was wrong. e.g. clear "current non-smoker" followed 2 years later by unclear "wants to stop smoking"
Agree prior ever smoker still needs to be considered.
@evansd will your suggestion return all entries from the most recent date if >1 entered?
Code for identifying Smokers into categorical data:
@CarolineMorton
@alexwalkercebm - please review
https://github.com/ebmdatalab/tpp-sql-notebook/issues/4