opensafely / tpp-sql-notebook

Immunosuppression and cancer variables and defining the 'shielded patients' group #58

Open alexwalkerepi opened 4 years ago

alexwalkerepi commented 4 years ago

From @hmcd by email:

Krishnan, Caroline, Alex and I had a detailed discussion today about how to consider the separate but overlapping concepts of immunosuppression and cancer. I said I’d try to separate out the codelists this evening...

The ‘big’ exposure groups that we’ve currently got in our study protocol are the underlying health conditions list in the social distancing guidance, plus others we thought perhaps should be considered for it. These include ‘cancer’ and ‘immunosuppression’.

Liam and Ben were very keen on us also being able to define the ‘vulnerable patients’ list, since renamed as the ‘shielded patients’ list. This picks out particular groups, some of which cut across cancer vs immunosuppression. This is tricky without secondary care prescriptions, but somewhat doable if we make assumptions about duration of cancer treatment and ignore treatment for relapses – provided we keep the variables granular enough to be able to pick out certain groups.

Of note, the list is also on its 3rd iteration since 23 March (when it was the ‘highest risk’ groups), so different groupings might be needed to reflect guidance in the next round of these studies.

I know it’s more efficient to interrogate the 20m records fewer times, but I think there’s a strong case for being able to distinguish different aspects of immunosuppression. That would let us define broad concepts of cancer and immunosuppression as well as detailed subsets such as the ‘shielded patients’ list, and adapt to likely changes in the guidance on the most vulnerable groups as we go on, particularly if we want to reuse the code for more than this first study.

Trying to keep it efficient, I still come up with 15 exposure types related to cancer and/or immunosuppression that I think it would be useful to keep separate. We’ve codelists for all of them; the only potential new one to consider is sickle cell disease (which I have and can share). What I’m proposing is that we might want to keep these as separate variables, so that studies can group them differently according to what they are investigating, and in particular so that we keep the level of detail needed to identify the ‘shielded patients’.

The list is below, with codelists against each group. Underneath that is some rationale for why I think we might want these exposures separated. I would have said that 8 (sickle cell disease) is a nice to have, except that it does feature in the ‘shielded patients’ list. It’s also clinically tempting to combine aplastic anaemia and haematological cancer – except that then we can’t describe ‘cancer’ without aplastic anaemia sitting in it.

What are your thoughts? Is this many variables too tough on the SQL extraction? If so, could we put all the codelists together (with variables flagging e.g. spleen==1) and then separate them back out in Stata?

Liam has highlighted that he thinks the shielded patients list is high priority in general – I think given that we have the codelists, it’d be a shame to lose the ability to distinguish these patients. But I’ve never worked with a dataset of 20m patients, so appreciate this may be too much!

All the best, Helen

|   | Exposure | Codelist to map to v3 | Already shared? |
|---|---|---|---|
| 1 | Lung cancer | Krishnan and Helen Strongman’s cancer list – lung | Long list shared – can divide by site and share split into these 3 categories |
| 2 | Haematological cancer | Krishnan and Helen Strongman’s cancer list – haematological | |
| 3 | Other malignant cancer | the rest of Krishnan and Helen Strongman’s cancer list | |
| 4 | Bone marrow transplant | bone_marrow_stemcell_July18 | yes |
| 5 | Aplastic anaemia | aplastic_anaemia_updated_July 18 | yes |
| 6 | Chemotherapy/radiotherapy | chemo_radiotherapy_not_end_Jul18 | yes |
| 7 | Dysplenia | spleen_Jul19 | yes |
| 8 | Sickle cell disease | | no but I have one |
| 9 | HIV | HIV_Jul19 | yes |
| 10 | Organ transplant | organ_tx_simple_July18 | yes |
| 11 | Genetic conditions that increase risk of infections | other_cmi_immuno_updated_Jul18 | yes |
| 12 | Other immunosuppression | nonsp_cmi_immuno_updated_Jul18 | yes |
| 13 | Oral steroids | | yes but being derived separately by Brian McKenna using DM+D, and working out how to implement threshold approximation using number of tablets rather than dose. |
| 14 | Biologics | | |
| 15 | Other immunosuppressants (DMARDS) | | |

Examples of how we would use these in different combinations:

The following have occurred as separate concepts in variations of the guidance on high risk/vulnerable groups: 1, 2, 4, 6, 7 with and without 8, 10, 11. And I would expect 14 will at some point.

Combining categories would allow us to define:

• Cancer (1 OR 2 OR 3)
• No working bone marrow (2 OR 4 OR 5)
• Immunosuppressive condition (5 OR 7 OR 8 OR 9 OR 10 OR 11 OR 12) and discuss whether 4 belongs here?
• Immunosuppressive medication (6 OR 12 OR 13 OR 14 OR 15)
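To make the "combine later" idea concrete, here is a minimal Python sketch of deriving the four composite groups from the granular exposures. It assumes each patient is represented simply as the set of exposure numbers (1 to 15) they have a code for; the function name and the group names are made up for illustration and are not the study's actual variable names.

```python
from typing import Dict, Set

# Composite groups copied from the combinations listed above.
# Keys are hypothetical names; values are exposure numbers from the table.
COMPOSITES: Dict[str, Set[int]] = {
    "cancer": {1, 2, 3},
    "no_working_bone_marrow": {2, 4, 5},
    "immunosuppressive_condition": {5, 7, 8, 9, 10, 11, 12},
    "immunosuppressive_medication": {6, 12, 13, 14, 15},
}


def derive_composites(patient_exposures: Set[int]) -> Dict[str, bool]:
    """Flag each composite group if the patient has any member exposure."""
    return {
        name: bool(patient_exposures & members)
        for name, members in COMPOSITES.items()
    }


# Example: a patient with aplastic anaemia (5) only.
flags = derive_composites({5})
```

Because the composites are just set unions over the granular flags, regrouping when the guidance changes would only mean editing `COMPOSITES`, not re-extracting the data.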

| Inclusion criteria | How could we do it using the exposure categories above? |
|---|---|
| People who have had an organ transplant who remain on long term immune suppression therapy | 10 ever |
| with cancer who are undergoing active chemotherapy or radical radiotherapy for lung cancer | 1, first code within set period of time (e.g. 3-6 months) |
| with cancers of the blood or bone marrow such as leukaemia, lymphoma or myeloma who are at any stage of treatment | 2, first code within set period of time |
| having immunotherapy or other continuing antibody treatments for cancer | haven’t got secondary care prescriptions - are there any types of cancer we could use for this? |
| having other targeted cancer treatments which can affect the immune system, such as protein kinase inhibitors or PARP inhibitors | haven’t got secondary care prescriptions - are there any types of cancer we could use for this? |
| who have had bone marrow or stem cell transplants in the last 6 months, or who are still taking immunosuppression drugs | 4, within last 6 months |
| People with severe respiratory conditions including all cystic fibrosis, severe asthma and severe COPD. Severe asthmatics are those who are frequently prescribed high dose steroid tablets | cystic fibrosis list; asthma or chronic respiratory disease who meet thresholds for oral steroids (13 above) |
| People with rare diseases and inborn errors of metabolism that significantly increase the risk of infections (such as SCID, homozygous sickle cell) | 11 |
| People on immunosuppression therapies sufficient to significantly increase risk of infection | 12 OR 13 OR 14 OR 15 – should check NHS Digital methods for whether 6 should belong here |
| People who are pregnant with significant congenital heart disease | Would be tricky without good indicator of pregnancy status. Maria Peppa at LSHTM has put together thorough congenital heart disease lists if we want them for the future. |

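A rough Python sketch of how a few of these rows might be expressed as rules over the granular exposures. This is only an illustration under stated assumptions: the rule names are invented, "6 months" is approximated as 183 days, the same window is assumed for haematological cancer (the table only says "set period of time"), and the input is assumed to be the date of the first code for each exposure number. Rows that depend on prescriptions or pregnancy status are left out.

```python
from datetime import date, timedelta
from typing import Dict, Optional, Set, Tuple

SIX_MONTHS = timedelta(days=183)  # assumption: approximate 6-month look-back

# Hypothetical rule names -> (exposure numbers, look-back window or None for "ever").
RULES: Dict[str, Tuple[Set[int], Optional[timedelta]]] = {
    "organ_transplant": ({10}, None),
    "lung_cancer_recent": ({1}, SIX_MONTHS),
    "haem_cancer_recent": ({2}, SIX_MONTHS),  # window length is an assumption
    "bone_marrow_transplant_recent": ({4}, SIX_MONTHS),
    "genetic_infection_risk": ({11}, None),
}


def shielding_flags(
    first_dates: Dict[int, Optional[date]], index_date: date
) -> Dict[str, bool]:
    """Evaluate each rule from first-code dates (None or absent = no code)."""
    flags = {}
    for name, (exposures, window) in RULES.items():
        dates = [first_dates.get(e) for e in exposures]
        # Ignore missing dates and anything recorded after the index date.
        dates = [d for d in dates if d is not None and d <= index_date]
        if window is None:
            flags[name] = bool(dates)
        else:
            flags[name] = any(index_date - d <= window for d in dates)
    return flags
```

Note that this does not cover the "still taking immunosuppression drugs" part of the bone marrow transplant row, which would need medication data.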
alexwalkerepi commented 4 years ago

To summarise where to find definitions for each of these variables:

|   | Exposure | Issue |
|---|---|---|
| 1 | Lung cancer | #32 |
| 2 | Haematological cancer | #32 |
| 3 | Other malignant cancer | #32 |
| 4 | Bone marrow transplant | #32 |
| 5 | Aplastic anaemia | #36 |
| 6 | Chemotherapy/radiotherapy | #32 |
| 7 | Dysplenia | #13 |
| 8 | Sickle cell disease | #13 |
| 9 | HIV | #36 |
| 10 | Organ transplant | #31 |
| 11 | Genetic conditions that increase risk of infections | #36 |
| 12 | Other immunosuppression | #36 |
| 13 | Oral steroids | #25 |
| 14 | Biologics | tbc |
| 15 | Other immunosuppressants (DMARDS) | #23 |

alexwalkerepi commented 4 years ago

From @krishnanbhaskaran :

hi Helen,

Thanks for thinking so much about this.

From an epi perspective always good to have the finest breakdown so that we can then combine in a flexible way (esp as guidance may change), but I guess it comes down to practicalities for the programmers? Some of the data quality of these items will be poor (e.g. HIV, chemo, in primary care records) – likely woeful sensitivity but should at least be specific. I don’t know of any decent source of cancer immunotherapy/targeted therapy, and suspect cancer site would be a poor proxy. A decent proportion of haematological cancers may get it? But again the best we can do regarding likelihood of active treatment would probably be having a recent diagnosis.

Note that we’ll need a date for some of these, e.g. date of first mention of a cancer, in order to identify how long ago the cancer was, for (a) identifying cancers in the last X months/year and (b) investigating the role of cancer by recency of diagnosis more generally. With cancer as discussed I would also say pull the actual Read code so that we can map to site in future – we have a mapping from Read 2 to cancer site (encoded in icd).

I for one get muddled between the shielding criteria and the broader “should be particularly careful with social distancing” criteria, so in case others are the same, the relevant lists are here:

shielding: https://digital.nhs.uk/coronavirus/shielded-patient-list

general high risk: https://www.gov.uk/government/publications/covid-19-guidance-on-social-distancing-and-for-vulnerable-people/guidance-on-social-distancing-for-everyone-in-the-uk-and-protecting-older-people-and-vulnerable-adults

best wishes, Krishnan

alexwalkerepi commented 4 years ago

from @StatsFizz

I agree that if possible we should keep the codelists separate with a view to doing more granular analyses in future. It’s really easy to combine codelists to define bigger conditions, but not really feasible to go the other way, once we’ve extracted the data!

However, I guess it depends on the immediate time investment. It may be more time efficient, in the short term, to extract the closest thing to what we want immediately and subsequently do more detailed codelists for more granular conditions. But the neatest way would be to start with the more granular versions. I guess that’s one for Alex and Caroline.

Also, I have a workflow question. I’m a bit unsure about whether the timing forms a part of the codelist definition or not. E.g. if we want “cancer in the last 5 years” and also “cancer ever”, does that get stored as two separate codelists? Or do we have a codelist plus a time definition which forms the data extract? Sorry, I know this is probably a dumb question for those of you used to this workflow…

Fizz

alexwalkerepi commented 4 years ago

from @krishnanbhaskaran

Question of timing is important I think for things like cancer. The issue with specific variables like “cancer last 5y”, “cancer ever” is they are inflexible – so if we want to look at different time windows (6m, 1y, 10y?) we’ve lost the relevant info. Ideally for me it’s better to have date of first mention of cancer, from which any study-specific variable based on time windows from the index date can be derived.

So if separating haematological, lung and “other” out as suggested, we could have 5 variables in total:

  1. date of first haematological malignancy (code as missing if none present)

  2. Read code of first haematological malignancy (so that exact type can be determined later, if needed)

  3. date of first lung cancer (code as missing if none present)

  4. date of first other cancer (code as missing if none present)

  5. Read code of first other cancer (so that site can be determined later, if needed)
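
As an illustration of why a stored date is more flexible than a fixed "last 5y" flag, here is a minimal Python sketch that derives any time-window variable from a date of first mention and an index date. The function names and the example dates are assumptions made for this sketch, not part of the extraction code.

```python
import calendar
from datetime import date
from typing import Dict, Iterable, Optional


def months_before(d: date, months: int) -> date:
    """Return the date `months` calendar months before `d`, clamping the day."""
    total = d.year * 12 + (d.month - 1) - months
    year, month0 = divmod(total, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(d.day, last_day))


def diagnosed_within(first: Optional[date], index: date, months: int) -> bool:
    """True if the first mention falls in the `months` before the index date."""
    if first is None:  # coded as missing: no such diagnosis on record
        return False
    return months_before(index, months) <= first <= index


def window_flags(
    first: Optional[date], index: date, windows: Iterable[int]
) -> Dict[str, bool]:
    """Derive one flag per look-back window (in months) from a single date."""
    return {f"within_{m}m": diagnosed_within(first, index, m) for m in windows}


# Hypothetical index date and first-mention date, for illustration only.
flags = window_flags(date(2018, 1, 10), date(2020, 2, 1), (6, 12, 60))
```

"Cancer ever" is then just a non-missing date, and new windows can be added at analysis time without another extraction.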

Krishnan

alexwalkerepi commented 4 years ago

Hi all, we’ve already started separate issues for most of the conditions Helen listed, and returning a separate column for each codelist should be fine programmatically. I’ve summarised which issue each one is in here, but I’m still working on the definitions for most of them: https://github.com/ebmdatalab/tpp-sql-notebook/issues/58#issuecomment-608335613

Regarding dates, at this point it might be simplest to say that the default is to take the date of the first code in each codelist, unless there are any specific exceptions. Then each analysis can further specify the definition as needed.
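
A small Python sketch of that default, assuming the raw data can be viewed as (patient, code, date) rows. The codes and codelist names below are placeholders invented for the example, not real Read codes or the actual codelists, and the real extraction would be done in SQL rather than like this.

```python
from datetime import date
from typing import Dict, Iterable, Set, Tuple

Event = Tuple[int, str, date]  # (patient_id, code, event_date)


def first_code_per_codelist(
    events: Iterable[Event], codelists: Dict[str, Set[str]]
) -> Dict[Tuple[int, str], Tuple[date, str]]:
    """For each (patient, codelist), keep the earliest matching date and its code."""
    first: Dict[Tuple[int, str], Tuple[date, str]] = {}
    for patient_id, code, event_date in events:
        for name, codes in codelists.items():
            if code not in codes:
                continue
            key = (patient_id, name)
            if key not in first or event_date < first[key][0]:
                first[key] = (event_date, code)
    return first


# Placeholder codelists and events, purely illustrative.
codelists = {"lung_cancer": {"L1", "L2"}, "haem_cancer": {"H1"}}
events = [
    (1, "L2", date(2019, 5, 1)),
    (1, "L1", date(2018, 3, 1)),
    (1, "H1", date(2017, 1, 1)),
    (2, "Z9", date(2016, 1, 1)),  # code not in any codelist
]
result = first_code_per_codelist(events, codelists)
# A missing key, e.g. result.get((2, "lung_cancer")), means "no code present".
```

Keeping the code alongside the date is what allows mapping to cancer site later, as suggested earlier in the thread.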

StatsFizz commented 4 years ago

For cancer, it makes sense to me to take the date of the first code for each sub-list. But I don't know if that's a sensible generic rule across all conditions. Might there be other instances where our "X ever" "X in last 5 years" would be about any instance of "X" in the last 5 years, so we'd want the most recent date? Looking through Helen's list, I'm not sure if this applies to any of them though....

krishnanbhaskaran commented 4 years ago

Yes I think with cancer anything after the first is hard to interpret (very hard to distinguish new cancers, secondaries, repeat codes still referring to the original cancer, etc.). There may be other types of condition that are more sporadic in nature where the most recent date would be of interest (though as you say, not obviously for the clinical conditions in Helen's shielding-based list [or indeed for the general high-risk clinical conditions which (pregnancy aside!) I think are all chronic in nature]).
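
If some conditions did turn out to need the most recent date, one possible way to keep "first code" as the default while allowing exceptions is a per-condition rule. The Python sketch below is only an assumption about how that could look; the condition names are hypothetical and no such sporadic condition has been identified in this thread.

```python
from datetime import date
from typing import Callable, Dict, Iterable, Optional

# Hypothetical exceptions table: which date to keep for each condition.
# Anything not listed falls back to the earliest date.
DATE_RULES: Dict[str, Callable[[Iterable[date]], date]] = {
    "haematological_cancer": min,  # first mention; later codes are hard to interpret
    "sporadic_condition": max,     # placeholder for a condition where recency matters
}


def summary_date(condition: str, code_dates: Iterable[date]) -> Optional[date]:
    """Pick the first or most recent code date according to the condition's rule."""
    dates = list(code_dates)
    if not dates:
        return None  # no codes recorded: leave as missing
    return DATE_RULES.get(condition, min)(dates)
```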