opensafely-core / ehrql

ehrQL: the electronic health record query language for OpenSAFELY
https://docs.opensafely.org/ehrql/

Expose a "clean" therapeutics table #2023

Closed rebkwok closed 3 months ago

rebkwok commented 3 months ago

See slack thread

The aim is to make the covid therapeutics data consistent with the data that cohort-extractor provided, and easier for users to work with.

Current covid_therapeutics_raw table:

Add a new non-raw table that:

Refer to cohort-extractor's implementation: create_therapeutics_table removes the duplicates and handles the comma-separated risk groups (as separate columns). Joining the 3 groups is done here. (Note that we don't need to worry about duplicate risk groups across the 3 risk group columns, because only one of them contains data in any one row.)
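For illustration, the two cleaning steps described above (dropping fully duplicated rows, then joining the 3 risk group columns into one) could be sketched in plain Python roughly as follows. This is a minimal sketch under assumptions, not cohort-extractor's SQL or the eventual ehrQL implementation: the column names (`risk_group_a` etc.), the helper names and the sample rows are all hypothetical.

```python
# Illustrative only: column names, helpers and sample rows are hypothetical,
# not the real TPP schema or cohort-extractor's implementation.
RISK_COLUMNS = ("risk_group_a", "risk_group_b", "risk_group_c")


def drop_full_duplicates(rows):
    """Keep the first occurrence of each fully identical row."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # keys are unique, so values are never compared
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def combine_risk_groups(row):
    """Join the per-intervention risk columns into one comma-separated value.

    No de-duplication across columns is attempted, on the assumption that
    only one of the columns holds data in any one row.
    """
    groups = []
    for column in RISK_COLUMNS:
        value = row.get(column) or ""
        groups.extend(part.strip() for part in value.split(",") if part.strip())
    return ",".join(groups)


def clean_rows(rows):
    """Drop full duplicates, then replace the risk columns with one field."""
    cleaned = []
    for row in drop_full_duplicates(rows):
        out = {k: v for k, v in row.items() if k not in RISK_COLUMNS}
        out["risk_group"] = combine_risk_groups(row)
        cleaned.append(out)
    return cleaned


raw_rows = [
    {"patient_id": 1, "intervention": "Molnupiravir",
     "risk_group_a": "group x, group y", "risk_group_b": None, "risk_group_c": None},
    {"patient_id": 1, "intervention": "Molnupiravir",
     "risk_group_a": "group x, group y", "risk_group_b": None, "risk_group_c": None},
    {"patient_id": 2, "intervention": "Sotrovimab",
     "risk_group_a": None, "risk_group_b": "group z", "risk_group_c": None},
]

cleaned = clean_rows(raw_rows)
```

With the sample rows above, `cleaned` holds one row per distinct input row, each with a single `risk_group` field in place of the three source columns.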

rebkwok commented 3 months ago

To add the new TPP table in ehrQL:

madwort commented 3 months ago

I think you've got a typo for the link to "backend test similar to the one for the raw table" - just checking you meant https://github.com/opensafely-core/ehrql/blob/b2c675017b046e70e136015eba6d168419f0d72b/tests/integration/backends/test_tpp.py#L666-L711

rebkwok commented 3 months ago

Yes, that's the one

rebkwok commented 3 months ago

@acagreen17 Some questions:

1) Which columns should be exposed in the therapeutics table? Cohort-extractor could return values from the following columns:

Are all of these required in ehrQL? Are there any others that should be queryable?

2) There are some (~35) fully duplicate rows in the database table, which cohort-extractor removed, and we can do the same in ehrQL. However, this may (probably will) leave some duplicates of the selected columns that we expose. Is there a subset of columns where duplicates would definitely constitute actual duplication of data (i.e. someone entered data from the same paper form twice)? If we only remove fully duplicate rows, we should document that the data may contain duplicates and users should take steps to address that.

3) The risk group field is taken from the 3 separate risk group columns, each of which relates to a particular intervention (Sotrovimab, Molnupiravir, Casirivimab & imdevimab). There are now also other interventions (sarilumab, baricitinib, paxlovid and remdesivir), none of which has a corresponding risk group column in the data. Should we just document this in the table docs?
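To make the concern in question 2 concrete, here is a small sketch of how removing only fully duplicate rows can still leave duplicates among the exposed columns. The rows are made up, and the fourth field is a hypothetical stand-in for any column that is not exposed.

```python
from collections import Counter

# Hypothetical rows: (patient_id, date, intervention, unexposed_field).
rows = [
    (1, "2022-01-10", "Sotrovimab", "form A"),
    (1, "2022-01-10", "Sotrovimab", "form A"),  # full duplicate of the row above
    (1, "2022-01-10", "Sotrovimab", "form B"),  # differs only in the unexposed field
    (2, "2022-02-01", "Paxlovid", "form C"),
]

# Drop fully duplicate rows, preserving the original order.
unique_rows = list(dict.fromkeys(rows))

# Project onto the exposed columns (the first three fields).
exposed = [row[:3] for row in unique_rows]

# Any projected row that occurs more than once is a leftover duplicate.
remaining_duplicates = {row: n for row, n in Counter(exposed).items() if n > 1}
```

Here `unique_rows` has three rows, but two of them are identical once the unexposed field is dropped, which is the situation users would need to handle themselves.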

acagreen17 commented 3 months ago
  1. I think everything available via cohort-extractor should be available via ehrQL. I don't think we need to make any additional columns available either.

  2. From memory, there is a small group of individuals who appear to have been given two drugs around the same time (because they were given the first-line drug but then switched to a different one), and it is difficult to know which one they ended up getting (at least it was difficult when we were first looking at this data). I think it is safer for the researchers to make this call, and for us just to document it as a potential limitation.

  3. Yes.

@HelenCEBM might have some useful thoughts on this.