Expose a "clean" therapeutics table

rebkwok commented 3 months ago

The aim is to make the covid therapeutics data consistent with the data cohort-extractor provided, and easier for users to work with.

Current covid_therapeutics_raw table:

remove the onset of symptoms columns
restrict the dataset?

Add a new non-raw table that:

only exposes required fields
removes fully duplicate rows
removes "filler" words from the 3 risk group fields and joins as single risk group field (a comma-separated string)
casts datetimes to date (already done in the raw table). (Cohort-extractor applied a collation to the intervention and currentstatus columns, but according to the database report they already have that collation, so this should be unnecessary.)

Refer to cohort-extractor's implementation: create_therapeutics_table removes the duplicates and builds the comma-separated risk groups (as separate columns). Joining the 3 groups is done here. (Note that we don't need to worry about duplicate risk groups across the 3 risk group columns, because only one of those contains data in any one row.)
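
As a rough illustration of the cleaning steps listed above (dropping fully duplicate rows, stripping filler words and collapsing the 3 risk group columns into one), here is a minimal pure-Python sketch. The column names and the filler phrases are assumptions made up for the example; the real list of filler words and the real column names are in cohort-extractor's implementation and the TPP schema.

```python
# Hypothetical filler phrases and column names, for illustration only.
FILLER_WORDS = ("Patients with ", "Patient with ")
RISK_GROUP_COLUMNS = ("risk_group_a", "risk_group_b", "risk_group_c")


def clean_risk_group(row):
    """Strip filler phrases from each risk group column and join the non-empty results."""
    parts = []
    for column in RISK_GROUP_COLUMNS:
        value = row.get(column) or ""
        for filler in FILLER_WORDS:
            value = value.replace(filler, "")
        value = value.strip()
        if value:
            parts.append(value)
    return ",".join(parts)


def clean_rows(rows):
    """Drop fully duplicate rows, then replace the risk group columns with one field."""
    seen = set()
    cleaned = []
    for row in rows:
        # A row counts as a duplicate only if every column matches an earlier row.
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)
        out = {k: v for k, v in row.items() if k not in RISK_GROUP_COLUMNS}
        out["risk_group"] = clean_risk_group(row)
        cleaned.append(out)
    return cleaned
```

Because only one of the risk group columns holds data in any one row, the join never has to de-duplicate values across columns.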

rebkwok commented 3 months ago

To add the new TPP table in ehrQL:

Add to the Backend as a new QueryTable (here's where the raw one is defined)
Add a table (EventFrame, as the therapeutics table contains multiple rows per patient) with docstring, to tpp/tables.py
Add a backend test similar to the one for the raw table (with relevant input data to check for the collated strings etc)
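
To give a feel for the kind of input data such a test needs, here is a sketch that prototypes the idea against an in-memory SQLite database. This is not the ehrQL test harness or the real QueryTable SQL: the real query targets MSSQL and the TPP schema, and the table name, column names and filler phrase below are all hypothetical.

```python
import sqlite3

# Hypothetical query: de-duplicate over ALL columns first, then expose a subset.
# Concatenating the risk group columns without a separator relies on only one
# of them holding data in any one row.
CLEAN_QUERY = """
SELECT
    patient_id,
    intervention,
    REPLACE(
        COALESCE(risk_group_a, '') || COALESCE(risk_group_b, '') || COALESCE(risk_group_c, ''),
        'Patients with ', ''
    ) AS risk_group
FROM (SELECT DISTINCT * FROM therapeutics_raw)
ORDER BY patient_id, intervention
"""


def run_clean_query(raw_rows):
    """Load raw rows into a throwaway SQLite table and return the cleaned rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE therapeutics_raw ("
        "patient_id INTEGER, intervention TEXT, "
        "risk_group_a TEXT, risk_group_b TEXT, risk_group_c TEXT)"
    )
    conn.executemany("INSERT INTO therapeutics_raw VALUES (?, ?, ?, ?, ?)", raw_rows)
    result = conn.execute(CLEAN_QUERY).fetchall()
    conn.close()
    return result


def test_clean_table_dedupes_and_joins_risk_groups():
    raw_rows = [
        (1, "Molnupiravir", "Patients with renal disease", None, None),
        (1, "Molnupiravir", "Patients with renal disease", None, None),  # full duplicate
        (2, "Sotrovimab", None, "Patients with liver disease", None),
    ]
    assert run_clean_query(raw_rows) == [
        (1, "Molnupiravir", "renal disease"),
        (2, "Sotrovimab", "liver disease"),
    ]
```

The useful part is the shape of the input: at least one fully duplicate row, and risk group values spread across different columns, so that both the de-duplication and the collated string are exercised.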

madwort commented 3 months ago

I think you've got a typo for the link to "backend test similar to the one for the raw table" - just checking you meant https://github.com/opensafely-core/ehrql/blob/b2c675017b046e70e136015eba6d168419f0d72b/tests/integration/backends/test_tpp.py#L666-L711

rebkwok commented 3 months ago

Yes, that's the one

rebkwok commented 3 months ago

@acagreen17 Some questions:

1) Which columns should be exposed in the therapeutics table? Cohort-extractor could return values from the following columns:

covid_indication
intervention
current_status
risk_group (a combination of the 3 separate risk group columns as a comma-separated list)
treatment_start_date
region

Are all of these required in ehrQL? Are there any others that should be queryable?

2) There are some (~35) fully duplicate rows in the database table, which cohort-extractor removed, and we can do the same in ehrQL. However, this may (probably will) leave some duplicates of the selected columns that we expose. Is there a subset of columns where duplicates would definitely constitute actual duplication of data (i.e. someone entered data from the same paper form twice)? If we only remove fully duplicate rows, we should document that the data may contain duplicates and users should take steps to address that.

3) The risk group field is taken from the 3 separate risk group columns, each of which relates to a particular intervention (Sotrovimab, Molnupiravir, Casirivimab & imdevimab). There are now also other interventions (sarilumab, baricitinib, paxlovid and remdesivir), none of which has a corresponding risk group column in the data. Should we just document this in the table docs?
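
On question 2, if we only remove fully duplicate rows, something like the sketch below could go in the docs to show users how to spot the duplicates that remain in the exposed columns. It is plain Python over a list of dicts rather than ehrQL, and the column subset and the `form_id` column used in the example are hypothetical.

```python
from collections import Counter

# Hypothetical subset of exposed columns used to decide what counts as a repeat.
EXPOSED_COLUMNS = ("patient_id", "intervention", "current_status", "treatment_start_date")


def find_residual_duplicates(rows, columns=EXPOSED_COLUMNS):
    """Return the value tuples that occur more than once across `columns`, with counts."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Two rows that differ only in a column we do not expose (hypothetical `form_id`).
example_rows = [
    {"patient_id": 1, "intervention": "Sotrovimab", "current_status": "Approved",
     "treatment_start_date": "2022-01-10", "form_id": "A"},
    {"patient_id": 1, "intervention": "Sotrovimab", "current_status": "Approved",
     "treatment_start_date": "2022-01-10", "form_id": "B"},
    {"patient_id": 2, "intervention": "Molnupiravir", "current_status": "Approved",
     "treatment_start_date": "2022-02-01", "form_id": "C"},
]
residual = find_residual_duplicates(example_rows)
```

Here the first two rows are not fully duplicate in the raw data, so they both survive the clean-up, but they are indistinguishable once `form_id` is hidden.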

acagreen17 commented 3 months ago

1) I think everything available via Cohort-extractor should be available via ehrQL. I don't think we need to make any additional columns available either.
2) From memory, there is a small group of individuals who appear to have been given two drugs around the same time (because they were given the first-line drug but then switched to a different one), and it is difficult to know which one they ended up getting (at least it was difficult when we were first looking at this data). I think it is safer for the researchers to make this call, and for us to just document it as a potential limitation.
3) Yes.

@HelenCEBM might have some useful thoughts on this.

opensafely-core / ehrql

Expose a "clean" therapeutics table #2023