Developing backend functionality to analysis ONS CIS extract

amirmehrkar commented 3 years ago

ONS is now sharing ONS Community Infection Survey (ONS-CIS) data with TPP - see here.

ONS are having regular governance calls with OpenSAFELY to decide which studies to conduct against this ONS-CIS dataset.

Importantly:

ONS is the decision maker on which studies can be analysed against this.
The ONS-CIS data must ONLY be processed if the ONS governance board approve the project

We are due to identify a small handful of projects soon.

@alexwalkercebm is on the discussion committee alongside Laurie Tomlinson. @sebbacon is there a particular development workflow we need to use to go from ingestion; backend; analyse?

BeckyLumbard commented 3 years ago

https://docs.google.com/spreadsheets/d/1fURxrpXv2v7cyrm-tagAwmtEHvKVSOn9FbAPIx3PsTU/edit#gid=0

amirmehrkar commented 3 years ago

the schema is reference in the above data ingest template but here it is direct:

Updated with LASTEST 04-11 schema https://docs.google.com/spreadsheets/d/1IhcPWpgHUocFOkCcil0NE2FyyELCyiGd/edit?usp=sharing&ouid=102835591321385431966&rtpof=true&sd=true

I have been informed that the yellow fields are necessary for the current Mental health application (which has been approved) @sebbacon and the rest are highly likely to be necessary. so maybe we should work to ingest all?

I have asked ONS-CIS to discuss direct with TPP to match and send the data to TPP. What other dev implications etc do we need to think about?

Are we in a position to do this with EMIS now too? Ie shall we delay the extract to EMIS temporarily?

sebbacon commented 3 years ago

Thanks. Alex has been doing a great job helping shape up and scope the data for ingest for ISARIC; if possible we should follow the same process here.

alexwalkerepi commented 3 years ago

I can do similar again for this, and should be able to get started with the schema linked above.

@amirmehrkar, who do you think is the best ONS contact if I have questions about the data structure or user requirements?

amirmehrkar commented 3 years ago

Dan has highlighted the fields yellow I believe now.

@alexwalkerepi suggest email both Daniel.Ayoubkhani@ons.gov.uk - who is doing first study - and caroline.youell@ons.gov.uk who supplied the schema. Probably keep TPP CC in / and edit the excel to clarify things as you find them out?

Dan application is project ID 24 in my (headache!) spreadsheet; I’ll send you link to it on slack as this issue is public and don’t want application form in public yet; the form google link is “public” to allow for group editing.

LindaNab commented 2 years ago

In consultation with @alexwalkerepi I've looked at the data dictionary.

My current understanding of the data is:

Wide data (containing 187 columns)
Given the number of fields, we may need to let users specify what field they want to be returned, similar to what has been done for ISARIC, "I think that we should write functions to return the contents of user specified columns"
14 fields containing dates (filtering may be needed); 17 numeric fields; and 153 fields containing categorical data.
3 character fields with ids. Field "participant_id" contains the unique identifier for database linkage, and I am assuming there could possibly be more rows per participant in the database, which are uniquely identified by field: "visit_id". In addition, "hh_id_fake" identifies the household (pseudonymised); participants can be grouped per household using this field.

The difficulty is to reduce to one row per patient, we can presumably use field "visit_date" to identify fe the first visit or last visit. But participants could presumably have more than 2 visits, I am not sure what users need/ if my assumption that participants could have multiple visits (and consequently, multiple rows) is correct. Pseudocode and more detail to follow once I understand what visit should be returned.

alexwalkerepi commented 2 years ago

I agree with Linda's interpretation, that we can do something similar to ISARIC where the user requests the column they want to return. The challenge of multiple rows per patient makes it trickier, especially given that there are multiple date fields that could be used. As well as visit_date, there are other dates that could plausibly be used to filter the rows, like covid_test_swab_pos_first_date and sympt_now_date. Presumably this would mean that the user would also have to specify the date too. something like:

Date variable

    cis_visit_date=patients.with_an_ons_cis_record(
        return_column="visit_date",
        return_expectations={"incidence": 0.1, "date": {"earliest": "2020-02-01"}}
    ),

Categorical variable:

    long_covid_cough=patients.with_an_ons_cis_record(
        between=["2020-02-01", "2020-06-01"],
        date_filter_column="visit_date",
        find_first_match_in_period=True,
        return_column="long_covid_cough",
        return_expectations={
            "rate": "universal",
            "category": {
                "ratios": {
                    "Yes": 0.7,
                    "No": 0.1,
                    "(Missing)": 0.2,
                }
            },
        },
    ),

Float variable:

    isaric_sysbp=patients.with_an_ons_cis_record(
        between=["2020-02-01", "2020-06-01"],
        date_filter_column="visit_date",
        find_first_match_in_period=True,
        return_column="ct_mean",
        return_expectations={
            "incidence": 0.8,
            "float": {"distribution": "normal", "mean": 21, "stddev": 5},
        },
    ),

amirmehrkar commented 2 years ago

@andrewscolm is analysis restriction added to this too?

sebbacon commented 2 years ago

Whoever picks this up - liase with @alexwalkerepi

rebkwok commented 2 years ago

@alexwalkerepi I have a few questions:

1) There are 11 date-type columns in the ONS_CIS data. Should we allow any of them to be specified as the date_filter_column?

These are the date-type columns:

"covid_date"
"covid_test_blood_neg_last_date"
"covid_test_blood_pos_first_date"
"covid_test_swab_neg_last_date"
"covid_test_swab_pos_first_date"
"received_ox_date"
"result_mk_date"
"samples_taken_date"
"sympt_now_date"
"travel_abroad_date"
"visit_date"

2) If the returning value is a date column, do we still want to include the date_filter_column? As there can be multiple rows per patient, we'll need to return either the first or last match in the period, so we will need to know which column we're using for that matching - we could assume it is the same one for date variables.

The example date variable above doesn't have a date_filter_column, but would it be possible that you want to e.g. filter by "visit_date", but return "travel_abroad_date"?

3) Is this a restricted dataset like ISARIC, that should fail local runs and github tests?

rebkwok commented 2 years ago

More questions: 4) Should the table in TPP contain all the yellow-highlighted field from the data spreadsheet? It contains some but not all of the yellow-highlighted fields, and the odd field that isn't highlighted.

This spreadsheet shows all the included fields.

5) There are quite a lot of fields that are coded numerically to categories (more than just yes/no to 1/0). Mostly, the codes are ints, a few are strings. There are a couple of issues here: a) If the column type is int, missing values will be returned as 0, which could be misinterpreted as a category value. We can get around this by just casting all those fields to strings (so missing values are returned as empty strings). b) Category codes may be useful in some cases - e.g. it's simpler to filter your extracted dataset by a code (whether that's an int or a string), but they aren't very human-readable. e.g. covid_test_blood_result is coded as:

[0] Any tests negative, but none positive;
[1] One or more positive test(s);
[2] Waiting for all results;
- [9] All Tests failed What's the best way to deal with these? We can just return the coded value and link to some lookup table in the docs for this dataset. We could convert them in the extracted dataset to their category values, so we return the longform strings. We could maybe have a flag in the variable function to decode category returning values.

alexwalkerepi commented 2 years ago

A general thing to note here is that the ONS CIS dataset is more liable to change than e.g. the primary care data, so it's worth leaving room for adding/changing which variables are specified for such things. Simon tackled this recently with the data for #614. I think this involved querying the raw data to get metadata on the types, which makes for easier updating it if the underlying data changes.

Yes I think it would be better if any of the columns were able to be used to filter.
Yes good point, we'd still potentially need to use a date_filter_column when returning a date, and it may be different from the date being returned.
Yes this needs to have the same warning/restriction as with other restricted datasets.
It's possible that TPP have stripped out some identifying variables, but also possible that ONS haven't sent all the variables in the spreadsheet. There was some discussion a week or two ago with TPP about additional variables that had been sent but hadn't yet been updated in the DB. I can chase whether this has happened in Slack and copy you in.
I think the ideal would be to return the longform strings but typed as categories rather than strings? That won't make a difference to output CSVs, but will to binary output formats

rebkwok commented 2 years ago

I'm using the same general approach as for ISARIC, in that returning can specific any of the currently available columns. It's less complicated in some ways, because the columns are typed already (ISARIC was all varchar, including dates).

If we convert to longform strings for the queries, the output should be category type automatically.

Thanks, think I have enough info to be getting on with, if any additional variables need to be added in that shouldn't be a big deal

rebkwok commented 2 years ago

@alexwalkerepi There are some duplicate Visit IDs (120 in total) in the ONS CIS dataset.

Some are complete duplicate rows, which are fine, we can just remove any that are entirely duplicate.

Some are duplicated across patients - i.e. the same visit_id appears with 2 different Patient_IDs. All the ones I've checked were duplicated in all other column values, so it's just like the visit data has been entered twice for two different patients, although I haven't checked all of them. Is it expected that a visit id could be duplicated across patients? I'm not sure exactly how this data is collected, could a single visit_id legimately cover two patients at the same time?

alexwalkerepi commented 2 years ago

I can clarify the duplicates issue with TPP/ONS, as that does sound odd having two patient IDs for the same data.

rebkwok commented 2 years ago

On closer inspection, I think it's unlikely the duplicate visit ID across patients is due to simple data entry errors. It tends to happen multiple times for pairs/groups of patients - e.g. patient 1 and patient 2 have rows with the same visit ID, for 4 or 5 separate visits. It would be good to clarify this though.

rebkwok commented 2 years ago

@alexwalkerepi

Yes good point, we'd still potentially need to use a date_filter_column when returning a date, and it may be different from the date being returned.

In my current implementation, if no date_filter_column is specified, we filter by the returning column, if that column is a date column, otherwise by visit date. Is this the most sensible defaults? I'm wondering if that's confusing if you're returning a date variable. E.g. if I'm returning travel_abroad_date is the most reasonable default that it filters by travel_abroad_date or by visit_date?

Should we require users to be explicit and always provide a date_filter_column (it will be required for selecting first/last in all returning values except number_of_matches_in_period)? Or we could always default to "visit_date" (even when returning is a different date) and state that in the docs.

alexwalkerepi commented 2 years ago

I think requiring an explicit date_filter_column would be best, except where returning number_of_matches_in_period if that's possible? It's possibly sometimes more repetitive, but reduces the chances of misunderstanding how things are filtered.

opensafely-core / cohort-extractor