Closed amirmehrkar closed 2 years ago
the schema is reference in the above data ingest template but here it is direct:
Updated with LASTEST 04-11 schema https://docs.google.com/spreadsheets/d/1IhcPWpgHUocFOkCcil0NE2FyyELCyiGd/edit?usp=sharing&ouid=102835591321385431966&rtpof=true&sd=true
I have been informed that the yellow fields are necessary for the current Mental health application (which has been approved) @sebbacon and the rest are highly likely to be necessary. so maybe we should work to ingest all?
I have asked ONS-CIS to discuss direct with TPP to match and send the data to TPP. What other dev implications etc do we need to think about?
Are we in a position to do this with EMIS now too? Ie shall we delay the extract to EMIS temporarily?
Thanks. Alex has been doing a great job helping shape up and scope the data for ingest for ISARIC; if possible we should follow the same process here.
I can do similar again for this, and should be able to get started with the schema linked above.
@amirmehrkar, who do you think is the best ONS contact if I have questions about the data structure or user requirements?
Dan has highlighted the fields yellow I believe now.
@alexwalkerepi suggest email both Daniel.Ayoubkhani@ons.gov.uk - who is doing first study - and caroline.youell@ons.gov.uk who supplied the schema. Probably keep TPP CC in / and edit the excel to clarify things as you find them out?
Dan application is project ID 24 in my (headache!) spreadsheet; I’ll send you link to it on slack as this issue is public and don’t want application form in public yet; the form google link is “public” to allow for group editing.
In consultation with @alexwalkerepi I've looked at the data dictionary.
My current understanding of the data is:
The difficulty is to reduce to one row per patient, we can presumably use field "visit_date" to identify fe the first visit or last visit. But participants could presumably have more than 2 visits, I am not sure what users need/ if my assumption that participants could have multiple visits (and consequently, multiple rows) is correct. Pseudocode and more detail to follow once I understand what visit should be returned.
I agree with Linda's interpretation, that we can do something similar to ISARIC where the user requests the column they want to return. The challenge of multiple rows per patient makes it trickier, especially given that there are multiple date fields that could be used. As well as visit_date
, there are other dates that could plausibly be used to filter the rows, like covid_test_swab_pos_first_date
and sympt_now_date
. Presumably this would mean that the user would also have to specify the date too. something like:
cis_visit_date=patients.with_an_ons_cis_record(
return_column="visit_date",
return_expectations={"incidence": 0.1, "date": {"earliest": "2020-02-01"}}
),
long_covid_cough=patients.with_an_ons_cis_record(
between=["2020-02-01", "2020-06-01"],
date_filter_column="visit_date",
find_first_match_in_period=True,
return_column="long_covid_cough",
return_expectations={
"rate": "universal",
"category": {
"ratios": {
"Yes": 0.7,
"No": 0.1,
"(Missing)": 0.2,
}
},
},
),
isaric_sysbp=patients.with_an_ons_cis_record(
between=["2020-02-01", "2020-06-01"],
date_filter_column="visit_date",
find_first_match_in_period=True,
return_column="ct_mean",
return_expectations={
"incidence": 0.8,
"float": {"distribution": "normal", "mean": 21, "stddev": 5},
},
),
@andrewscolm is analysis restriction added to this too?
Whoever picks this up - liase with @alexwalkerepi
@alexwalkerepi I have a few questions:
1) There are 11 date-type columns in the ONS_CIS data. Should we allow any of them to be specified as the date_filter_column
?
These are the date-type columns:
2) If the returning
value is a date column, do we still want to include the date_filter_column
? As there can be multiple rows per patient, we'll need to return either the first or last match in the period, so we will need to know which column we're using for that matching - we could assume it is the same one for date variables.
The example date variable above doesn't have a date_filter_column
, but would it be possible that you want to e.g. filter by "visit_date", but return "travel_abroad_date"?
3) Is this a restricted dataset like ISARIC, that should fail local runs and github tests?
More questions: 4) Should the table in TPP contain all the yellow-highlighted field from the data spreadsheet? It contains some but not all of the yellow-highlighted fields, and the odd field that isn't highlighted.
This spreadsheet shows all the included fields.
5) There are quite a lot of fields that are coded numerically to categories (more than just yes/no to 1/0). Mostly, the codes are ints, a few are strings. There are a couple of issues here:
a) If the column type is int, missing values will be returned as 0, which could be misinterpreted as a category value. We can get around this by just casting all those fields to strings (so missing values are returned as empty strings).
b) Category codes may be useful in some cases - e.g. it's simpler to filter your extracted dataset by a code (whether that's an int or a string), but they aren't very human-readable.
e.g. covid_test_blood_result
is coded as:
A general thing to note here is that the ONS CIS dataset is more liable to change than e.g. the primary care data, so it's worth leaving room for adding/changing which variables are specified for such things. Simon tackled this recently with the data for #614. I think this involved querying the raw data to get metadata on the types, which makes for easier updating it if the underlying data changes.
date_filter_column
when returning a date, and it may be different from the date being returned.categories
rather than strings? That won't make a difference to output CSVs, but will to binary output formatsI'm using the same general approach as for ISARIC, in that returning
can specific any of the currently available columns. It's less complicated in some ways, because the columns are typed already (ISARIC was all varchar, including dates).
If we convert to longform strings for the queries, the output should be category type automatically.
Thanks, think I have enough info to be getting on with, if any additional variables need to be added in that shouldn't be a big deal
@alexwalkerepi There are some duplicate Visit IDs (120 in total) in the ONS CIS dataset.
Some are complete duplicate rows, which are fine, we can just remove any that are entirely duplicate.
Some are duplicated across patients - i.e. the same visit_id appears with 2 different Patient_IDs. All the ones I've checked were duplicated in all other column values, so it's just like the visit data has been entered twice for two different patients, although I haven't checked all of them. Is it expected that a visit id could be duplicated across patients? I'm not sure exactly how this data is collected, could a single visit_id legimately cover two patients at the same time?
I can clarify the duplicates issue with TPP/ONS, as that does sound odd having two patient IDs for the same data.
On closer inspection, I think it's unlikely the duplicate visit ID across patients is due to simple data entry errors. It tends to happen multiple times for pairs/groups of patients - e.g. patient 1 and patient 2 have rows with the same visit ID, for 4 or 5 separate visits. It would be good to clarify this though.
@alexwalkerepi
Yes good point, we'd still potentially need to use a date_filter_column when returning a date, and it may be different from the date being returned.
In my current implementation, if no date_filter_column
is specified, we filter by the returning
column, if that column is a date column, otherwise by visit date. Is this the most sensible defaults? I'm wondering if that's confusing if you're returning a date variable. E.g. if I'm returning travel_abroad_date
is the most reasonable default that it filters by travel_abroad_date
or by visit_date
?
Should we require users to be explicit and always provide a date_filter_column
(it will be required for selecting first/last in all returning
values except number_of_matches_in_period
)? Or we could always default to "visit_date" (even when returning
is a different date) and state that in the docs.
I think requiring an explicit date_filter_column
would be best, except where returning number_of_matches_in_period
if that's possible? It's possibly sometimes more repetitive, but reduces the chances of misunderstanding how things are filtered.
ONS is now sharing ONS Community Infection Survey (ONS-CIS) data with TPP - see here.
ONS are having regular governance calls with OpenSAFELY to decide which studies to conduct against this ONS-CIS dataset.
Importantly:
We are due to identify a small handful of projects soon.
@alexwalkercebm is on the discussion committee alongside Laurie Tomlinson. @sebbacon is there a particular development workflow we need to use to go from ingestion; backend; analyse?