HBA1c codes - Githubissues

iaindillingham commented 3 years ago

When two codes are recorded on the same day for a patient in the TPP data, are they recorded as one code or two codes in the data generated by cohortextractor?

TL;DR. One code.

But which code? The code with the most recent consultation date. If two codes have identical consultation dates, then they are sorted by CodedEvent_ID in ascending order and the first code is recorded in the data generated by cohortextractor.

If two codes have identical consultation dates, can we say whether the old code or the new code takes precedence? I don't think we can. If we assume that CodedEvent_ID is an automatically-incrementing primary key (as the database schema suggests), then the first code will be the one with the lowest value of CodedEvent_ID. If a given laboratory always returns codes in the same order, then the first code will be the same each time. However, we don't know whether that code will be the old code or the new code.

Although this issue describes HBA1c codes, it applies anywhere we call patients.with_these_clinical_events in a study definition.

iaindillingham commented 3 years ago

> The code with the most recent consultation date.

This is the default. However, it's configurable: pass patients.with_these_clinical_events(find_first_match_in_period=True, ...) to return the code with the earliest consultation date. (In our case, however, if two codes have identical consultation dates, then the same code will be returned whether we pass find_first_match_in_period=True or rely on the default.)
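To make the selection rule concrete, here is a minimal Python sketch of the behaviour described in this thread. It is not the SQL that cohortextractor generates. The CodedEvent class, the pick_code function and the code values are hypothetical names made up for illustration; only the ordering rule (consultation date, then CodedEvent_ID ascending) comes from the discussion above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CodedEvent:
    """Hypothetical stand-in for one row of coded events for a patient."""
    coded_event_id: int        # stands in for CodedEvent_ID
    consultation_date: date
    code: str


def pick_code(events, find_first_match_in_period=False):
    """Return the single code that would be reported for one patient.

    Assumption: ties on consultation date are broken by the lowest
    CodedEvent_ID, in both the default and the "first match" mode.
    """
    if not events:
        return None
    # Sort by ID first, then apply a stable sort on date, so that the
    # lowest ID comes first among events sharing a consultation date.
    by_id = sorted(events, key=lambda e: e.coded_event_id)
    by_date = sorted(
        by_id,
        key=lambda e: e.consultation_date,
        reverse=not find_first_match_in_period,  # default: most recent first
    )
    return by_date[0].code


# Two made-up codes recorded on the same day for the same patient.
same_day = [
    CodedEvent(2, date(2021, 3, 1), "new_code"),
    CodedEvent(1, date(2021, 3, 1), "old_code"),
]
print(pick_code(same_day))                                   # old_code
print(pick_code(same_day, find_first_match_in_period=True))  # old_code
```

Both calls return the event with the lowest ID, which is why the option makes no difference when the consultation dates are identical.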

HelenCEBM commented 3 years ago

Thanks for investigating!

There are several options for what to return for patients.with_these_clinical_events() and I think the behaviour you are describing occurs when returning code or numeric_value: you can only ever get one, because results are always one-line-per-patient, but which one you will get obeys the rules you described above.

However:

- If returning number_of_matches_in_period you should get 2.
- If returning number_of_episodes you should always get 1 because the minimum episode length you can set is 0 days, so events happening on the same day will count as part of the same 'episode'. But events separated by a day or more can be grouped together here if required by adjusting the episode length.
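As a rough illustration of the difference between the two counts, here is a small Python sketch. It assumes that an episode is a run of events in which each event falls no more than a given number of days after the previous one; that is my reading of the description above, not something checked against the cohortextractor source. The function names and the max_gap_days parameter are hypothetical.

```python
from datetime import date


def number_of_matches(event_dates):
    """Every matching event counts, including several on the same day."""
    return len(event_dates)


def number_of_episodes(event_dates, max_gap_days=0):
    """Count episodes, assuming a new episode starts whenever the gap to
    the previous event is larger than max_gap_days."""
    episodes = 0
    previous = None
    for current in sorted(event_dates):
        if previous is None or (current - previous).days > max_gap_days:
            episodes += 1
        previous = current
    return episodes


# Two events recorded on the same day.
same_day = [date(2021, 3, 1), date(2021, 3, 1)]
print(number_of_matches(same_day))    # 2
print(number_of_episodes(same_day))   # 1

# Events on different days are grouped only if the allowed gap is widened.
spread = [date(2021, 3, 1), date(2021, 3, 2), date(2021, 3, 10)]
print(number_of_episodes(spread, max_gap_days=0))   # 3
print(number_of_episodes(spread, max_gap_days=1))   # 2
```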

iaindillingham commented 3 years ago

As an aside, and for @LFISHER7, there isn't a straightforward way to inspect the SQL that's executed against either the TPP or the EMIS backends and, consequently, unpick a study definition (see opensafely-core/cohort-extractor#539).

For the moment:

1. Clone the cohort-extractor repository
2. Create a new virtual environment (Python 3.7) and pip install -r requirements.txt
3. Within your project directory, activate the above virtual environment
4. Run cohortextractor, noting that you pass the study definition as a dotted path. For example:

# For the TPP backend
TEMP_TABLE_PREFIX= DATABASE_URL=mssql:// cohortextractor dump_cohort_sql --study-definition analysis.study_definition > study_definition_tpp.sql

# For the EMIS backend
EMIS_ORGANISATION_HASH=eoh TEMP_TABLE_PREFIX= DATABASE_URL=presto:// cohortextractor dump_cohort_sql --study-definition analysis.study_definition > study_definition_emis.sql

iaindillingham commented 3 years ago

> If returning number_of_episodes you should always get 1 because the minimum episode length you can set is 0 days, so events happening on the same day will count as part of the same 'episode'. But events separated by a day or more can be grouped together here if required by adjusting the episode length.

Have you come across any documentation for episode_defined_as, which is where I think the episode length is defined, @HelenCEBM?

opensafely / SRO-Measures

HBA1c codes #10