Add documentation on study variable types when outputting to binary files

wjchulme commented 3 years ago

It's not always obvious what format each variable will take (boolean, character, date,...) when data are extracted to a feather/dta file, so it would be nice for this to be documented somewhere so there's no guesswork. For example, are binary flags 0/1 integers or True/False booleans? Are dates character-dates "2020-01-01", dates 2020-01-01 GMT or date-times 2020-01-01 00:00:00 GMT?

We can't just have a short table saying that every "binary_flag" is boolean, every "date" is date, etc. because:

Not all variables have a returning argument (eg patients.most_recent_bmi or patients.categorised_as) so the type is determined by the variable itself, not a returning argument.
~The types aren't always consistent, for instance patients.admitted_to_hospital(returning = "date_admitted") returns 2021-05-15 01:00:00 BST but patients.with_tpp_vaccination_record(returning="date") returns 2021-05-15 GMT.~ EDIT: this is due to BST/GMT timezone issues, see comment below.

Possible solutions:

Add info on return types in the patients.py doc-strings for each variable. This is quite a lot of tedious work to do up-front and then maintain if there are changes.
Force consistency, for instance if we're confident that (for now) we don't need the time component for any of the date variables then all dates can be returned as pure date YYYY-MM-DD. Then we can use doc-strings to document the type for variables without a returning argument, and refer to a table for everything else.
Ignore the problem until cohort extractor v2 comes along, where variable types will be more consistent / easier to manage / automatically documented.... maybe!

wjchulme commented 3 years ago

Update:

It appears that the dates aren't consistent because of daylight saving, rather than being determined by specific patients.variable() extractor functions.

If a date variable contains dates all within the same timezone then variable will not have a time component, eg 2021-05-15.
If the variable includes dates occurring either side of BST/GMT, then it will have a time component, eg 2020-03-27 00:00:00 for GMT and 2021-03-28 01:00:00 for BST

I'm seeing this when importing a feather file as a data.frame in R. I don't know if this behaviour is similar in other languages.

This causes an error when providing custom dummy data as a feather file in the extraction action -- the custom dummy data date variables are date-only, but the dates returned by cohort extractor are (sometimes) date-time, so it throws an error.

A workaround is to check the custom dummy data is consistent with the study definition manually (as I was doing before), making sure to ignore different date types. But this can't be a long-term solution.

sebbacon commented 3 years ago

Ignore the problem until cohort extractor v2 comes along, where variable types will be more consistent / easier to manage / automatically documented.... maybe!

Yes, they will be! However, it's unlikely v2 will be switched to the recommended way of doing things for everyone before the end of the year

sebbacon commented 3 years ago

@evansd says:

I think the only reason we using datetimes as opposed to dates at all is to do with pandas limitations. We never export anything as less than day granularity. Assuming UTC for everything might work. I think it depends on whether the consumer code would then try to turn those into local datetimes.

In fact I suspect this is a consumer side issue in the first place as I'm pretty sure we're not embedding any kind of timezone information into the output files at all.

opensafely-core / cohort-extractor

Add documentation on study variable types when outputting to binary files #611