Open wjchulme opened 3 years ago
Update:
It appears that the dates aren't consistent because of daylight saving, rather than being determined by specific patients.variable()
extractor functions.
2021-05-15
.2020-03-27 00:00:00
for GMT and 2021-03-28 01:00:00
for BSTI'm seeing this when importing a feather file as a data.frame in R. I don't know if this behaviour is similar in other languages.
This causes an error when providing custom dummy data as a feather file in the extraction action -- the custom dummy data date variables are date-only, but the dates returned by cohort extractor are (sometimes) date-time, so it throws an error.
A workaround is to check the custom dummy data is consistent with the study definition manually (as I was doing before), making sure to ignore different date types. But this can't be a long-term solution.
Ignore the problem until cohort extractor v2 comes along, where variable types will be more consistent / easier to manage / automatically documented.... maybe!
Yes, they will be! However, it's unlikely v2 will be switched to the recommended way of doing things for everyone before the end of the year
@evansd says:
I think the only reason we using datetimes as opposed to dates at all is to do with pandas limitations. We never export anything as less than day granularity. Assuming UTC for everything might work. I think it depends on whether the consumer code would then try to turn those into local datetimes.
In fact I suspect this is a consumer side issue in the first place as I'm pretty sure we're not embedding any kind of timezone information into the output files at all.
It's not always obvious what format each variable will take (boolean, character, date,...) when data are extracted to a feather/dta file, so it would be nice for this to be documented somewhere so there's no guesswork. For example, are binary flags
0/1
integers orTrue/False
booleans? Are dates character-dates"2020-01-01"
, dates2020-01-01 GMT
or date-times2020-01-01 00:00:00 GMT
?We can't just have a short table saying that every "binary_flag" is boolean, every "date" is date, etc. because:
returning
argument (egpatients.most_recent_bmi
orpatients.categorised_as
) so the type is determined by the variable itself, not areturning
argument.patients.admitted_to_hospital(returning = "date_admitted")
returns2021-05-15 01:00:00 BST
butpatients.with_tpp_vaccination_record(returning="date")
returns2021-05-15 GMT
.~ EDIT: this is due to BST/GMT timezone issues, see comment below.Possible solutions:
patients.py
doc-strings for each variable. This is quite a lot of tedious work to do up-front and then maintain if there are changes.YYYY-MM-DD
. Then we can use doc-strings to document the type for variables without a returning argument, and refer to a table for everything else.Related issue: https://github.com/opensafely-core/cohort-extractor/issues/434