opensafely-core / cohort-extractor

Cohort extractor tool which can generate dummy data, or real data against OpenSAFELY-compliant research databases
Other
38 stars 13 forks source link

Add documentation on study variable types when outputting to binary files #611

Open wjchulme opened 3 years ago

wjchulme commented 3 years ago

It's not always obvious what format each variable will take (boolean, character, date,...) when data are extracted to a feather/dta file, so it would be nice for this to be documented somewhere so there's no guesswork. For example, are binary flags 0/1 integers or True/False booleans? Are dates character-dates "2020-01-01", dates 2020-01-01 GMT or date-times 2020-01-01 00:00:00 GMT?

We can't just have a short table saying that every "binary_flag" is boolean, every "date" is date, etc. because:

Possible solutions:

Related issue: https://github.com/opensafely-core/cohort-extractor/issues/434

wjchulme commented 3 years ago

Update:

It appears that the dates aren't consistent because of daylight saving, rather than being determined by specific patients.variable() extractor functions.

I'm seeing this when importing a feather file as a data.frame in R. I don't know if this behaviour is similar in other languages.

This causes an error when providing custom dummy data as a feather file in the extraction action -- the custom dummy data date variables are date-only, but the dates returned by cohort extractor are (sometimes) date-time, so it throws an error.

A workaround is to check the custom dummy data is consistent with the study definition manually (as I was doing before), making sure to ignore different date types. But this can't be a long-term solution.

sebbacon commented 3 years ago

Ignore the problem until cohort extractor v2 comes along, where variable types will be more consistent / easier to manage / automatically documented.... maybe!

Yes, they will be! However, it's unlikely v2 will be switched to the recommended way of doing things for everyone before the end of the year

sebbacon commented 3 years ago

@evansd says:

I think the only reason we using datetimes as opposed to dates at all is to do with pandas limitations. We never export anything as less than day granularity. Assuming UTC for everything might work. I think it depends on whether the consumer code would then try to turn those into local datetimes.

In fact I suspect this is a consumer side issue in the first place as I'm pretty sure we're not embedding any kind of timezone information into the output files at all.