Draw the owl - GitHub Issues

iaindillingham commented 1 year ago

As @inglesp has pointed out, the ehrQL tutorial is similar to How to draw an owl.

How to draw an owl

Upon conclusion of the ehrQL tutorial, the reader has created a repo, created (and deleted) a codespace, interacted with the sandbox, created a minimal dataset definition, and generated a dummy dataset that is displayed in the terminal (i.e. it is not written to a file).

To become a competent user of ehrQL, however, the reader should also:

Expand the dataset definition
Write a dummy dataset to a file
Commit the dataset definition to main

Expand the dataset definition

I'd like to check with a couple of researchers about what "expand" most usefully means,^2 but based on this dataset definition, which @alschaffer said was written by her pilots without her help,^1 I think "expand" probably means:

Combining Boolean series to define the population (e.g. was born on or before a date and is alive and is either male or female; was registered with a practice on a date; was registered with a practice for a minimum of k days)
Adding some simple demographic variables, such as age and sex
Adding a complex demographic variable, such as ethnicity (codelist_from_csv)
Adding a complex socioeconomic variable, such as IMD quintile (case)
Deriving a variable, such as counting the number of medications within the last 30 days (.is_in, .is_on_or_between, days, .count_for_patient)
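To make the list above concrete, here is a minimal plain-Python sketch of the row-level logic each item implies. It is deliberately not ehrQL: the `Patient` record, the index date, the codelist contents, and the quintile cut-offs (based on an assumed 32,844 ranked areas) are all hypothetical stand-ins, and the comments only point at the ehrQL features named above.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional, Tuple

INDEX_DATE = date(2023, 10, 1)   # hypothetical index date
N_RANKED_AREAS = 32844           # assumed number of IMD-ranked areas
MEDICATION_CODES = {"A1", "B2"}  # hypothetical codelist contents


@dataclass
class Patient:
    """Hypothetical stand-in for one patient's data."""
    date_of_birth: date
    date_of_death: Optional[date]
    sex: str
    imd_rank: Optional[int]
    # (code, date) pairs, one per medication event
    medications: List[Tuple[str, date]] = field(default_factory=list)


def in_population(p: Patient) -> bool:
    """Combine Boolean conditions, as a population definition would."""
    was_born = p.date_of_birth <= INDEX_DATE
    is_alive = p.date_of_death is None or p.date_of_death > INDEX_DATE
    has_known_sex = p.sex in {"male", "female"}
    return was_born and is_alive and has_known_sex


def age_on_index_date(p: Patient) -> int:
    """Simple demographic variable: age in whole years."""
    dob = p.date_of_birth
    had_birthday = (INDEX_DATE.month, INDEX_DATE.day) >= (dob.month, dob.day)
    return INDEX_DATE.year - dob.year - (0 if had_birthday else 1)


def imd_quintile(rank: Optional[int]) -> str:
    """Categorise a rank into quintiles (the role `case` plays in the list)."""
    if rank is None or not 0 <= rank <= N_RANKED_AREAS:
        return "unknown"
    for quintile in range(1, 6):
        if rank < N_RANKED_AREAS * quintile / 5:
            return str(quintile)
    return "5"  # rank equal to the maximum


def recent_medication_count(p: Patient, window_days: int = 30) -> int:
    """Derived variable: matching medications in the last `window_days`.

    Mirrors the filter-then-count shape of .is_in, .is_on_or_between,
    days, and .count_for_patient.
    """
    start = INDEX_DATE - timedelta(days=window_days)
    return sum(
        1
        for code, when in p.medications
        if code in MEDICATION_CODES and start <= when <= INDEX_DATE
    )


example = Patient(
    date_of_birth=date(1980, 5, 17),
    date_of_death=None,
    sex="female",
    imd_rank=7000,
    medications=[("A1", date(2023, 9, 20)), ("A1", date(2023, 7, 1))],
)
```

In a real dataset definition each function would become a series expression rather than a per-patient function, but the conditions and the filter-then-count shape should carry over.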

Write a dummy dataset to a file

The reader should add an associated action to project.yaml, which they will run with opensafely run [action]. They should compare and contrast run with exec, noticing that exec is good for eyeballing the data but run is good for developing downstream actions, especially when the dummy dataset isn't written to a CSV file.

Commit the dataset definition to `main`

Upon conclusion of the ehrQL tutorial, the reader will be at "Initial commit" and be ready to run the associated action on OpenSAFELY Jobs. (Creating a project and workspace, and using OpenSAFELY Jobs, are out of scope.) Also, they will have created an artefact inside the codespace that persists outside the codespace.

The reader shouldn't commit the dataset definition to a feature branch and open a pull request, because different projects and different organizations have different guidelines about feature branches and pull requests.

sebbacon commented 1 year ago

Regarding "Expand the dataset definition": this reminds me of background research I've been doing in preparation for some Great Variables Library Thinking.

I've asked around a few times (example) what the most common variables are, cross-referenced the answers with a bit of grep-fu, and came up with this tentative list:

age bands (see Andrea docs for example)
ethnicity (of different flavours) (see Colm’s data report work)
IMD
NHS region
sex
bmi (raw number and categories)
smoking
covid infection/hospitalisation/vaccination (at the moment at least)
date of death (patients table vs ONSDeath table)
equivalent of patients.registered_as_of() and patients.registered_with_one_practice_between()
deregistration date
for service analytics we often have practice id
care home residence (how often is the care home variable updated?)
cause of death, ICD-10

Fundamentally, a peer-reviewed and agreed common set of things like this, in the research template, is the core of a variables library. So I'm excited to see this happening!

iaindillingham commented 1 year ago

I'm putting together an extended dataset definition in this gist, with feedback in Slack.^1

iaindillingham commented 1 year ago

Thanks, @sebbacon. At the moment, the expanded dataset definition hits several of those. I don't think it can hit them all, but hitting several suggests that it will be useful.

sebbacon commented 1 year ago

> I don't think it can hit them all

Devil's advocate: why not? If nearly every study includes all of them anyway:

It's didactically useful as it covers all common cases
It's pragmatically useful for the same reason
It helps extend our "best-practice" reach deeper into people's code

iaindillingham commented 1 year ago

Because it's a tutorial and not a how-to. Hitting all of them will make the tutorial longer, which means it will take more time to complete and more time to maintain. I think a more effective use of time would be to incorporate several into the tutorial and the remainder into how-tos, or, indeed, reusable variables.

sebbacon commented 1 year ago

Fair, I think I'm conflating our tutorial with our research template.

It leads me to ask if this part of the tutorial content might also live in the research template?

The familiarity when moving on from the tutorial could be helpful.

iaindillingham commented 1 year ago

It could, but I think that's a separate issue, so I've created opensafely/research-template#108.

opensafely-core / ehrql

Draw the owl #1633
