Closed benbc closed 1 year ago
@benbc: Thanks for writing up your feedback, and linking to specific examples; it's all easy to understand, and I think your suggestions are good :+1:
I'll split the comments into topics.
I've added the documentation-related tasks to #987 and will close this issue when they are complete.
The Python installation instructions (here are wrong. The URL should end #egg=opensafely-databuilder.
Thanks :eyes: I'll fix this!
Installing the Python package may not work on Windows. For the researcher I was working with it resulted in a too-long-file-path (trying to unpack the file tests/acceptance/comparative_booster_study/codelists/opensafely-covid-identification-in-primary-care-suspected-covid-nonspecific-clinical-assessment.csv into C:\Users\lsh1803144\AppData\Local\Temp\pip-install-eneil7vg\opensafely-databuilder_edf74569c16147e8ad6b040d32231f7d).
I've opened a Data Builder issue for this.
This is another reason that we should be running with opensafely
anyway. It's possible; I should probably just try and get https://github.com/opensafely-core/databuilder/pull/703 merged in its current state, even if it's not finalised.
cohort
namingI'm unsure about using
cohort
as a variable name for the population (e.g. here, but throughout). I think that it's strictly correct, but it risks confusion with Cohort Extractor (which maybe used the term loosely) and begs the question of why the method isn'tDataset.set_cohort()
.
@CarolineMorton already mentioned this to me, citing that it will prime researchers to think of specific types of study, which is not what we want to do. Her suggestion was to change cohort
to study_population
; if you have a better idea, feel free to propose it.
I agree that some of the examples are fairly contrived, particularly earlier on. I tried to impose a constraint of not introducing anything that wasn't previously discussed or in the same page. That makes writing early examples that are still somehow relevant or meaningful a little bit tricky.
Using
date_end
here raises interpretation questions that would be avoided if we useddate_start
instead ("what if they haven't moved house?").
I'll look over it again, I might not have had any good reason for using date_start
over date_end
; maybe that'll simple substitution will just work!
This conversion to the end of the previous month seems really contrived. Can we motivate it at all? Or come up with a more realistic example for date arithmetic? (Which I assume is the reason it's here.)
There's probably a better example we can devise, though I couldn't think of one at the time, and just stuck that there in lieu of better.
This variable name is so clumsy that it makes it hard to understand. Maybe
last_day_of_month_before_first_hospitalization
?
Yes, that's preferable, if we indeed keep that as is, depending on the above issue with date arithmetic :arrow_up:
The
drop()
here is redundant, which makes it harder to understand what is going on.
It was to try and show the syntax for method chaining with take
and drop
, while still keeping that example simple. The other use of take
then combines lots of conditions. I'll come up with something else; perhaps including missing values and dropping those? [^1]
[^1]: Incidentally, that prompts another thought I've had: do we have particular opinions on the most idiomatic ways of writing ehrQL? There are definitely cases where there are multiple ways to go at selecting the same rows.
Re cohort naming.
I would suggest just population
instead. It's shorter and we're generally avoiding the use of "study" in Data Builder.
For the simplified take/drop example, may be drop based on date?
Moved to opensafely-core/databuilder#1051.
This isn't terribly coherent, but I wanted to get it down somewhere while it was fresh in my mind.
Several of these notes are about the realism of the code. I understand that it isn't intended to be completely realistic, but I think it's worth removing as many of these roadbumps as possible to make it seem less weird to researchers.
The Python installation instructions (here are wrong. The URL should end
#egg=opensafely-databuilder
.Installing the Python package may not work on Windows. For the researcher I was working with it resulted in a too-long-file-path (trying to unpack the file
tests/acceptance/comparative_booster_study/codelists/opensafely-covid-identification-in-primary-care-suspected-covid-nonspecific-clinical-assessment.csv
intoC:\Users\lsh1803144\AppData\Local\Temp\pip-install-eneil7vg\opensafely-databuilder_edf74569c16147e8ad6b040d32231f7d
).I'm unsure about using
cohort
as a variable name for the population (e.g. here, but throughout). I think that it's strictly correct, but it risks confusion with Cohort Extractor (which maybe used the term loosely) and begs the question of why the method isn'tDataset.set_cohort()
.Using
date_end
here raises interpretation questions that would be avoided if we useddate_start
instead ("what if they haven't moved house?").This conversion to the end of the previous month seems really contrived. Can we motivate it at all? Or come up with a more realistic example for date arithmetic? (Which I assume is the reason it's here.)
This variable name is so clumsy that it makes it hard to understand. Maybe
last_day_of_month_before_first_hospitalization
?The
drop()
here is redundant, which makes it harder to understand what is going on.