opensafely / documentation

Documentation for the OpenSAFELY platform
https://docs.opensafely.org
Other
35 stars 6 forks source link

Miscellaneous Data Builder tutorial feedback #999

Closed benbc closed 1 year ago

benbc commented 2 years ago

This isn't terribly coherent, but I wanted to get it down somewhere while it was fresh in my mind.

Several of these notes are about the realism of the code. I understand that it isn't intended to be completely realistic, but I think it's worth removing as many of these roadbumps as possible to make it seem less weird to researchers.

The Python installation instructions (here are wrong. The URL should end #egg=opensafely-databuilder.

Installing the Python package may not work on Windows. For the researcher I was working with it resulted in a too-long-file-path (trying to unpack the file tests/acceptance/comparative_booster_study/codelists/opensafely-covid-identification-in-primary-care-suspected-covid-nonspecific-clinical-assessment.csv into C:\Users\lsh1803144\AppData\Local\Temp\pip-install-eneil7vg\opensafely-databuilder_edf74569c16147e8ad6b040d32231f7d).

I'm unsure about using cohort as a variable name for the population (e.g. here, but throughout). I think that it's strictly correct, but it risks confusion with Cohort Extractor (which maybe used the term loosely) and begs the question of why the method isn't Dataset.set_cohort().

Using date_end here raises interpretation questions that would be avoided if we used date_start instead ("what if they haven't moved house?").

This conversion to the end of the previous month seems really contrived. Can we motivate it at all? Or come up with a more realistic example for date arithmetic? (Which I assume is the reason it's here.)

This variable name is so clumsy that it makes it hard to understand. Maybe last_day_of_month_before_first_hospitalization?

The drop() here is redundant, which makes it harder to understand what is going on.

StevenMaude commented 2 years ago

@benbc: Thanks for writing up your feedback, and linking to specific examples; it's all easy to understand, and I think your suggestions are good :+1:

I'll split the comments into topics.

I've added the documentation-related tasks to #987 and will close this issue when they are complete.

Installation

The Python installation instructions (here are wrong. The URL should end #egg=opensafely-databuilder.

Thanks :eyes: I'll fix this!

Installing the Python package may not work on Windows. For the researcher I was working with it resulted in a too-long-file-path (trying to unpack the file tests/acceptance/comparative_booster_study/codelists/opensafely-covid-identification-in-primary-care-suspected-covid-nonspecific-clinical-assessment.csv into C:\Users\lsh1803144\AppData\Local\Temp\pip-install-eneil7vg\opensafely-databuilder_edf74569c16147e8ad6b040d32231f7d).

I've opened a Data Builder issue for this.

This is another reason that we should be running with opensafely anyway. It's possible; I should probably just try and get https://github.com/opensafely-core/databuilder/pull/703 merged in its current state, even if it's not finalised.

cohort naming

I'm unsure about using cohort as a variable name for the population (e.g. here, but throughout). I think that it's strictly correct, but it risks confusion with Cohort Extractor (which maybe used the term loosely) and begs the question of why the method isn't Dataset.set_cohort().

@CarolineMorton already mentioned this to me, citing that it will prime researchers to think of specific types of study, which is not what we want to do. Her suggestion was to change cohort to study_population; if you have a better idea, feel free to propose it.

Contrived examples

I agree that some of the examples are fairly contrived, particularly earlier on. I tried to impose a constraint of not introducing anything that wasn't previously discussed or in the same page. That makes writing early examples that are still somehow relevant or meaningful a little bit tricky.

Using date_end here raises interpretation questions that would be avoided if we used date_start instead ("what if they haven't moved house?").

I'll look over it again, I might not have had any good reason for using date_start over date_end; maybe that'll simple substitution will just work!

This conversion to the end of the previous month seems really contrived. Can we motivate it at all? Or come up with a more realistic example for date arithmetic? (Which I assume is the reason it's here.)

There's probably a better example we can devise, though I couldn't think of one at the time, and just stuck that there in lieu of better.

This variable name is so clumsy that it makes it hard to understand. Maybe last_day_of_month_before_first_hospitalization?

Yes, that's preferable, if we indeed keep that as is, depending on the above issue with date arithmetic :arrow_up:

The drop() here is redundant, which makes it harder to understand what is going on.

It was to try and show the syntax for method chaining with take and drop, while still keeping that example simple. The other use of take then combines lots of conditions. I'll come up with something else; perhaps including missing values and dropping those? [^1]

[^1]: Incidentally, that prompts another thought I've had: do we have particular opinions on the most idiomatic ways of writing ehrQL? There are definitely cases where there are multiple ways to go at selecting the same rows.

benbc commented 2 years ago

Re cohort naming.

I would suggest just population instead. It's shorter and we're generally avoiding the use of "study" in Data Builder.

benbc commented 2 years ago

For the simplified take/drop example, may be drop based on date?

StevenMaude commented 1 year ago

Moved to opensafely-core/databuilder#1051.