Open evansd opened 2 years ago
Have created #533 to cover part of the first point.
Having just deliberately tested things out to break them, I'll add an example of a simple case so there's a real example of the current state:
Traceback (most recent call last):
File "databuilder/ehrql-tutorial-examples/3a_multiple2_dataset_definition.py", line 22, in <module>
latest_imd_is_at_least_5000 = latest_imd >= 5000.0
databuilder.query_model.TypeValidationError: GE.rhs requires 'databuilder.query_model.Series[int]' but got 'databuilder.query_model.Series[float]'
In node:
Function.GE(lhs=SelectColumn(source=PickOneRowPerPatient(source=Sort(source=SelectTable(name='patient_address', schema=TableSchema(patientaddress_id=int, date_start=datetime.date, date_end=datetime.date, index_of_multiple_deprivation_rounded=int, has_postcode=bool)), sort_by=SelectColumn(source=SelectTable(name='patient_address', schema=TableSchema(patientaddress_id=int, date_start=datetime.date, date_end=datetime.date, index_of_multiple_deprivation_rounded=int, has_postcode=bool)), name='date_end')), position=Position.LAST), name='index_of_multiple_deprivation_rounded'), rhs=Value(value=5000.0))
It's not actually too bad, except that:
Series
, which as things stand, I'll mention in the documentation, but not sure if we want to expose this at all to users, and keep it entirely to the internal workingsAnother example of a less helpful error:
(please ignore the slightly nonsensical specifics of what I'm actually doing here)
dataset.set_population(
patient_demographics.date_of_birth.is_not_null()
| patient_demographics.date_of_birth.is_null
)
gives:
Traceback (most recent call last):
File "databuilder/ehrql-tutorial-examples/3a_multiple2_dataset_definition.py", line 18, in <module>
patient_demographics.date_of_birth.is_not_null()
databuilder.query_model.TypeValidationError: Or.rhs requires 'databuilder.query_model.Series[bool]' but got 'databuilder.query_model.Series[method]'
In node:
…
The traceback correctly points you to the right hand side of the expression, but the line shown is the left hand side.
Error message for setting variables to things that aren't series is unhelpful.
Dataset().fish = 1
gives
Traceback (most recent call last):
File "/home/benbc/src/opensafely-core/databuilder/tests/unit/test_query_language.py", line 109, in test_accessing_unassigned_variable_gives_helpful_error
Dataset().fish = 1
File "/home/benbc/src/opensafely-core/databuilder/databuilder/query_language.py", line 49, in __setattr__
raise TypeError(
TypeError: Invalid variable 'fish'. Dataset variables must be values not whole rows
Sort before filter gives a key error:
from databuilder.tables.beta.tpp import appointments
last_appointment = appointments \
.sort_by(appointments.start_date) \
.take(appointments.start_date >= "2020-02-02") \
.last_for_patient()
File "/tmp/pytest-of-benbc/pytest-16/test_generate_dataset0/dataset.py", line 11, in <module>
last_appointment = appointments .sort_by(appointments.start_date) .take(appointments.start_date >= "2020-02-02") .last_for_patient()
KeyError: 'take'
From some internal user observation:
Failed to import 'analysis/dataset_definition.py'
See #1401
Another example:
The mistake here is that I'm using --dummy-data
when I meant to use --dummy-tables
. But the error message is very unhelpful:
opensafely exec ehrql:v0 generate-dataset "./analysis/dataset_definition.py" --dummy-data "./example-data"
2023-08-09 12:12:11 [info ] Compiling dataset definition from analysis/dataset_definition.py [ehrql.main]
2023-08-09 12:12:11 [info ] Generating dummy dataset [ehrql.main]
2023-08-09 12:12:11 [info ] Reading dummy data from example-data [ehrql.main]
Traceback (most recent call last):
File "/opt/venv/bin/ehrql", line 8, in <module>
sys.exit(entrypoint())
^^^^^^^^^^^^
File "/app/ehrql/__main__.py", line 49, in entrypoint
return main(sys.argv[1:], environ=os.environ) # pragma: no cover
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/ehrql/__main__.py", line 88, in main
function(**kwargs)
File "/app/ehrql/main.py", line 64, in generate_dataset
generate_dataset_with_dummy_data(
File "/app/ehrql/main.py", line 99, in generate_dataset_with_dummy_data
reader = read_dataset(dummy_data_file, column_specs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/ehrql/file_formats/__init__.py", line 34, in read_dataset
raise ValidationError(f"Unsupported file type: {extension}")
ehrql.file_formats.validation.ValidationError: Unsupported file type:
I think that --dummy-data
only accepts a file, and --dummy-tables
only accepts a path. So we might be able to provide a better error message by checking this before validating the data (and failing with a mysterious error).
(There's another edge case there where naming your directory ends with .csv
then gives you: Missing file: example-data.csv
.)
Opened #1485 for this.
More generally, dummy data file validation errors could be more user friendly, and eliminate confusing parts of the stack trace like we do elsewhere. e.g. a missing column in a dummy data file prints lots of stack trace, but the only part the user needs to look at is the very last line, which has a helpful message:
@rebkwok ➜ /workspaces/ehrql-tutorial (main) $ opensafely exec ehrql:v0 generate-dataset dataset_definition.py --dummy-data-file dummy.csv
2023-11-13 13:09:52 [info ] Compiling dataset definition from dataset_definition.py
2023-11-13 13:09:52 [info ] Generating dummy dataset
2023-11-13 13:09:52 [info ] Reading dummy data from dummy.csv
Traceback (most recent call last):
File "/opt/venv/bin/ehrql", line 8, in <module>
sys.exit(entrypoint())
^^^^^^^^^^^^
File "/app/ehrql/__main__.py", line 67, in entrypoint
return main(sys.argv[1:], environ=os.environ) # pragma: no cover
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/ehrql/__main__.py", line 92, in main
function(**kwargs)
File "/app/ehrql/main.py", line 67, in generate_dataset
generate_dataset_with_dummy_data(
File "/app/ehrql/main.py", line 111, in generate_dataset_with_dummy_data
reader = read_dataset(dummy_data_file, column_specs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/ehrql/file_formats/__init__.py", line 39, in read_dataset
return reader(filename, column_specs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/ehrql/file_formats/base.py", line 18, in __init__
self._validate_basic()
File "/app/ehrql/file_formats/csv.py", line 66, in _validate_basic
for _ in zip(self, range(10)):
File "/app/ehrql/file_formats/csv.py", line 73, in __iter__
validate_columns(headers, self.column_specs.keys())
File "/app/ehrql/file_formats/base.py", line 67, in validate_columns
raise ValidationError(f"Missing columns: {', '.join(missing)}")
ehrql.file_formats.base.ValidationError: Missing columns: asthma_med_code
Whereas for an invalid date format (the dummy data file contained "2023-02-30" in a date column), the error is not so helpful:
@rebkwok ➜ /workspaces/ehrql-tutorial (main) $ opensafely exec ehrql:v0 generate-dataset dataset_definition.py --dummy-data-file dummy.csv
2023-11-13 12:59:36 [info ] Compiling dataset definition from dataset_definition.py
2023-11-13 12:59:36 [info ] Generating dummy dataset
2023-11-13 12:59:36 [info ] Reading dummy data from dummy.csv
Traceback (most recent call last):
File "/app/ehrql/file_formats/csv.py", line 135, in parser
return convertor(value)
^^^^^^^^^^^^^^^^
ValueError: day is out of range for month
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/app/ehrql/file_formats/csv.py", line 77, in __iter__
yield row_parser(row)
^^^^^^^^^^^^^^^
File "/app/ehrql/file_formats/csv.py", line 104, in row_parser
return tuple(parser(row) for parser in parsers)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/ehrql/file_formats/csv.py", line 104, in <genexpr>
return tuple(parser(row) for parser in parsers)
^^^^^^^^^^^
File "/app/ehrql/file_formats/csv.py", line 137, in parser
raise ValueError(f"column {name!r}: {e}")
ValueError: column 'asthma_med_date': day is out of range for month
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/venv/bin/ehrql", line 8, in <module>
sys.exit(entrypoint())
^^^^^^^^^^^^
File "/app/ehrql/__main__.py", line 67, in entrypoint
return main(sys.argv[1:], environ=os.environ) # pragma: no cover
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/ehrql/__main__.py", line 92, in main
function(**kwargs)
File "/app/ehrql/main.py", line 67, in generate_dataset
generate_dataset_with_dummy_data(
File "/app/ehrql/main.py", line 111, in generate_dataset_with_dummy_data
reader = read_dataset(dummy_data_file, column_specs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/ehrql/file_formats/__init__.py", line 39, in read_dataset
return reader(filename, column_specs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/ehrql/file_formats/base.py", line 18, in __init__
self._validate_basic()
File "/app/ehrql/file_formats/csv.py", line 66, in _validate_basic
for _ in zip(self, range(10)):
File "/app/ehrql/file_formats/csv.py", line 79, in __iter__
raise ValidationError(f"row {n}: {e}")
ehrql.file_formats.base.ValidationError: row 1: column 'asthma_med_date': day is out of range for month
Using a case
statement, but forgetting to select a column in the then
part results in a particularly unfriendly error message:
e.g. this is fine
>>> case(when(patients.sex == "male").then(patients.sex))
1 | None
2 | male
3 | male
4 | None
5 | male
6 | None
7 | male
8 | None
9 | None
10 | male
But forgetting to select patients.sex
as the then
clause does this:
>>> case(when(patients.sex == "male").then(patients))
Traceback (most recent call last):
File "/app/ehrql/query_language.py", line 936, in _wrap
cls = REGISTERED_TYPES[type_, is_patient_level]
~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: (<class 'ehrql.tables.beta.core.patients'>, True)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<console>", line 1, in <module>
AssertionError
The error about "unrelated domains" is pretty scary-looking in its verbosity (lots of query model output) and unclear to a novice what to do about it:
Hi, I have an ehrQL query if anyone can help... I'm trying to update our dataset_definition for the next stage of analysis, and its generating an error in Visual Studio that is seemingly linked to an existing piece of code, and not something new that I've added. The error is as shown below. https://github.com/opensafely/end-of-life-carequality/tree/Descriptive_count_edits_MD Thank you!
logs:
2023-12-08 12:35:38 [info ] Compiling dataset definition from analysis/os_reports/dataset_definition_os_reports.py
Failed to import 'analysis/os_reports/dataset_definition_os_reports.py':
Traceback (most recent call last):
File "/workspace/analysis/os_reports/dataset_definition_os_reports.py", line 41, in <module>
practice_registrations.where(practice_registrations.start_date <= dod_ons)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ehrql.query_model.nodes.DomainMismatchError: Attempt to combine unrelated domains:
{Domain(lineage=('PatientDomain', SelectTable(name='practice_registrations', schema=TableSchema(start_date=Column(datetime.date, constraints=(Constraint.NotNull(),)), end_date=Column(datetime.date, constraints=()), practice_pseudo_id=Column(int, constraints=(Constraint.NotNull(),)), practice_stp=Column(str, constraints=(Constraint.Regex(regex='E540000[0-9]{2}'),)), practice_nuts1_region_name=Column(str, constraints=(Constraint.Categorical(values=('North East', 'North West', 'Yorkshire and The Humber', 'East Midlands', 'West Midlands', 'East', 'London',
'South East', 'South West')),)))))), Domain(lineage=('PatientDomain', SelectTable(name='ons_deaths', schema=TableSchema(date=Column(datetime.date, constraints=()), place=Column(str, constraints=(Constraint.Categorical(values=('Care Home', 'Elsewhere', 'Home', 'Hospice', 'Hospital', 'Other communal establishment')),)), underlying_cause_of_death=Column(ehrql.codes.ICD10Code, constraints=()), cause_of_death_01=Column(ehrql.codes.ICD10Code, constraints=()), cause_of_death_02=Column(ehrql.codes.ICD10Code, constraints=()), cause_of_death_03=Column(ehrql.codes.ICD10Code, constraints=()), cause_of_death_04=Column(ehrql.codes.ICD10Code, constraints=()), cause_of_death_05=Column(ehrql.codes.ICD10Code, constraints=()), cause_of_death_06=Column(ehrql.codes.ICD10Code, constraints=()), cause_of_death_07=Column(ehrql.codes.ICD10Code, constraints=()), cause_of_death_08=Column(ehrql.codes.ICD10Code, constraints=()), cause_of_death_09=Column(ehrql.codes.ICD10Code, constraints=()), cause_of_death_10=Column(ehrql.codes.ICD10Code, constraints=()), cause_of_death_11=Column(ehrql.codes.ICD10Code, constraints=()), cause_of_death_12=Column(ehrql.codes.ICD10Code, constraints=()), cause_of_death_13=Column(ehrql.codes.ICD10Code, constraints=()), cause_of_death_14=Column(ehrql.codes.ICD10Code, constraints=()), cause_of_death_15=Column(ehrql.codes.ICD10Code, constraints=())))))}
(source)
Possibly now addressed in https://github.com/opensafely-core/ehrql/pull/1873
This is a grab bag of things that will no doubt need splitting into separate issues once we decide what we're doing here.
There are libraries which improve on Python's default traceback and exception display. Given that we control the context in which the dataset definition file is imported, we can ensure one of these libraries is configured at the point where we import the user's code. One thing we could do would be to hide traceback lines involving databuilder code and point directly to the line of user code which triggered it. But something like
friendly
provides many more options for customising what we show the user. https://github.com/friendly-traceback/friendly-traceback https://github.com/Qix-/better-exceptionsThe query model is the ultimate enforcer of validity, but the errors it throws are often not going to be that helpful to the user. We should consider catching these inside the
query_language._wrap
function and translating them into something more friendly. This will probably involve changing the query model errors to be more structured (i.e. include additional attributes beyond a string message) so ehrQL can translate them better.The refactored ehrQL implementation has a distinct Event and Patient Series class for every type of thing (int, date, CTV3Code etc) you can have in a series. This means that a given series object will only have methods corresponding to valid operations for that series. That's a good thing. However it would be very useful to communicate to the user whether they've just typoed a method name (e.g.
values.sum_for_pateint()
) or whether they've used a method which does exist in general, just not for the specific type in question (e.gsome_date_series.sum_for_patient()
).There's further complexity here because we use attribute access to select columns from frames, so we need to distinguish:
frame.bad_column_name
frame.sortby()
not_sorted_frame.first_for_patient()