opensafely-core / ehrql

ehrQL: the electronic health record query language for OpenSAFELY
https://docs.opensafely.org/ehrql/
Other
7 stars 3 forks source link

More helpful error handling in ehrQL #505

Open evansd opened 2 years ago

evansd commented 2 years ago

This is a grab bag of things that will no doubt need splitting into separate issues once we decide what we're doing here.


There are libraries which improve on Python's default traceback and exception display. Given that we control the context in which the dataset definition file is imported, we can ensure one of these libraries is configured at the point where we import the user's code. One thing we could do would be to hide traceback lines involving databuilder code and point directly to the line of user code which triggered it. But something like friendly provides many more options for customising what we show the user. https://github.com/friendly-traceback/friendly-traceback https://github.com/Qix-/better-exceptions


The query model is the ultimate enforcer of validity, but the errors it throws are often not going to be that helpful to the user. We should consider catching these inside the query_language._wrap function and translating them into something more friendly. This will probably involve changing the query model errors to be more structured (i.e. include additional attributes beyond a string message) so ehrQL can translate them better.


The refactored ehrQL implementation has a distinct Event and Patient Series class for every type of thing (int, date, CTV3Code etc) you can have in a series. This means that a given series object will only have methods corresponding to valid operations for that series. That's a good thing. However it would be very useful to communicate to the user whether they've just typoed a method name (e.g. values.sum_for_pateint()) or whether they've used a method which does exist in general, just not for the specific type in question (e.g some_date_series.sum_for_patient()).

There's further complexity here because we use attribute access to select columns from frames, so we need to distinguish:

inglesp commented 2 years ago

Have created #533 to cover part of the first point.

StevenMaude commented 2 years ago

Having just deliberately tested things out to break them, I'll add an example of a simple case so there's a real example of the current state:

Traceback (most recent call last):
  File "databuilder/ehrql-tutorial-examples/3a_multiple2_dataset_definition.py", line 22, in <module>
    latest_imd_is_at_least_5000 = latest_imd >= 5000.0
databuilder.query_model.TypeValidationError: GE.rhs requires 'databuilder.query_model.Series[int]' but got 'databuilder.query_model.Series[float]'
In node:
Function.GE(lhs=SelectColumn(source=PickOneRowPerPatient(source=Sort(source=SelectTable(name='patient_address', schema=TableSchema(patientaddress_id=int, date_start=datetime.date, date_end=datetime.date, index_of_multiple_deprivation_rounded=int, has_postcode=bool)), sort_by=SelectColumn(source=SelectTable(name='patient_address', schema=TableSchema(patientaddress_id=int, date_start=datetime.date, date_end=datetime.date, index_of_multiple_deprivation_rounded=int, has_postcode=bool)), name='date_end')), position=Position.LAST), name='index_of_multiple_deprivation_rounded'), rhs=Value(value=5000.0))

It's not actually too bad, except that:

StevenMaude commented 2 years ago

Another example of a less helpful error:

(please ignore the slightly nonsensical specifics of what I'm actually doing here)

dataset.set_population(
    patient_demographics.date_of_birth.is_not_null()
    | patient_demographics.date_of_birth.is_null
)

gives:

Traceback (most recent call last):
  File "databuilder/ehrql-tutorial-examples/3a_multiple2_dataset_definition.py", line 18, in <module>
    patient_demographics.date_of_birth.is_not_null()
databuilder.query_model.TypeValidationError: Or.rhs requires 'databuilder.query_model.Series[bool]' but got 'databuilder.query_model.Series[method]'
In node:
…

The traceback correctly points you to the right hand side of the expression, but the line shown is the left hand side.

benbc commented 1 year ago

Error message for setting variables to things that aren't series is unhelpful.

Dataset().fish = 1

gives

Traceback (most recent call last):
  File "/home/benbc/src/opensafely-core/databuilder/tests/unit/test_query_language.py", line 109, in test_accessing_unassigned_variable_gives_helpful_error
    Dataset().fish = 1
  File "/home/benbc/src/opensafely-core/databuilder/databuilder/query_language.py", line 49, in __setattr__
    raise TypeError(
TypeError: Invalid variable 'fish'. Dataset variables must be values not whole rows
benbc commented 1 year ago

Sort before filter gives a key error:

from databuilder.tables.beta.tpp import appointments
last_appointment = appointments \
    .sort_by(appointments.start_date) \
    .take(appointments.start_date >= "2020-02-02") \
    .last_for_patient()
  File "/tmp/pytest-of-benbc/pytest-16/test_generate_dataset0/dataset.py", line 11, in <module>
    last_appointment = appointments     .sort_by(appointments.start_date)     .take(appointments.start_date >= "2020-02-02")     .last_for_patient()
KeyError: 'take'
sebbacon commented 1 year ago

From some internal user observation:

sebbacon commented 1 year ago

See #1401

StevenMaude commented 1 year ago

Another example:

The mistake here is that I'm using --dummy-data when I meant to use --dummy-tables. But the error message is very unhelpful:

opensafely exec ehrql:v0 generate-dataset "./analysis/dataset_definition.py" --dummy-data "./example-data"
2023-08-09 12:12:11 [info     ] Compiling dataset definition from analysis/dataset_definition.py [ehrql.main] 
2023-08-09 12:12:11 [info     ] Generating dummy dataset       [ehrql.main] 
2023-08-09 12:12:11 [info     ] Reading dummy data from example-data [ehrql.main] 
Traceback (most recent call last):
  File "/opt/venv/bin/ehrql", line 8, in <module>
    sys.exit(entrypoint())
             ^^^^^^^^^^^^
  File "/app/ehrql/__main__.py", line 49, in entrypoint
    return main(sys.argv[1:], environ=os.environ)  # pragma: no cover
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/ehrql/__main__.py", line 88, in main
    function(**kwargs)
  File "/app/ehrql/main.py", line 64, in generate_dataset
    generate_dataset_with_dummy_data(
  File "/app/ehrql/main.py", line 99, in generate_dataset_with_dummy_data
    reader = read_dataset(dummy_data_file, column_specs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/ehrql/file_formats/__init__.py", line 34, in read_dataset
    raise ValidationError(f"Unsupported file type: {extension}")
ehrql.file_formats.validation.ValidationError: Unsupported file type: 

I think that --dummy-data only accepts a file, and --dummy-tables only accepts a path. So we might be able to provide a better error message by checking this before validating the data (and failing with a mysterious error).

https://github.com/opensafely-core/ehrql/blob/a4f404dd6e26fc5c4a79e643468ae5ec2a7abb18/ehrql/main.py#L91-L106

(There's another edge case there where naming your directory ends with .csv then gives you: Missing file: example-data.csv.)

Opened #1485 for this.

rebkwok commented 11 months ago

More generally, dummy data file validation errors could be more user friendly, and eliminate confusing parts of the stack trace like we do elsewhere. e.g. a missing column in a dummy data file prints lots of stack trace, but the only part the user needs to look at is the very last line, which has a helpful message:

@rebkwok ➜ /workspaces/ehrql-tutorial (main) $ opensafely exec ehrql:v0 generate-dataset dataset_definition.py --dummy-data-file dummy.csv
2023-11-13 13:09:52 [info     ] Compiling dataset definition from dataset_definition.py
2023-11-13 13:09:52 [info     ] Generating dummy dataset
2023-11-13 13:09:52 [info     ] Reading dummy data from dummy.csv
Traceback (most recent call last):
  File "/opt/venv/bin/ehrql", line 8, in <module>
    sys.exit(entrypoint())
             ^^^^^^^^^^^^
  File "/app/ehrql/__main__.py", line 67, in entrypoint
    return main(sys.argv[1:], environ=os.environ)  # pragma: no cover
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/ehrql/__main__.py", line 92, in main
    function(**kwargs)
  File "/app/ehrql/main.py", line 67, in generate_dataset
    generate_dataset_with_dummy_data(
  File "/app/ehrql/main.py", line 111, in generate_dataset_with_dummy_data
    reader = read_dataset(dummy_data_file, column_specs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/ehrql/file_formats/__init__.py", line 39, in read_dataset
    return reader(filename, column_specs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/ehrql/file_formats/base.py", line 18, in __init__
    self._validate_basic()
  File "/app/ehrql/file_formats/csv.py", line 66, in _validate_basic
    for _ in zip(self, range(10)):
  File "/app/ehrql/file_formats/csv.py", line 73, in __iter__
    validate_columns(headers, self.column_specs.keys())
  File "/app/ehrql/file_formats/base.py", line 67, in validate_columns
    raise ValidationError(f"Missing columns: {', '.join(missing)}")
ehrql.file_formats.base.ValidationError: Missing columns: asthma_med_code

Whereas for an invalid date format (the dummy data file contained "2023-02-30" in a date column), the error is not so helpful:

@rebkwok ➜ /workspaces/ehrql-tutorial (main) $ opensafely exec ehrql:v0 generate-dataset dataset_definition.py --dummy-data-file dummy.csv
2023-11-13 12:59:36 [info     ] Compiling dataset definition from dataset_definition.py
2023-11-13 12:59:36 [info     ] Generating dummy dataset
2023-11-13 12:59:36 [info     ] Reading dummy data from dummy.csv
Traceback (most recent call last):
  File "/app/ehrql/file_formats/csv.py", line 135, in parser
    return convertor(value)
           ^^^^^^^^^^^^^^^^
ValueError: day is out of range for month

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/ehrql/file_formats/csv.py", line 77, in __iter__
    yield row_parser(row)
          ^^^^^^^^^^^^^^^
  File "/app/ehrql/file_formats/csv.py", line 104, in row_parser
    return tuple(parser(row) for parser in parsers)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/ehrql/file_formats/csv.py", line 104, in <genexpr>
    return tuple(parser(row) for parser in parsers)
                 ^^^^^^^^^^^
  File "/app/ehrql/file_formats/csv.py", line 137, in parser
    raise ValueError(f"column {name!r}: {e}")
ValueError: column 'asthma_med_date': day is out of range for month

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/venv/bin/ehrql", line 8, in <module>
    sys.exit(entrypoint())
             ^^^^^^^^^^^^
  File "/app/ehrql/__main__.py", line 67, in entrypoint
    return main(sys.argv[1:], environ=os.environ)  # pragma: no cover
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/ehrql/__main__.py", line 92, in main
    function(**kwargs)
  File "/app/ehrql/main.py", line 67, in generate_dataset
    generate_dataset_with_dummy_data(
  File "/app/ehrql/main.py", line 111, in generate_dataset_with_dummy_data
    reader = read_dataset(dummy_data_file, column_specs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/ehrql/file_formats/__init__.py", line 39, in read_dataset
    return reader(filename, column_specs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/ehrql/file_formats/base.py", line 18, in __init__
    self._validate_basic()
  File "/app/ehrql/file_formats/csv.py", line 66, in _validate_basic
    for _ in zip(self, range(10)):
  File "/app/ehrql/file_formats/csv.py", line 79, in __iter__
    raise ValidationError(f"row {n}: {e}")
ehrql.file_formats.base.ValidationError: row 1: column 'asthma_med_date': day is out of range for month
rebkwok commented 10 months ago

Using a case statement, but forgetting to select a column in the then part results in a particularly unfriendly error message:

e.g. this is fine

>>> case(when(patients.sex == "male").then(patients.sex))
 1 | None
 2 | male
 3 | male
 4 | None
 5 | male
 6 | None
 7 | male
 8 | None
 9 | None
10 | male

But forgetting to select patients.sex as the then clause does this:

>>> case(when(patients.sex == "male").then(patients))
Traceback (most recent call last):
  File "/app/ehrql/query_language.py", line 936, in _wrap
    cls = REGISTERED_TYPES[type_, is_patient_level]
          ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: (<class 'ehrql.tables.beta.core.patients'>, True)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<console>", line 1, in <module>
AssertionError
sebbacon commented 10 months ago

The error about "unrelated domains" is pretty scary-looking in its verbosity (lots of query model output) and unclear to a novice what to do about it:

Hi, I have an ehrQL query if anyone can help... I'm trying to update our dataset_definition for the next stage of analysis, and its generating an error in Visual Studio that is seemingly linked to an existing piece of code, and not something new that I've added. The error is as shown below. https://github.com/opensafely/end-of-life-carequality/tree/Descriptive_count_edits_MD Thank you!

   logs:
     2023-12-08 12:35:38 [info     ] Compiling dataset definition from analysis/os_reports/dataset_definition_os_reports.py
     Failed to import 'analysis/os_reports/dataset_definition_os_reports.py':
     Traceback (most recent call last):
       File "/workspace/analysis/os_reports/dataset_definition_os_reports.py", line 41, in <module>
         practice_registrations.where(practice_registrations.start_date <= dod_ons)
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     ehrql.query_model.nodes.DomainMismatchError: Attempt to combine unrelated domains:
     {Domain(lineage=('PatientDomain', SelectTable(name='practice_registrations', schema=TableSchema(start_date=Column(datetime.date, constraints=(Constraint.NotNull(),)), end_date=Column(datetime.date, constraints=()), practice_pseudo_id=Column(int, constraints=(Constraint.NotNull(),)), practice_stp=Column(str, constraints=(Constraint.Regex(regex='E540000[0-9]{2}'),)), practice_nuts1_region_name=Column(str, constraints=(Constraint.Categorical(values=('North East', 'North West', 'Yorkshire and The Humber', 'East Midlands', 'West Midlands', 'East', 'London',
'South East', 'South West')),)))))), Domain(lineage=('PatientDomain', SelectTable(name='ons_deaths', schema=TableSchema(date=Column(datetime.date, constraints=()), place=Column(str, constraints=(Constraint.Categorical(values=('Care Home', 'Elsewhere', 'Home', 'Hospice', 'Hospital', 'Other communal establishment')),)), underlying_cause_of_death=Column(ehrql.codes.ICD10Code, constraints=()), cause_of_death_01=Column(ehrql.codes.ICD10Code, constraints=()), cause_of_death_02=Column(ehrql.codes.ICD10Code, constraints=()), cause_of_death_03=Column(ehrql.codes.ICD10Code, constraints=()), cause_of_death_04=Column(ehrql.codes.ICD10Code, constraints=()), cause_of_death_05=Column(ehrql.codes.ICD10Code, constraints=()), cause_of_death_06=Column(ehrql.codes.ICD10Code, constraints=()), cause_of_death_07=Column(ehrql.codes.ICD10Code, constraints=()), cause_of_death_08=Column(ehrql.codes.ICD10Code, constraints=()), cause_of_death_09=Column(ehrql.codes.ICD10Code, constraints=()), cause_of_death_10=Column(ehrql.codes.ICD10Code, constraints=()), cause_of_death_11=Column(ehrql.codes.ICD10Code, constraints=()), cause_of_death_12=Column(ehrql.codes.ICD10Code, constraints=()), cause_of_death_13=Column(ehrql.codes.ICD10Code, constraints=()), cause_of_death_14=Column(ehrql.codes.ICD10Code, constraints=()), cause_of_death_15=Column(ehrql.codes.ICD10Code, constraints=())))))}

(source)


Possibly now addressed in https://github.com/opensafely-core/ehrql/pull/1873