A few things were going on here, and was a fun rabbit hole!
TL/DR : main things contributing to the random failing were:
Floating point precision
SQL median is weird
Accidentally using random values for Patient.diagnosis_date
All have been fixed. Run the test many times as it should now be fine (importantly, no random values are being used now, so one pass = deterministic!).
Fun details on above:
1. Floating point precision
We have our Visit.hba1c field defined as a DecimalField. The actual kpi calculation method querying this outputs a python Decimal( ) object. My test was just using float( ) objects. Good article on difference: https://www.laac.dev/blog/float-vs-decimal-python/ but generally:
floats are good for speed & convenience, use the classic IEEE 754 binary representation, can lead to rounding precision errors
Decimal, somehow uses base-10 representation`, but means it can accurately represent decimal numbers without the traditional floating point errors
This was partly leading to errors as tests are comparing different types of values. Fixed by converting expected values into Decimal. The output of the calculate_kpi_45_median_hba1c still returns a float though for convenience
2. SQL median is weird
There is no standard implementation of median in Django's ORM because it's more complicated to calculate. I created a custom SQL class for it but that ended up being harder to reason with. Instead, did a simpler approach of just using ORM to retrieve a list of hba1c values, and then calculate the median in Python. Much more straightforward.
3. Accidentally using random values for Patient.diagnosis_date
Rookie error! In the tests, just forgot to set PatientFactory.diagnosis_date. Expected behavior of the PatientFactory is to generate random valid values for any not given (to make it easier to create Patient objects).
Fixed by just defining.
Code changes
tests updated to use Decimal() values for hba1c instead of float
tests updated to specify diagnosis_date, thus not relying on the PatientFactory() generating a random diagnosis_date
tests updated to be a little more clear for the invalid case
calculate_median implementation moved into Python instead of SQL, simplifying and circumventing harder-to-reason precision errors
calculate_kpi_45_median_hba1c method more straightforward because of using a Python way of calculating median
also update type annotations of returns from calculate_kpi_ methods to KPIResult
Signed-off-by: anchit-chandran anchit97123@gmail.com
Overview
A few things were going on here, and was a fun rabbit hole!
TL/DR : main things contributing to the random failing were:
Patient.diagnosis_date
All have been fixed. Run the test many times as it should now be fine (importantly, no random values are being used now, so one pass = deterministic!).
Fun details on above:
1. Floating point precision
We have our
Visit.hba1c
field defined as aDecimalField
. The actual kpi calculation method querying this outputs a pythonDecimal( )
object. My test was just usingfloat( )
objects. Good article on difference: https://www.laac.dev/blog/float-vs-decimal-python/ but generally:float
s are good for speed & convenience, use the classicIEEE 754
binary representation, can lead to rounding precision errorsDecimal
, somehow usesbase-10
representation`, but means it can accurately represent decimal numbers without the traditional floating point errorsThis was partly leading to errors as tests are comparing different types of values. Fixed by converting expected values into
Decimal
. The output of thecalculate_kpi_45_median_hba1c
still returns afloat
though for convenience2. SQL median is weird
There is no standard implementation of median in Django's ORM because it's more complicated to calculate. I created a custom SQL class for it but that ended up being harder to reason with. Instead, did a simpler approach of just using ORM to retrieve a list of hba1c values, and then calculate the median in Python. Much more straightforward.
3. Accidentally using random values for
Patient.diagnosis_date
Rookie error! In the tests, just forgot to set
PatientFactory.diagnosis_date
. Expected behavior of thePatientFactory
is to generate random valid values for any not given (to make it easier to createPatient
objects).Fixed by just defining.
Code changes
Decimal()
values for hba1c instead offloat
diagnosis_date
, thus not relying on thePatientFactory()
generating a randomdiagnosis_date
calculate_median
implementation moved into Python instead of SQL, simplifying and circumventing harder-to-reason precision errorscalculate_kpi_45_median_hba1c
method more straightforward because of using a Python way of calculating mediancalculate_kpi_
methods toKPIResult
Related Issues
https://github.com/orgs/rcpch/projects/13?pane=issue&itemId=85211949&issue=rcpch%7Cnational-paediatric-diabetes-audit%7C338
Mentions
@mentions of the person or team responsible for reviewing proposed changes.