Problem

Running `etlr garden/neglected_tropical_diseases/2024-05-02/soil_transmitted_helminthiases --only --force` on local and on production gives different results for some countries and years. This is because we drop duplicates, and the row order can differ by chance (it looks like the meadow ordering of rows may vary between machines).

I looked at some of the duplicates; here is one example:
```
      country  year drug_combination__pre_sac  ...  reported_number_of_sac_treated  programme_coverage__sac__pct  national_coverage__sac__pct
1238    Haiti  2013                       Alb  ...                      365948.000                     65.975807                    15.519107
1239    Haiti  2013                   Dec+Alb  ...                     1693225.375                           NaN                    71.806229
```

We drop those duplicates here.
We should avoid non-deterministic code and prefer to force it down a single path, even if we only have weak confidence that it's the right one.
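To make the failure mode concrete, here is a minimal, hypothetical pandas sketch (column names invented, not the real step's code) showing that `drop_duplicates` keeps whichever duplicate happens to come first, so the surviving row depends entirely on input order:

```python
import pandas as pd

# Hypothetical stand-in for the meadow table: two rows that are
# duplicates on (country, year) but disagree on the value columns.
rows = [
    {"country": "Haiti", "year": 2013, "drug": "Alb", "coverage": 65.98},
    {"country": "Haiti", "year": 2013, "drug": "Dec+Alb", "coverage": 71.81},
]

df_a = pd.DataFrame(rows)        # row order on one machine
df_b = pd.DataFrame(rows[::-1])  # row order on another machine

# drop_duplicates keeps the *first* occurrence by default, so the
# surviving row differs between the two orderings.
kept_a = df_a.drop_duplicates(subset=["country", "year"])
kept_b = df_b.drop_duplicates(subset=["country", "year"])

print(kept_a["drug"].item())  # Alb
print(kept_b["drug"].item())  # Dec+Alb
```

Same data, same code, different output — exactly the spurious diff we see between local and production.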
Impact
When data steps are non-deterministic, we see spurious differences between datasets when we compare code built on different machines, which adds noise to our workflow.
It's also unclear which of the possible outcomes is the one we want.
Approaches
We should either sort rows before dropping duplicates, or dig into why there are duplicates in the first place (and possibly drop certain categories from `drug_combination__pre_sac`).
cc @spoonerf