Problem

Running `etlr garden/neglected_tropical_diseases/2024-05-02/soil_transmitted_helminthiases --only --force` on local and on production gives different results for some countries and years. This is because we drop duplicates, and the row order can differ by chance (it looks like the meadow ordering of rows may vary between machines).

I looked at some of the duplicates; here is one example:
```
      country  year drug_combination__pre_sac  ...  reported_number_of_sac_treated  programme_coverage__sac__pct  national_coverage__sac__pct
1238    Haiti  2013                       Alb  ...                      365948.000                     65.975807                    15.519107
1239    Haiti  2013                   Dec+Alb  ...                     1693225.375                           NaN                    71.806229
```

We drop those duplicates here.
We should avoid non-deterministic code and prefer to force it down a single path, even if we only have weak confidence that it's the right one.
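To make the failure mode concrete, here is a minimal, hypothetical pandas sketch (column names invented, not the real step's code) showing that `drop_duplicates` keeps whichever duplicate happens to come first, so the surviving row depends entirely on input order:

```python
import pandas as pd

# Hypothetical stand-in for the meadow table: two rows that are
# duplicates on (country, year) but disagree on the value columns.
rows = [
    {"country": "Haiti", "year": 2013, "drug": "Alb", "coverage": 65.98},
    {"country": "Haiti", "year": 2013, "drug": "Dec+Alb", "coverage": 71.81},
]

df_a = pd.DataFrame(rows)        # row order on one machine
df_b = pd.DataFrame(rows[::-1])  # row order on another machine

# drop_duplicates keeps the *first* occurrence by default, so the
# surviving row differs between the two orderings.
kept_a = df_a.drop_duplicates(subset=["country", "year"])
kept_b = df_b.drop_duplicates(subset=["country", "year"])

print(kept_a["drug"].item())  # Alb
print(kept_b["drug"].item())  # Dec+Alb
```

Same data, same code, different output — exactly the spurious diff we see between local and production.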
Impact
When data steps are non-deterministic, we see spurious differences between datasets when we compare code built on different machines, which adds noise to our workflow.
It's also unclear which of the possible outcomes is the one we want.
Approaches
We should either sort rows before dropping duplicates, or dig into why there are duplicates in the first place (and possibly drop certain categories from `drug_combination__pre_sac`).
cc @spoonerf