Added a test to identify duplicate prevalences/incidences. Dropped duplicate HSV-1 and HSV-2 seroprevalence estimates.

simonleandergrimm commented 1 year ago

We sometimes had multiple prevalence/incidence estimates for the same pathogen, time, and place. This can lead to errors in downstream analyses. Instead, we want to either combine or drop estimates so that there is only one estimate per time, place, and pathogen.

This PR adds a test that checks prevalence and incidence estimates separately for duplicates. It also drops the duplicate estimates for HSV-1 and HSV-2 that this test identified.
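For illustration, a duplicate check of this kind could look roughly like the sketch below. It is not the test from this PR: the `Estimate` dataclass, its field names, the `find_duplicates` helper, and all values are hypothetical placeholders, and the real classes in pathogen_properties.py may be structured differently.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    # Hypothetical stand-in for a prevalence or incidence estimate.
    kind: str  # "prevalence" or "incidence"
    pathogen: str
    location: str
    period: str
    value: float


def find_duplicates(estimates, kind):
    """Return the (pathogen, location, period) keys that occur more than once for one kind."""
    counts = Counter(
        (e.pathogen, e.location, e.period) for e in estimates if e.kind == kind
    )
    return sorted(key for key, n in counts.items() if n > 1)


# Placeholder values, not real data.
estimates = [
    Estimate("prevalence", "pathogen_a", "Country X", "2020", 0.40),
    Estimate("prevalence", "pathogen_a", "Country X", "2020", 0.45),
    Estimate("prevalence", "pathogen_b", "Country X", "2020", 0.10),
    Estimate("incidence", "pathogen_b", "Country X", "2020", 0.01),
]

# Prevalences and incidences are checked separately, as described above.
duplicate_prevalences = find_duplicates(estimates, "prevalence")
duplicate_incidences = find_duplicates(estimates, "incidence")
```

In a test, each of the two lists would be asserted to be empty, so that a failure lists the offending pathogen, place, and time combinations.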

Fixes #149.

simonleandergrimm commented 1 year ago

By returning, do you mean not adding them to estimate_*, while still keeping them in the pathogen.py file? I'd rather not do that, as it might be confusing.

Given our current set of pathogens, I can simply cut the duplicate HSV-1 and HSV-2 estimates: those represent the raw NHANES data, while the CDC estimates are based on that data. I will instead mention the NHANES data in the comments of the CDC-based prevalence estimates. In that case, I'd also remove class Primary(Enum).

Does that sound good?

On Fri, 9 Jun 2023 at 14:18, Jeff Kaufman @.***> wrote:

@.**** requested changes on this pull request.

In pathogen_properties.py https://github.com/naobservatory/p2ra/pull/152#discussion_r1224614799:

@@ -39,6 +39,11 @@ class Active(Enum):
     LATENT = "Latent"

+class Primary(Enum):

What's the argument for adding a primary/secondary distinction instead of just only returning our best estimate for every location+timeperiod+taxid?


jeffkaufman commented 1 year ago

@simonleandergrimm I think we should be generating a single best-effort estimate, either by cutting low-quality estimates or by combining multiple estimates. Your choice which!

simonleandergrimm commented 1 year ago

I cut the estimates, ready for re-review!

