naobservatory / p2ra

Match all target MGS samples with public health data #136

Closed jeffkaufman closed 1 year ago

jeffkaufman commented 1 year ago

For modeling, we need to identify a public health estimate for each pathogen for each MGS sample. In #133 I added check-matches.py to tell us about the cases where we were not seeing one, and here I (a) fix the mismatches and (b) integrate check-matches.py into test.py so we don't accidentally regress.

To fix the mismatches I used a few different strategies:

  1. For HIV and HAV I edited the pathogen file to extrapolate to future years. This makes sense because we expect these prevalences to be reasonably constant, and if they're not then the pathogen file is the right place to adjust.
  2. For Influenza and a few others we had estimates for specific days that didn't match the specific days where we had MGS samples. I tweaked the matching to take the closest variable within two weeks. Possibly we should be fancy and interpolate, but I doubt this is needed.
  3. In some cases (HIV) we have estimates for one county that are really for the metropolitan area. Modified the matching to understand these.
  4. For Influenza we were estimating separately for Flu A and B, but the modeling code ignored any estimates that set a taxid. I didn't fix this; I just moved the check to choose_predictor, and @dp-rice can fix it (ok?)
  5. For West Nile and Dengue I just skipped them, because we don't care about these (far too rare).
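
To illustrate strategy 2, here is a minimal sketch of "take the closest estimate within two weeks". It is not the code in this PR: `Estimate`, `closest_estimate`, and `MAX_GAP` are hypothetical names, estimates are reduced to `(date, value)` pairs, and treating exactly 14 days as a match is an assumption.

```python
from datetime import date, timedelta
from typing import Optional, Sequence, Tuple

# Hypothetical stand-in for a public health estimate: (date, value).
Estimate = Tuple[date, float]

# Assumption: "within two weeks" means at most 14 days, inclusive.
MAX_GAP = timedelta(days=14)


def closest_estimate(
    sample_date: date,
    estimates: Sequence[Estimate],
    max_gap: timedelta = MAX_GAP,
) -> Optional[Estimate]:
    """Return the estimate dated closest to sample_date.

    Returns None when no estimate falls within max_gap, so the caller
    can report the sample as unmatched instead of silently guessing.
    """
    candidates = [e for e in estimates if abs(e[0] - sample_date) <= max_gap]
    if not candidates:
        return None
    # On a tie, min() keeps the first candidate in input order.
    return min(candidates, key=lambda e: abs(e[0] - sample_date))


estimates = [
    (date(2021, 1, 1), 0.1),
    (date(2021, 1, 10), 0.2),
    (date(2021, 3, 1), 0.3),
]
match = closest_estimate(date(2021, 1, 8), estimates)
unmatched = closest_estimate(date(2021, 2, 5), estimates)
```

Returning `None` rather than the nearest estimate at any distance keeps a check like check-matches.py meaningful: a sample with no estimate nearby still shows up as a mismatch.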