seattleflu / incidence-mapper

R interface to database, map model training, and model data API Server
MIT License

Simulated data and real data workflow have diverged too far and that affects testing #114

Open famulare opened 5 years ago

famulare commented 5 years ago

@tinghf alerted me that this block of code breaks on the simulated data because sample isn't a valid column.

https://github.com/seattleflu/incidence-mapper/blob/97ad7e23fc0d2b6a7db760fa3e29f1996e782721/dbViewR/R/selectFromDB.R#L138-L153

The short-term fix is to wrap this block in an if(source == 'production') check, as in:

if(source == 'production'){

  # filter out nested PCR targets to retain high-level target only
  # Flu A
  keepTargetList <- unique(db$sample[db$pathogen %in% c("Flu_A_H1","Flu_A_H3")])
  dropTargetList <- unique(db$sample[db$pathogen %in% c("Flu_A_pan")])

  dropSampleList <- intersect(dropTargetList,keepTargetList)

  db <- db %>% filter( !(sample %in% dropSampleList & db$pathogen %in% c("Flu_A_pan")))

  # enterovirus
  keepTargetList <- unique(db$sample[db$pathogen %in% c("EV_D68")])
  dropTargetList <- unique(db$sample[db$pathogen %in% c("EV_pan")])

  dropSampleList <- intersect(dropTargetList,keepTargetList)

  db <- db %>% filter( !(sample %in% dropSampleList & db$pathogen %in% c("EV_pan")))
}
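As an alternative to keying on source, the guard could check for the column itself, and the two near-identical blocks could share one helper. This is a minimal sketch under assumptions: the name dropNestedTargets and its arguments are hypothetical (not part of dbViewR), the data frame is toy data, and it is written in base R so it runs without dplyr.

```r
# Hypothetical helper (not part of dbViewR): drop the pan-target rows for
# samples that also have a more specific target result.
dropNestedTargets <- function(db, panTarget, specificTargets) {
  # Skip the filter when the required columns are absent (e.g. simulated data)
  if (!all(c("sample", "pathogen") %in% names(db))) {
    return(db)
  }
  keepTargetList <- unique(db$sample[db$pathogen %in% specificTargets])
  dropTargetList <- unique(db$sample[db$pathogen %in% panTarget])
  dropSampleList <- intersect(dropTargetList, keepTargetList)
  db[!(db$sample %in% dropSampleList & db$pathogen %in% panTarget), , drop = FALSE]
}

# Toy data for illustration only
db <- data.frame(
  sample   = c("s1", "s1", "s2", "s3", "s3"),
  pathogen = c("Flu_A_pan", "Flu_A_H1", "Flu_A_pan", "EV_pan", "EV_D68"),
  stringsAsFactors = FALSE
)

db <- dropNestedTargets(db, "Flu_A_pan", c("Flu_A_H1", "Flu_A_H3"))
db <- dropNestedTargets(db, "EV_pan", "EV_D68")
print(db)
```

In the toy data, s2 keeps its Flu_A_pan row because it has no subtype result, while s1 and s3 lose their pan rows. A data frame without a sample column is returned unchanged.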

Long term, we should keep the simulated data synchronized with the necessary test cases. You can see the workflow pattern to do that in commits to the simulated-data repo: https://github.com/seattleflu/simulated-data/commits/master.
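One way to notice this kind of drift early is a schema check that compares the simulated data against the columns the production code path uses. The sketch below is illustrative only: requiredColumns, missingColumns, and simDb are hypothetical names, and the column list is an assumption based on the block above.

```r
# Hypothetical list of columns the production code path relies on
requiredColumns <- c("sample", "pathogen")

# Return the required columns that a data frame lacks
missingColumns <- function(db, required = requiredColumns) {
  setdiff(required, names(db))
}

# Toy stand-in for a simulated extract that lacks `sample`
simDb <- data.frame(
  pathogen = c("Flu_A_pan", "EV_pan"),
  stringsAsFactors = FALSE
)

absent <- missingColumns(simDb)
if (length(absent) > 0) {
  message("simulated data is missing columns: ", paste(absent, collapse = ", "))
}
```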

tinghf commented 5 years ago

In Bamboo, this manifests as an error like the following:

Error in match(x, table, nomatch = 0L) :
  'match' requires vector arguments
Calls: expandDB ... as.data.frame -> filter -> filter.tbl_df -> filter_impl -> %in%
Execution halted

It fails on the following line in selectFromDB.R:

db <- db %>% filter( !(sample %in% dropSampleList & db$pathogen %in% c("EV_pan")))
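The likely mechanism: when the data has no sample column, the name sample inside filter() falls through to base R's sample() function, and %in% cannot match a function against a vector. A minimal base-R sketch that should reproduce the same message without dplyr or the database (dropSampleList here is toy data):

```r
# `sample` here is base R's sample() function, because no column or variable
# of that name exists in this scope.
dropSampleList <- c("s1", "s2")

msg <- tryCatch(
  {
    sample %in% dropSampleList
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```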