Closed huddlej closed 1 year ago
@joverlee521 Maybe we can work on this together? It seems like a good opportunity for me to learn more about fauna's internal workings...
Here's the current parsing of the serum passage category for CDC titers:
1. The `sr_passage` column in the CDC TSV is mapped to `serum_antigen_passage`.
2. The `serum_antigen_passage` column is used to infer `serum_passage_category`.
3. The `format_passage` method is inherited from vdb/flu_upload, which uses a series of regexes to parse the passage category.

We can special case the human pool titers and use the `lot_number` to format the `serum_passage_category`. (`lot_number` is the column that contains the names like `21/22 H3-EGG HUMAN POOL`, since the `serum_id` formatting happens after the serum passage formatting.)
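As a rough illustration of the special case described above, here is a minimal sketch that reads the passage category out of a human pool lot number. The helper name `human_pool_passage_category` and the regex are hypothetical and not part of fauna; the sketch also assumes that cell-passaged pools use a `CELL` token in the same position as `EGG`, which this thread does not confirm.

```python
import re

# Hypothetical pattern: "EGG" or "CELL" as a whole token, as in
# "21/22 H3-EGG HUMAN POOL". The "CELL" spelling is an assumption.
HUMAN_POOL_PASSAGE = re.compile(r"\b(EGG|CELL)\b", re.IGNORECASE)


def human_pool_passage_category(lot_number):
    """Return 'egg' or 'cell' for a human pool lot number, otherwise None."""
    # Only handle human pool names; leave everything else to the existing parsing.
    if not lot_number or "HUMAN POOL" not in lot_number.upper():
        return None
    match = HUMAN_POOL_PASSAGE.search(lot_number)
    return match.group(1).lower() if match else None
```

A caller could set `serum_passage_category` only when this helper returns a value, so that non-human sera keep going through `format_passage` unchanged.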
Thank you for laying out the steps so clearly, @joverlee521! Special casing the human pool titers sounds reasonable. Would that logic live in the `format_passage` function?
> Special casing the human pool titers sounds reasonable. Would that logic live in the `format_passage` function?

Hmm, I'm a little hesitant to make `format_passage` any more complicated 😅
Maybe we can just keep all the human pool specific logic in one place within `tdb/cdc_upload`:
diff --git a/tdb/cdc_upload.py b/tdb/cdc_upload.py
index 3a007c2..7aa6b3d 100644
--- a/tdb/cdc_upload.py
+++ b/tdb/cdc_upload.py
@@ -72,6 +72,7 @@ class cdc_upload(upload):
             self.test_virus_strains.add(meas['virus_strain'])
             if "Human" in meas['serum_id']:
                 meas['serum_host'] = 'human'
+                self.format_passage(meas, 'serum_id', 'serum_passage_category')
             self.rethink_io.check_optional_attributes(meas, self.optional_fields)
             self.remove_fields(meas)
         if len(self.new_different_date_format) > 0:
I know what you mean! That function is among the hairier I've seen in this repo. If we start getting human data from other CCs, though, would you want to encode the human-specific parsing in each respective upload script? Or just refactor any shared parsing logic into a new function when we need to?
Yup, I would want to keep the human-specific parsing in each respective upload script because I'm expecting each CC to provide them in different formats. If there's any parsing logic that can be shared, then we can refactor it into a new function.
Sounds good to me!
Current Behavior
Human pool titers represent measurements for people vaccinated with either cell-passaged or egg-passaged vaccine strains. Data from the CDC represent this passage status with names like `L21/22 H3-EGG HUMAN POOL` in the serum id. Egg-passaged data appear in the cell-passaged downloads from fauna, however. For example, the following command returns a list of egg-passaged data for H3N2:
Expected behavior
These egg-passaged data should only appear in the corresponding egg-passaged titer file (e.g., `data/h3n2/who_egg_fra_titers.tsv` for the example above). The `serum_passage_category` of these records should be set to `egg` instead of `cell`.
Possible solution
We may need to check each measurement's serum id for the appearance of "egg" and override the inferred serum passage status based on what we find. For example, similar logic already exists to set the "host" for each measurement based on whether "Human" appears in its serum id. There might be a cleaner fauna-style way to implement this check, though.
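A minimal sketch of that override, written as a standalone function over a measurement dict rather than as fauna's actual upload code. The function name `override_human_pool_passage` is hypothetical; the keys `serum_id`, `serum_host`, and `serum_passage_category` are the ones mentioned in this thread, and the case-insensitive substring match is an assumption about how the names are spelled.

```python
def override_human_pool_passage(meas):
    """Set the host and passage category for human pool sera, in place."""
    serum_id = (meas.get("serum_id") or "").lower()
    if "human" in serum_id:
        # Mirrors the existing host check on the serum id.
        meas["serum_host"] = "human"
        # Override whatever passage category was inferred earlier.
        if "egg" in serum_id:
            meas["serum_passage_category"] = "egg"
    return meas


# Example record with the category wrongly inferred as "cell".
record = {"serum_id": "L21/22 H3-EGG HUMAN POOL", "serum_passage_category": "cell"}
override_human_pool_passage(record)
```

Restricting the "egg" check to human sera keeps the override from touching records whose passage category is already handled by the existing parsing.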