serum_passage_category should be set to "egg" instead of "cell" for CDC human pool data like "L21/22 H3-EGG HUMAN POOL"

huddlej commented 2 years ago

Current behavior

Human pool titers represent measurements for people vaccinated with either cell-passaged or egg-passaged vaccine strains. Data from the CDC represent this passage status with names like L21/22 H3-EGG HUMAN POOL in the serum id. Egg-passaged data appear in the cell-passaged downloads from fauna, however. For example, the following command returns a list of egg-passaged data for H3N2:

grep H3-EGG data/h3n2/who_cell_fra_titers.tsv

Expected behavior

These egg-passaged data should only appear in the corresponding egg-passaged titer file (e.g., data/h3n2/who_egg_fra_titers.tsv for the example above). The serum_passage_category of these records should be set to egg instead of cell.
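
As a rough way to verify the expected behavior, the grep above could be mirrored in Python. This is only a sketch: find_egg_pool_rows is a hypothetical helper (not part of fauna), it scans whole lines rather than a specific column, and the sample rows are made up for illustration.

```python
import io


def find_egg_pool_rows(lines, marker="H3-EGG"):
    """Return every row that mentions the marker, like `grep H3-EGG`."""
    return [line.rstrip("\n") for line in lines if marker in line]


# Made-up rows for illustration; real titer files have their own columns.
sample = io.StringIO(
    "A/Example/1/2021\tL21/22 H3-EGG HUMAN POOL\t160\n"
    "A/Example/2/2021\tA/Example/3/2021 ferret serum\t80\n"
)
misplaced = find_egg_pool_rows(sample)

# Against a real download, one might instead run:
#   with open("data/h3n2/who_cell_fra_titers.tsv") as handle:
#       misplaced = find_egg_pool_rows(handle)
# and expect an empty list once this issue is fixed.
```

An empty result for the cell-passaged file would indicate that no egg-passaged human pool records remain there.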

Possible solution

We may need to check each measurement's serum id for the appearance of "egg" and override the inferred serum passage status based on what we find. For example, similar logic already exists to set the "host" for each measurement based on its serum id. There might be a cleaner fauna-style way to implement this check though.
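
A minimal sketch of what that override could look like, assuming a standalone helper. The name override_passage_category and the regex are hypothetical and are not existing fauna code:

```python
import re

# Assumption: the word "egg" (any case) in the serum id marks an
# egg-passaged human pool, as in "L21/22 H3-EGG HUMAN POOL".
EGG_PATTERN = re.compile(r"\begg\b", re.IGNORECASE)


def override_passage_category(serum_id, inferred_category):
    """Return "egg" when the serum id mentions egg, else the inferred value."""
    if EGG_PATTERN.search(serum_id):
        return "egg"
    return inferred_category
```

For example, override_passage_category("L21/22 H3-EGG HUMAN POOL", "cell") would return "egg", while a serum id without "egg" keeps whatever category was inferred.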

huddlej commented 2 years ago

@joverlee521 Maybe we can work on this together? It seems like a good opportunity for me to learn more about fauna's internal workings...

joverlee521 commented 2 years ago

Here's the current parsing of the serum passage category for CDC titers:

1. The original sr_passage column in the CDC TSV is mapped to serum_antigen_passage.
2. Within tdb/cdc_upload, the serum_antigen_passage column is used to infer serum_passage_category.
3. The format_passage method is inherited from vdb/flu_upload, which uses a series of regexes to parse the passage category.

We can special-case the human pool titers and use the lot_number to format the serum_passage_category. (lot_number is the column that contains names like 21/22 H3-EGG HUMAN POOL, because the serum_id formatting happens after the serum passage formatting.)
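
Roughly, the special case could look like the sketch below. It assumes each measurement is a dict with lot_number and serum_passage_category keys; set_human_pool_passage is a hypothetical name, and only the egg case is handled because that is the only naming pattern confirmed so far.

```python
def set_human_pool_passage(meas):
    """Override serum_passage_category for egg-passaged human pool records.

    Hypothetical helper: leaves every other measurement untouched so the
    category inferred by the existing passage parsing is preserved.
    """
    lot_number = (meas.get("lot_number") or "").upper()
    if "HUMAN POOL" in lot_number and "EGG" in lot_number:
        meas["serum_passage_category"] = "egg"
    return meas


# Illustrative record only; real measurements carry many more fields.
pool = {"lot_number": "21/22 H3-EGG HUMAN POOL", "serum_passage_category": "cell"}
set_human_pool_passage(pool)
```

Restricting the override to lot numbers that contain both "HUMAN POOL" and "EGG" keeps non-human records on the existing parsing path.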

huddlej commented 2 years ago

Thank you for laying out the steps so clearly, @joverlee521! Special casing the human pool titers sounds reasonable. Would that logic live in the format_passage function?

joverlee521 commented 2 years ago

> Special casing the human pool titers sounds reasonable. Would that logic live in the format_passage function?

Hmm, I'm a little hesitant to make format_passage any more complicated 😅 Maybe we can just keep all the human pool specific logic in one place within tdb/cdc_upload:

diff --git a/tdb/cdc_upload.py b/tdb/cdc_upload.py
index 3a007c2..7aa6b3d 100644
--- a/tdb/cdc_upload.py
+++ b/tdb/cdc_upload.py
@@ -72,6 +72,7 @@ class cdc_upload(upload):
                 self.test_virus_strains.add(meas['virus_strain'])
             if "Human" in meas['serum_id']:
                 meas['serum_host'] = 'human'
+                self.format_passage(meas, 'serum_id', 'serum_passage_category')
             self.rethink_io.check_optional_attributes(meas, self.optional_fields)
             self.remove_fields(meas)
         if len(self.new_different_date_format) > 0:

huddlej commented 2 years ago

I know what you mean! That function is among the hairier I've seen in this repo. If we start getting human data from other CCs, though, would you want to encode the human-specific parsing in each respective upload script? Or just refactor any shared parsing logic into a new function when we need to?

joverlee521 commented 2 years ago

Yup, I would want to keep the human-specific parsing in each respective upload script because I'm expecting each CC to provide them in different formats. If there's any parsing logic that can be shared, then we can refactor it into a new function.

huddlej commented 2 years ago

Sounds good to me!
