Closed joverlee521 closed 1 year ago
Everything worked as expected.
Yay! 🎉
My only question based on the testing below is whether we want to add serum_host as a trailing column to the tdb downloads. I think it would be nice to have a way to explicitly check the status of all records, even (or especially) the records without proper host annotation.
Sounds good to me. We will definitely want this column if we get human sera measurements from other CCs as well.
If I pull records from cdc_tdb/flu with the query --interval assay_date:2019-09-03,2022-11-28, this downloads 72,149 records. The new upload to test_tdb/flu only has 65,199 records. That's a difference of 6,950 records, which is not great...
However, I get 65,199 records even when I upload to test_tdb/flu with the old CDC upload on the main branch, so this difference in the number of records is not a direct effect of the changes in this PR. When I diff the records of the main branch upload and the new upload, the breakdown of records by assay date is identical.
When I diff the records of the new upload and the downloaded records from cdc_tdb by assay date, there are differences spread throughout the queried time span.
This makes me think that the CDC actually deleted these records from their data dump, but we just never deleted them from fauna. I'm comfortable with deleting the records and just starting with a clean slate of CDC data from 2019-09-03, but would like to hear other opinions here! cc: @huddlej @rneher @trvrb
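The per-date comparison described above can be sketched roughly as follows. This is a minimal sketch, not the commands actually used for this PR: it assumes both sets of records have been exported as tab-delimited files with an assay_date column, and the helper names and file names are hypothetical.

```python
import csv
from collections import Counter


def count_by_assay_date(path, date_field="assay_date"):
    """Count records per assay date in a tab-delimited file (assumed layout)."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter(row[date_field] for row in reader)


def diff_counts(old_counts, new_counts):
    """Return {date: (old, new)} for every date whose record counts differ."""
    dates = sorted(set(old_counts) | set(new_counts))
    return {
        date: (old_counts[date], new_counts[date])
        for date in dates
        if old_counts[date] != new_counts[date]
    }


if __name__ == "__main__":
    # Hypothetical file names: a download from cdc_tdb and one from test_tdb.
    old = count_by_assay_date("cdc_tdb_download.tsv")
    new = count_by_assay_date("test_tdb_download.tsv")
    for assay_date, (n_old, n_new) in diff_counts(old, new).items():
        print(assay_date, n_old, n_new, sep="\t")
```

An empty result from diff_counts would correspond to the "identical breakdown" seen between the main branch upload and the new upload; dates that appear only in the cdc_tdb download would show up with a count of 0 on the other side.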
Description of proposed changes
Fixes a couple of CDC titer upload issues:
126
130
129
The serum_id and serum_passage_category fields are part of the upload index fields, so we will need to delete records from cdc_tdb/flu before we can upload data with these new changes.
TODOs
- Upload to test_tdb/flu.
- Download from test_tdb/flu to make sure
  - serum_host is correctly parsed for "human" and "mouse"
  - serum_passage_category is correctly parsed for human pool sera
- Delete records from cdc_tdb/flu, maybe filter via assay_date. The earliest assay_date included in the current CDC database dump is 2019-09-03.
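As a rough illustration of the last two checks, the sketch below tallies serum_host values in a set of downloaded records and selects records whose assay_date falls on or after 2019-09-03. It is only a sketch under stated assumptions: records are treated as plain dicts keyed by the field names used in this PR, the helper names are hypothetical, and nothing here reflects how fauna itself parses or deletes records.

```python
from collections import Counter
from datetime import date

# Earliest assay_date in the current CDC database dump, as stated in this PR.
EARLIEST_ASSAY_DATE = date(2019, 9, 3)


def tally_serum_host(records):
    """Count records per serum_host value; empty or absent values become 'missing'."""
    return Counter(record.get("serum_host") or "missing" for record in records)


def in_deletion_window(record, start=EARLIEST_ASSAY_DATE):
    """True if the record's assay_date (YYYY-MM-DD) is on or after `start`."""
    try:
        assay_date = date.fromisoformat(record["assay_date"])
    except (KeyError, TypeError, ValueError):
        # Missing or unparsable dates are left alone rather than guessed at.
        return False
    return assay_date >= start


if __name__ == "__main__":
    # Made-up records, for illustration only.
    records = [
        {"assay_date": "2019-09-03", "serum_host": "human"},
        {"assay_date": "2018-01-15", "serum_host": "mouse"},
        {"assay_date": "2020-02-10", "serum_host": None},
    ]
    print(tally_serum_host(records))
    print([r for r in records if in_deletion_window(r)])
```

Counting "missing" explicitly keeps records without proper host annotation visible in the tally instead of silently dropping them.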