nextstrain / fauna

RethinkDB database to support real-time virus analysis
GNU Affero General Public License v3.0

Fix cdc uploads #131

Closed joverlee521 closed 1 year ago

joverlee521 commented 2 years ago

Description of proposed changes

Fixes a couple of CDC titer upload issues:

The serum_id and serum_passage_category fields are part of the upload index fields, so we will need to delete records from cdc_tdb/flu before we can upload data with these new changes.
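To illustrate why the existing records have to be deleted first, here is a minimal sketch of a composite upload index, assuming records are upserted under a key built from their index fields. The field list, `make_index`, `upsert`, and the example strain names are all hypothetical and are not fauna's actual code. Once the value of any index field changes, the re-uploaded record gets a new key and sits next to the old record instead of overwriting it.

```python
# Hypothetical index fields for illustration; the real list lives in fauna's upload code.
INDEX_FIELDS = [
    "virus_strain",
    "serum_strain",
    "serum_id",
    "serum_passage_category",
    "assay_date",
]


def make_index(record, fields=INDEX_FIELDS):
    """Build a composite key from the index fields of one record."""
    return tuple(str(record.get(field)) for field in fields)


def upsert(table, record):
    """Insert the record, overwriting any record stored under the same key."""
    table[make_index(record)] = record


# Made-up measurement as it might look before and after the parsing fixes.
old = {
    "virus_strain": "A/Example/1/2019",
    "serum_strain": "A/Reference/2/2018",
    "serum_id": "F01",
    "serum_passage_category": None,
    "assay_date": "2019-09-03",
    "titer": 160,
}
new = dict(old, serum_id="F01-2019", serum_passage_category="cell")

table = {}
upsert(table, old)
upsert(table, new)  # different key, so the old record is not replaced
print(len(table))   # 2
```

With the stale record removed before the second upload, the table would hold only the corrected record.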

TODOs

joverlee521 commented 1 year ago

> Everything worked as expected.

Yay! 🎉

> My only question based on the testing below is whether we want to add serum_host as a trailing column to the tdb downloads. I think it would be nice to have a way to explicitly check the status of all records, even (or especially) the records without proper host annotation.

Sounds good to me. We will definitely want this column if we get human sera measurements from other CCs as well.

joverlee521 commented 1 year ago

If I pull records from cdc_tdb/flu with the query --interval assay_date:2019-09-03,2022-11-28, I get 72,149 records. The new upload to test_tdb/flu has only 65,199 records. That's a difference of 6,950 records, which is not great...

However, I get 65,199 records even when I upload to test_tdb/flu with the old CDC upload on the main branch, so this difference in the number of records is not a direct effect of the changes in this PR. When I diff the records of the main branch upload and the new upload, the breakdown of records by assay date is identical.

When I diff the records of the new upload and the downloaded records from cdc_tdb by assay date, there are differences spread throughout the queried time span.

See detailed diff by assay date

| assay_date | cdc_tdb | new_test_tdb | diff |
|------------|--------:|-------------:|-----:|
| 2019-09-03 | 324 | 270 | 54 |
| 2019-09-04 | 232 | 200 | 32 |
| 2019-09-05 | 384 | 320 | 64 |
| 2019-09-09 | 384 | 320 | 64 |
| 2019-09-10 | 116 | 105 | 11 |
| 2019-09-12 | 295 | 270 | 25 |
| 2019-09-17 | 570 | 432 | 138 |
| 2019-09-19 | 280 | 200 | 80 |
| 2019-09-25 | 276 | 240 | 36 |
| 2019-09-26 | 259 | 185 | 74 |
| 2019-10-01 | 248 | 196 | 52 |
| 2019-10-03 | 552 | 488 | 64 |
| 2019-10-08 | 215 | 192 | 23 |
| 2019-10-10 | 438 | 331 | 107 |
| 2019-10-15 | 364 | 260 | 104 |
| 2019-10-16 | 320 | 252 | 68 |
| 2019-10-17 | 445 | 300 | 145 |
| 2019-10-23 | 328 | 252 | 76 |
| 2019-10-29 | 593 | 451 | 142 |
| 2019-10-31 | 215 | 192 | 23 |
| 2019-11-06 | 331 | 259 | 72 |
| 2019-11-07 | 333 | 200 | 133 |
| 2019-11-12 | 326 | 256 | 70 |
| 2019-11-14 | 395 | 270 | 125 |
| 2019-11-19 | 564 | 467 | 97 |
| 2019-11-20 | 341 | 0 | 341 |
| 2019-11-21 | 336 | 250 | 86 |
| 2019-11-26 | 587 | 470 | 117 |
| 2019-11-27 | 255 | 224 | 31 |
| 2019-12-03 | 348 | 270 | 78 |
| 2019-12-04 | 304 | 238 | 66 |
| 2019-12-05 | 264 | 192 | 72 |
| 2019-12-10 | 404 | 272 | 132 |
| 2019-12-11 | 279 | 245 | 34 |
| 2019-12-17 | 755 | 626 | 129 |
| 2019-12-18 | 311 | 245 | 66 |
| 2019-12-19 | 403 | 310 | 93 |
| 2019-12-27 | 247 | 203 | 44 |
| 2020-01-07 | 299 | 230 | 69 |
| 2020-01-08 | 207 | 182 | 25 |
| 2020-01-09 | 314 | 272 | 42 |
| 2020-01-13 | 340 | 200 | 140 |
| 2020-01-14 | 288 | 192 | 96 |
| 2020-01-17 | 233 | 208 | 25 |
| 2020-01-28 | 247 | 192 | 55 |
| 2020-01-30 | 248 | 200 | 48 |
| 2020-02-04 | 288 | 232 | 56 |
| 2020-02-05 | 277 | 184 | 93 |
| 2020-02-13 | 680 | 274 | 406 |
| 2020-02-14 | 513 | 431 | 82 |
| 2020-02-18 | 350 | 110 | 240 |
| 2020-02-19 | 462 | 402 | 60 |
| 2020-02-20 | 322 | 297 | 25 |
| 2020-02-25 | 388 | 330 | 58 |
| 2020-02-26 | 295 | 259 | 36 |
| 2020-02-27 | 300 | 240 | 60 |
| 2020-03-04 | 266 | 259 | 7 |
| 2020-03-12 | 902 | 820 | 82 |
| 2020-03-20 | 375 | 319 | 56 |
| 2020-03-26 | 500 | 436 | 64 |
| 2020-04-02 | 618 | 527 | 91 |
| 2020-04-09 | 656 | 557 | 99 |
| 2020-04-14 | 159 | 144 | 15 |
| 2020-05-05 | 376 | 342 | 34 |
| 2020-05-07 | 620 | 589 | 31 |
| 2020-06-25 | 171 | 152 | 19 |
| 2020-07-14 | 252 | 180 | 72 |
| 2020-08-25 | 241 | 190 | 51 |
| 2020-08-28 | 209 | 201 | 8 |
| 2020-09-01 | 180 | 152 | 28 |
| 2020-09-03 | 253 | 207 | 46 |
| 2020-09-30 | 369 | 319 | 50 |
| 2020-11-03 | 397 | 341 | 56 |
| 2020-11-06 | 165 | 120 | 45 |
| 2020-12-08 | 327 | 264 | 63 |
| 2020-12-15 | 561 | 491 | 70 |
| 2021-01-07 | 148 | 120 | 28 |
| 2021-01-12 | 179 | 160 | 19 |
| 2021-03-09 | 742 | 508 | 234 |
| 2021-03-11 | 72 | 71 | 1 |
| 2021-03-17 | 446 | 374 | 72 |
| 2021-06-30 | 208 | 195 | 13 |
| 2021-09-03 | 336 | 240 | 96 |
| 2021-09-23 | 209 | 208 | 1 |
| 2021-11-24 | 252 | 240 | 12 |
| 2022-01-14 | 345 | 286 | 59 |
| 2022-01-27 | 210 | 208 | 2 |
| 2022-03-11 | 336 | 300 | 36 |
| 2022-03-17 | 207 | 206 | 1 |
| 2022-03-29 | 360 | 315 | 45 |
| 2022-03-30 | 364 | 350 | 14 |
| 2022-04-08 | 368 | 304 | 64 |
| 2022-04-13 | 467 | 454 | 13 |
| 2022-04-21 | 220 | 219 | 1 |
| 2022-05-17 | 425 | 391 | 34 |
| 2022-05-19 | 198 | 197 | 1 |
| 2022-06-30 | 315 | 300 | 15 |
| 2022-07-21 | 219 | 207 | 12 |
| 2022-07-27 | 375 | 345 | 30 |
| 2022-08-11 | 209 | 207 | 2 |
| 2022-08-19 | 285 | 266 | 19 |
| 2022-09-02 | 352 | 330 | 22 |
| 2022-09-07 | 512 | 480 | 32 |
| 2022-09-08 | 391 | 345 | 46 |
| 2022-09-14 | 408 | 360 | 48 |
| 2022-09-21 | 360 | 336 | 24 |
| 2022-09-29 | 340 | 300 | 40 |
| 2022-10-05 | 218 | 217 | 1 |
| 2022-10-13 | 618 | 570 | 48 |
| 2022-10-20 | 384 | 360 | 24 |
| 2022-10-26 | 432 | 405 | 27 |
| 2022-11-10 | 368 | 345 | 23 |
| 2022-11-16 | 320 | 300 | 20 |
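For anyone wanting to reproduce a per-date comparison like this, here is a minimal sketch, assuming both record sets are already loaded as lists of dicts with an `assay_date` key. The function names and the toy records are hypothetical and are not the script used for this PR.

```python
from collections import Counter


def count_by_date(records, field="assay_date"):
    """Count how many records fall on each assay date."""
    return Counter(record[field] for record in records)


def diff_by_date(reference, candidate):
    """Return (date, reference_n, candidate_n, diff) for dates whose counts differ."""
    ref_counts = count_by_date(reference)
    cand_counts = count_by_date(candidate)
    rows = []
    for date in sorted(set(ref_counts) | set(cand_counts)):
        # Counter returns 0 for dates missing from one side.
        ref_n, cand_n = ref_counts[date], cand_counts[date]
        if ref_n != cand_n:
            rows.append((date, ref_n, cand_n, ref_n - cand_n))
    return rows


# Toy data standing in for the cdc_tdb download and the test_tdb upload.
cdc_tdb = [{"assay_date": d} for d in
           ["2019-09-03"] * 3 + ["2019-09-04"] * 2 + ["2019-09-05"]]
test_tdb = [{"assay_date": d} for d in
            ["2019-09-03"] * 2 + ["2019-09-04"] * 2]

for row in diff_by_date(cdc_tdb, test_tdb):
    print(row)
# ('2019-09-03', 3, 2, 1)
# ('2019-09-05', 1, 0, 1)
```

Dates that exist on only one side show up with a count of 0 on the other, which is how a row like 2019-11-20 in the table would surface.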

This makes me think that the CDC actually deleted these records in their data dump, but we just never deleted them from fauna. I'm comfortable with deleting the records and starting with a clean slate of CDC data from 2019-09-03, but would like to hear other opinions here! cc: @huddlej @rneher @trvrb