Revisit ingestion of VIDRL flat files

nextstrain / fauna

RethinkDB database to support real-time virus analysis

GNU Affero General Public License v3.0

Revisit ingestion of VIDRL flat files #161

Open joverlee521 opened 2 months ago

joverlee521 commented 2 months ago

Brought up by @huddlej on Slack that OneDrive includes flat files. Ingesting the flat files should make the rest of #158 easier?

Revisit the changes made in https://github.com/nextstrain/fauna/pull/103 and update them to work with the latest version of the flat files.

joverlee521 commented 1 month ago

Thanks @j23414 for investigating the latest flat files 🙏

Jotting down notes for updating the flat file ingest:

the vidrl_flat_file_column_map.tsv will definitely need to be updated
there is a single homologous titre column so our ingest needs to create a row per reference strain to capture these homologous titers
the reference strains use the full strain name, so we would no longer need the serum mapping 🎉
human sera pools include the reference strain so we would no longer need to keep track of the vaccine mapping 🎉

joverlee521 commented 1 month ago

there is a single homologous titre column so our ingest needs to create a row per reference strain to capture these homologous titers

Oh, there's a separate file for the reference panel results. Each flat file has a matching *_reference_panel.csv file that includes the references' homologous titers.

joverlee521 commented 1 month ago

The *_reference_panel.csv has a subset of the columns used in the main *_flat_file.csv, and it only includes the shortened name of the antisera. So the antisera -> reference name mapping from the *_flat_file.csv will need to be preserved and reused when processing the matching *_reference_panel.csv file.
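A minimal sketch of carrying the antisera -> reference name mapping from the flat file over to the reference panel file. The column names (`antisera`, `reference_antigen`, `test_virus`, `titre`) and the file contents are assumptions made up for illustration; the real headers would come from vidrl_flat_file_column_map.tsv, and this is not the code in tdb/vidrl_upload.py.

```python
import csv
import io


def build_antisera_map(flat_file_rows):
    """Map each shortened antisera name to the full reference strain name."""
    mapping = {}
    for row in flat_file_rows:
        short_name = row["antisera"].strip()
        full_name = row["reference_antigen"].strip()
        # setdefault returns the stored name, so a conflicting name is easy to spot.
        if mapping.setdefault(short_name, full_name) != full_name:
            raise ValueError(f"{short_name!r} maps to more than one reference strain")
    return mapping


def add_reference_names(panel_rows, mapping):
    """Attach the full reference strain name to each reference panel row."""
    resolved = []
    for row in panel_rows:
        short_name = row["antisera"].strip()
        if short_name not in mapping:
            raise KeyError(f"no reference strain known for antisera {short_name!r}")
        resolved.append({**row, "reference_antigen": mapping[short_name]})
    return resolved


# Made-up contents standing in for a *_flat_file.csv / *_reference_panel.csv pair.
FLAT_FILE = """antisera,reference_antigen,test_virus,titre
REF1,A/Placeholder/1/2000,A/Test/10/2000,160
REF2,A/Placeholder/2/2000,A/Test/10/2000,80
"""
PANEL_FILE = """antisera,test_virus,titre
REF1,A/Placeholder/1/2000,320
REF2,A/Placeholder/1/2000,40
"""

mapping = build_antisera_map(csv.DictReader(io.StringIO(FLAT_FILE)))
panel = add_reference_names(csv.DictReader(io.StringIO(PANEL_FILE)), mapping)
```

Failing loudly on an unknown or ambiguous antisera name is a design choice for the sketch: a silent fallback would hide exactly the mismatches between the two files that the ingest needs to catch.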

huddlej commented 1 month ago

@joverlee521 I think we originally asked for the reference panel file and Sheena made it for us. Then later Sheena modified her script that produces the flat files to pull in the relevant information from the reference panel file, so we didn't have to parse that reference information separately.

Is there anything in the reference panel file that we can't get from the flat file by parsing the unique homologous titers like you mentioned above?

We could jump on a huddle tomorrow to chat, if that's helpful. It's been a little while since I looked at these files, too...

joverlee521 commented 1 month ago

Is there anything in the reference panel file that we can't get from the flat file by parsing the unique homologous titers like you mentioned above?

Yeah, looking at the *_flat_file.csv more closely, they are completely missing the reference titer measurements. They only include the results for test virus x reference virus, but do not include any of the reference virus x reference virus results.

huddlej commented 1 month ago

Got it. I can't see the latest files any more (curse OneDrive!), but in the last view I had of those files, they included columns for reference antigen, reference passage, and homologous titre, which would represent most of the reference titer measurements we need, but maybe that isn't enough.

To get those homologous reference values into our standard format, we would need to make new records for each unique combination of reference antigen, reference passage, and homologous titre, with the test virus equal to the reference antigen, the test virus passage equal to the reference passage, and the titre equal to the homologous titre. We would be missing the antisera and ferret columns, though. We don't need antisera when it is just an abbreviation of the reference virus name, but we probably want ferret. That supports the case for parsing the separate reference panel file, if that file has that information.
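The record construction described above could be sketched roughly as follows. The field names (`reference_antigen`, `reference_passage`, `homologous_titre`, `test_virus`, `test_virus_passage`, `titre`, `ferret`) are hypothetical stand-ins for the real flat file headers, and the ferret value is left empty because the flat file is not expected to carry it.

```python
def homologous_records(flat_file_rows):
    """Build one record per unique (antigen, passage, homologous titre) combination."""
    seen = set()
    records = []
    for row in flat_file_rows:
        key = (
            row["reference_antigen"],
            row["reference_passage"],
            row["homologous_titre"],
        )
        # Skip rows without a homologous titre and combinations already emitted.
        if not row["homologous_titre"] or key in seen:
            continue
        seen.add(key)
        antigen, passage, titre = key
        records.append(
            {
                "test_virus": antigen,          # test virus equals the reference antigen
                "test_virus_passage": passage,  # test passage equals the reference passage
                "reference_antigen": antigen,
                "reference_passage": passage,
                "titre": titre,                 # titre equals the homologous titre
                "ferret": None,                 # assumed unavailable in the flat file
            }
        )
    return records
```

The `ferret: None` placeholder makes the gap explicit, which is the part that would have to be filled from the separate reference panel file.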

joverlee521 commented 1 month ago

We chatted about this today and decided that we do need to ingest the additional *_reference_panel.csv. This will ensure our ingest of the flat files includes all the same measurements as the previous Excel files.

I'll update tdb/vidrl_upload.py to work with the new flat files and test on a couple of Excel/flat file pairs to get a diff of the two paths.
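One possible shape for that diff, assuming both paths can be reduced to lists of plain dictionaries with the same fields. The field names in `KEY_FIELDS` and the function names are hypothetical; this does not reflect how tdb/vidrl_upload.py represents measurements.

```python
# Hypothetical fields that together identify one titer measurement.
KEY_FIELDS = ("test_virus", "reference_antigen", "ferret", "titre")


def measurement_key(record):
    """Reduce a record to a comparable tuple of strings."""
    return tuple(str(record.get(field, "")) for field in KEY_FIELDS)


def diff_measurements(excel_records, flat_records):
    """Report measurements that appear in only one of the two ingest paths."""
    excel_keys = {measurement_key(r) for r in excel_records}
    flat_keys = {measurement_key(r) for r in flat_records}
    return {
        "only_in_excel": sorted(excel_keys - flat_keys),
        "only_in_flat": sorted(flat_keys - excel_keys),
    }
```

An empty `only_in_excel` list for a given Excel/flat file pair would suggest the flat file ingest captured every measurement the Excel ingest did, under the assumption that the chosen key fields are enough to tell measurements apart.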