Open joverlee521 opened 2 months ago
Thanks @j23414 for investigating the latest flat files 🙏
Jotting down notes for updating the flat file ingest:
There is a single `homologous titre` column, so our ingest needs to create a row per reference strain to capture these homologous titers.
Oh, there's a separate file for the reference panel results. Each flat file has a matching `*_reference_panel.csv` file that includes the references' homologous titers.
The `*_reference_panel.csv` has a subset of the columns used in the main `*_flat_file.csv` and it only includes the shortened name of the `antisera`. So the `antisera` -> `reference name` mapping from the `_flat_file.csv` will need to be preserved to be used for the processing of the matching `_reference_panel.csv` file.
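A minimal sketch of preserving that mapping, assuming the flat file exposes the short name in an `antisera` column and the full name in a `reference antigen` column (the thread does not confirm the exact header for the full name). The function names and sample values are hypothetical and not part of `tdb/vidrl_upload.py`.

```python
def build_antisera_map(flat_rows):
    """Map each short antisera name to its full reference name.

    `flat_rows` is an iterable of dicts, e.g. from csv.DictReader.
    The column names used here are assumptions.
    """
    mapping = {}
    for row in flat_rows:
        short, full = row["antisera"], row["reference antigen"]
        # Fail loudly if one abbreviation points at two different references.
        if mapping.setdefault(short, full) != full:
            raise ValueError(f"Conflicting reference names for antisera {short!r}")
    return mapping


def add_reference_names(panel_rows, mapping):
    """Return copies of the reference panel rows with the full name filled in."""
    return [
        {**row, "reference antigen": mapping[row["antisera"]]}
        for row in panel_rows
    ]


# Made-up rows standing in for the two CSV files.
flat_rows = [
    {"antisera": "Ex/1", "reference antigen": "A/Example/1/2020"},
    {"antisera": "Ex/2", "reference antigen": "A/Example/2/2021"},
    {"antisera": "Ex/1", "reference antigen": "A/Example/1/2020"},
]
panel_rows = [{"antisera": "Ex/2", "titre": "320"}]

antisera_map = build_antisera_map(flat_rows)
named_panel_rows = add_reference_names(panel_rows, antisera_map)
```

Raising on a conflicting abbreviation is a deliberate choice here: a silent overwrite would attach reference panel titers to the wrong strain.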
@joverlee521 I think we originally asked for the reference panel file and Sheena made it for us. Then later Sheena modified her script that produces the flat files to pull in the relevant information from the reference panel file, so we didn't have to parse that reference information separately.
Is there anything in the reference panel file that we can't get from the flat file by parsing the unique homologous titers like you mentioned above?
We could jump on a huddle tomorrow to chat, if that's helpful. It's been a little while since I looked at these files, too...
> Is there anything in the reference panel file that we can't get from the flat file by parsing the unique homologous titers like you mentioned above?
Yeah, looking at the `*_flat_file.csv` more closely, they are completely missing the reference titer measurements. They only include the results for test virus x reference virus, but do not include any of the reference virus x reference virus results.
Got it. I can't see the latest files any more (curse OneDrive!), but in the last view I had of those files, they included columns for `reference antigen`, `reference passage`, and `homologous titre`, which would represent most of the reference titer measurements we need, but maybe it isn't enough.
To get those homologous reference values into our standard format, we would need to make new records for each unique combination of antigen, passage, and titer, with the `test virus` value equal to the `reference antigen`, `test virus passage` equal to `reference passage`, and `titre` equal to `homologous titre`. We would be missing the `antisera` and `ferret` columns, though. We don't need `antisera` when it is just an abbreviation of the reference virus name, but we probably want `ferret`. That supports the case for parsing the separate reference panel file, if that file has that information.
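For reference, the flat-file-only approach described above could look roughly like this. It is a sketch that assumes rows arrive as dicts keyed by the column names mentioned in this thread; `homologous_records` is a hypothetical helper, and the sample values are made up.

```python
def homologous_records(flat_rows):
    """Build one record per unique (antigen, passage, homologous titre)."""
    seen = set()
    records = []
    for row in flat_rows:
        key = (
            row["reference antigen"],
            row["reference passage"],
            row["homologous titre"],
        )
        if key in seen:
            continue  # same homologous measurement repeated on another test-virus row
        seen.add(key)
        antigen, passage, titre = key
        records.append({
            "test virus": antigen,          # reference virus tested against itself
            "test virus passage": passage,
            "reference antigen": antigen,
            "reference passage": passage,
            "titre": titre,
            "ferret": None,                 # not recoverable from the flat file
        })
    return records


# Two test viruses measured against the same reference yield one record.
sample_rows = [
    {"test virus": "A/Test/10/2022", "reference antigen": "A/Example/1/2020",
     "reference passage": "E3", "homologous titre": "640"},
    {"test virus": "A/Test/11/2022", "reference antigen": "A/Example/1/2020",
     "reference passage": "E3", "homologous titre": "640"},
]
derived = homologous_records(sample_rows)
```

The `ferret` field stays empty in this version, which is the gap that parsing the reference panel file would close.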
We chatted about this today and decided that we do need to ingest the additional `reference_panel.csv`. This will ensure our ingest of the flat files includes all the same measurements as the previous Excel files.
I'll update tdb/vidrl_upload.py to work with the new flat files and test on a couple Excel/flat file pairs to get a diff of the two paths.
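A rough sketch of how that diff could work, assuming both paths can emit their measurements as a list of dicts. `diff_measurements` and the key fields below are hypothetical and do not reflect the current `tdb/vidrl_upload.py` interface.

```python
def diff_measurements(excel_records, flat_records, key_fields):
    """Return (only_in_excel, only_in_flat) as sorted lists of key tuples."""
    def keyed(records):
        # Normalise to stripped strings so "640" and 640 compare equal.
        return {
            tuple(str(record.get(field, "")).strip() for field in key_fields)
            for record in records
        }

    excel_keys = keyed(excel_records)
    flat_keys = keyed(flat_records)
    return sorted(excel_keys - flat_keys), sorted(flat_keys - excel_keys)


# Made-up measurements: the Excel path has one reference x reference row
# that the flat-file path lacks.
key_fields = ["test virus", "reference antigen", "titre"]
excel_records = [
    {"test virus": "A/Test/10/2022", "reference antigen": "A/Example/1/2020", "titre": 160},
    {"test virus": "A/Example/1/2020", "reference antigen": "A/Example/1/2020", "titre": 640},
]
flat_records = [
    {"test virus": "A/Test/10/2022", "reference antigen": "A/Example/1/2020", "titre": "160"},
]
only_in_excel, only_in_flat = diff_measurements(excel_records, flat_records, key_fields)
```

An empty result on both sides for each Excel/flat file pair would suggest the two paths capture the same measurements.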
Brought up by @huddlej on Slack that OneDrive includes flat files. Ingesting the flat files should make the rest of #158 easier?
Revisit changes made in https://github.com/nextstrain/fauna/pull/103 and update it to work with the latest version of the flat files.