Closed chaoran-chen closed 1 year ago
Closely related to https://github.com/GenSpectrum/servers/issues/5
The test dataset contains a column genbank_accession_rev, which is structured as <genbank accession>.<revision> (example: LR862399.1).
For pathoplexus, we need to further transform the data a bit. As described in https://github.com/orgs/pathoplexus/discussions/2#discussioncomment-6012375, I suggested having three columns:
- revision (integer): When updating, we don't actually change the old entry but add a new one with the same accession and an incremented revision number
- is_most_recent (boolean): this is set to true only for the most recent revision of each sample
- is_revoked (boolean): this is set to true if a sequence was revoked
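A minimal sketch of how these three columns could be derived from genbank_accession_rev, assuming every value has the form <genbank accession>.<revision> with an integer revision. The function name add_revision_columns and the extra accession column are illustrative assumptions, and the sample accessions other than LR862399.1 are made up.

```python
from collections import defaultdict


def add_revision_columns(rows, key="genbank_accession_rev"):
    """Return copies of rows with accession, revision, is_most_recent and is_revoked added.

    Assumes every value under `key` looks like "<accession>.<integer revision>".
    """
    parsed = []
    latest = defaultdict(int)  # accession -> highest revision seen so far
    for row in rows:
        # Split on the last dot: "LR862399.1" -> ("LR862399", "1")
        accession, _, revision = row[key].rpartition(".")
        revision = int(revision)
        latest[accession] = max(latest[accession], revision)
        parsed.append((row, accession, revision))

    return [
        {
            **row,
            "accession": accession,  # hypothetical helper column
            "revision": revision,
            "is_most_recent": revision == latest[accession],
            "is_revoked": False,  # placeholder: nothing is marked as revoked here
        }
        for row, accession, revision in parsed
    ]


# Illustrative input; only LR862399.1 appears in the issue, the rest is made up.
rows = [
    {"genbank_accession_rev": "LR862399.1"},
    {"genbank_accession_rev": "LR862399.2"},
    {"genbank_accession_rev": "MW123456.1"},
]
result = add_revision_columns(rows)
for r in result:
    print(r["accession"], r["revision"], r["is_most_recent"], r["is_revoked"])
```

The function makes two passes because whether a row is the most recent revision is only known once all revisions of that accession have been seen.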
@corneliusroemer, if it's very easy for you, could you maybe create an adapted metadata.tsv with these three additional columns? If you don't have time, it's also totally fine (you've already done a lot!)
Sure, this is quite easy.
Is revoked true for all non-latest?
What about RKI sequences? Do you ignore them?
No, let's set revoked to false for all the entries for now. (For testing purposes, we can then randomly set it to true for a few entries.)
It would be great to remove all RKI sequences for this test dataset and use genbank_accession_rev as the primary key.
@corneliusroemer, have you already had the time to work on this?
@JonasKellerer also noticed that in the file the header line occurs multiple times (I believe that he pinged you on Slack).
Briefly, what I noticed: the header occurs twice, once at the start and once in the middle of the file. After the second header, the file also contains more columns than before (clade_who and clade_nextstrain). Maybe this is linked to the different versions of the metadata that are used and the cat command in line 164 of the Snakefile?
If I could add one more wish: could missing data be represented as an empty string rather than as "?"?
Thanks in advance.
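As a stopgap on the consumer side, the two problems above could be worked around roughly like this. It is only a sketch under assumptions: a line is treated as a header when its first field equals the first column name, the function name clean_metadata is hypothetical, and the country column and sample values are made up (only clade_who, clade_nextstrain and LR862399.1 come from this thread).

```python
import csv
import io


def clean_metadata(lines, missing="?"):
    """Drop repeated header lines, align columns, and blank out missing values.

    Returns (columns, rows): the union of all header columns in first-seen
    order, and one dict per data row with every column present.
    """
    header = None
    columns = []
    raw_rows = []
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        # Assumption: a header line starts with the name of the first column.
        if header is None or fields[0] == header[0]:
            header = fields
            for name in fields:
                if name not in columns:
                    columns.append(name)
            continue
        # Interpret each data row using the most recent header above it.
        raw_rows.append(dict(zip(header, fields)))

    rows = []
    for raw in raw_rows:
        row = {}
        for name in columns:
            value = raw.get(name, "")  # column absent from an older header
            row[name] = "" if value == missing else value
        rows.append(row)
    return columns, rows


# Made-up sample mimicking a file with a second, wider header in the middle.
sample = (
    "genbank_accession_rev\tcountry\n"
    "LR862399.1\t?\n"
    "genbank_accession_rev\tcountry\tclade_who\tclade_nextstrain\n"
    "MW123456.1\tSwitzerland\t?\t20A\n"
)
columns, rows = clean_metadata(io.StringIO(sample))
print(columns)
for row in rows:
    print(row)
```

Rows that precede the wider header get empty strings for the columns they lack, so the output has one consistent set of columns.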
Now that we are working on the submission server, we will just use data from there. Closing this issue! (cc @corneliusroemer)
Depends on #2