loculus-project / loculus

An open-source software package to power microbial genomic databases
https://loculus.org
GNU Affero General Public License v3.0

Set up LAPIS with data with revisions #4

Closed chaoran-chen closed 1 year ago

chaoran-chen commented 1 year ago

Depends on #2

fengelniederhammer commented 1 year ago

Closely related to https://github.com/GenSpectrum/servers/issues/5

chaoran-chen commented 1 year ago

The test dataset contains a column genbank_accession_rev which is structured as <genbank accession>.<revision> (example: LR862399.1).

For pathoplexus, we need to further transform the data a bit. As described in https://github.com/orgs/pathoplexus/discussions/2#discussioncomment-6012375, I suggested having three columns:

  • revision (integer): When updating, we don't actually change the old entry but add a new one with the same accession and an incremented revision number
  • is_most_recent (boolean): this is set to true only for the most recent revision of each sample
  • is_revoked (boolean): this is set to true if a sequence was revoked
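
As a rough illustration of the three columns above, here is a minimal sketch that derives them from genbank_accession_rev. The function name add_revision_columns and the second example value (LR862399.2) are made up for illustration; the sketch also assumes every value has the form <genbank accession>.<revision> and raises a ValueError otherwise.

```python
def add_revision_columns(rows):
    """Return copies of the rows with revision, is_most_recent and is_revoked added.

    Assumption: each row is a dict holding a "genbank_accession_rev" value
    shaped like "<accession>.<revision>", e.g. "LR862399.1".
    """
    parsed = []
    latest = {}  # accession -> highest revision number seen so far
    for row in rows:
        # Split on the last dot: "LR862399.1" -> ("LR862399", ".", "1")
        accession, _, rev = row["genbank_accession_rev"].rpartition(".")
        revision = int(rev)  # raises ValueError if the revision part is not an integer
        parsed.append((row, accession, revision))
        latest[accession] = max(latest.get(accession, revision), revision)

    result = []
    for row, accession, revision in parsed:
        result.append({
            **row,
            "revision": revision,
            "is_most_recent": revision == latest[accession],
            "is_revoked": False,  # nothing is marked as revoked by default
        })
    return result


# Illustrative input; "LR862399.2" is a hypothetical second revision.
example = add_revision_columns([
    {"genbank_accession_rev": "LR862399.1"},
    {"genbank_accession_rev": "LR862399.2"},
])
```

In this example, only the row with revision 2 ends up with is_most_recent set to true; the older row is kept unchanged apart from the added columns.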

@corneliusroemer, if it's very easy for you, could you maybe create an adapted metadata.tsv with these three additional columns? If you don't have time, it's also totally fine (you've already done a lot!)

corneliusroemer commented 1 year ago

Sure, this is quite easy.

Should is_revoked be true for all non-latest revisions?

What about RKI sequences? Do you ignore them?

chaoran-chen commented 1 year ago

No, let's set revoked to false for all the entries for now. (For testing purposes, we can then randomly set it to true for a few entries.)

It would be great to remove all RKI sequences for this test dataset and use genbank_accession_rev as the primary key.
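
A possible sketch of this preparation step follows. The thread does not say how RKI sequences are marked in the file, so the sketch assumes, purely for illustration, that they can be recognised by an empty or "?" genbank_accession_rev. The function name prepare_test_rows and the n_revoked and seed parameters are hypothetical as well.

```python
import random


def prepare_test_rows(rows, n_revoked=0, seed=0):
    """Filter rows, check the primary key, and set is_revoked.

    Assumption: rows to exclude (e.g. RKI sequences) have no usable
    genbank_accession_rev value. This criterion is a guess, not from the thread.
    """
    kept = [
        dict(row)
        for row in rows
        if row.get("genbank_accession_rev", "") not in ("", "?")
    ]

    # A primary key has to be unique across all remaining rows.
    keys = [row["genbank_accession_rev"] for row in kept]
    if len(keys) != len(set(keys)):
        raise ValueError("genbank_accession_rev is not unique")

    # Default: nothing is revoked.
    for row in kept:
        row["is_revoked"] = False

    # For testing, mark a few randomly chosen entries as revoked.
    rng = random.Random(seed)  # fixed seed keeps the test data reproducible
    for row in rng.sample(kept, min(n_revoked, len(kept))):
        row["is_revoked"] = True
    return kept
```

Using a seeded random generator means the same entries are revoked on every run, which keeps downstream tests stable.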

chaoran-chen commented 1 year ago

@corneliusroemer, have you already had the time to work on this?

@JonasKellerer also noticed that in the file the header line occurs multiple times (I believe that he pinged you on Slack).

JonasKellerer commented 1 year ago

In short, what I noticed: the header occurs twice, once at the start and once in the middle of the file. After the second header, the file also contains more columns than before (clade_who and clade_nextstrain). Maybe this is linked to the different versions of the metadata that are used and the cat command in line 164 of the Snakefile?

If I may add one more wish: could missing data be represented as an empty string rather than as a "?"?

Thanks in advance.

chaoran-chen commented 1 year ago

Now that we are working on the submission server, we will just use data from there. Closing this issue! (cc @corneliusroemer)