loculus-project / loculus

An open-source software package to power microbial genomic databases
https://loculus.org
GNU Affero General Public License v3.0

Add "earliest release date" #2894

Open theosanderson opened 1 month ago

theosanderson commented 1 month ago

Currently on Pathoplexus we default to showing an "NCBI release date" field. This is good because, if we used the Pathoplexus release date field instead, all sequences ingested from INSDC would have the same release date, which wouldn't be useful. But it's bad because the field is undefined for sequences submitted directly to Pathoplexus. IMO we should create a consensus field which is the NCBI release date if set (which will generally also be the Pathoplexus release date for items submitted to INSDC after launch) and otherwise the Pathoplexus release date. I'm raising this in Loculus as the code implementation would be here.
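A minimal sketch of the fallback rule described above, assuming both dates arrive as optional ISO-formatted strings. The function and parameter names are hypothetical and not taken from the Loculus codebase.

```python
from typing import Optional


def consensus_release_date(
    ncbi_release_date: Optional[str],
    pathoplexus_release_date: Optional[str],
) -> Optional[str]:
    """Return the NCBI release date if set, otherwise the Pathoplexus one.

    Hypothetical helper: None and empty strings both count as "not set".
    """
    if ncbi_release_date:
        return ncbi_release_date
    # Normalise an empty string to None so callers see one "missing" value.
    return pathoplexus_release_date or None
```

With this rule, a sequence ingested from INSDC keeps its NCBI date, while a sequence submitted directly to Pathoplexus falls back to the Pathoplexus release date instead of showing an empty field.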

theosanderson commented 1 week ago

We would like this field to be the earliest date of:

chaoran-chen commented 1 week ago

Do you have an idea how we can implement this? Loculus doesn't know about an INSDC release date and the pipeline cannot possibly know about the Pathoplexus/Loculus release date for the first version as it runs before a sequence is released.

theosanderson commented 1 week ago

I think this would need to be a built-in field, like the Loculus release date itself. So I guess it would be computed in get-released-data by taking the earliest of these fields. Loculus does know the INSDC release date for ingested sequences, which is the only relevant case for this issue.

chaoran-chen commented 4 days ago

What do you think about the following idea:

We implement a feature that allows an admin to (optionally) specify a script or Docker image that will be called by the silo_import_job.sh before starting the SILO preprocessing. For Pathoplexus:

  1. We add a metadata field for earliest release date which the preprocessing pipeline leaves empty.
  2. We add a script between /get-released-data and SILO preprocessing that modifies the data file and computes the date.

It's not the most performant/optimized solution but highly flexible.
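A rough sketch of what such an intermediate script could look like. It rests on several assumptions: that the data file is newline-delimited JSON with one record per line, that each record carries a "metadata" object, and that the field names ncbiReleaseDate, releasedDate and earliestReleaseDate are used. All of these are illustrative and would need to be checked against the real /get-released-data output and the Pathoplexus configuration.

```python
import json
from datetime import date
from typing import IO, Iterable, Optional

# Assumed field names, for illustration only.
CANDIDATE_FIELDS = ("ncbiReleaseDate", "releasedDate")
TARGET_FIELD = "earliestReleaseDate"


def earliest_date(values: Iterable[Optional[str]]) -> Optional[str]:
    """Return the earliest parseable ISO date (YYYY-MM-DD), or None if none."""
    parsed = []
    for value in values:
        if not value:
            continue  # skip None and empty strings
        try:
            parsed.append(date.fromisoformat(value))
        except (TypeError, ValueError):
            continue  # skip values that are not ISO date strings
    return min(parsed).isoformat() if parsed else None


def add_earliest_release_date(record: dict) -> dict:
    """Fill the target metadata field from the candidate date fields."""
    metadata = record.get("metadata", {})
    metadata[TARGET_FIELD] = earliest_date(
        metadata.get(field) for field in CANDIDATE_FIELDS
    )
    record["metadata"] = metadata
    return record


def process(infile: IO[str], outfile: IO[str]) -> None:
    """Stream records line by line so the whole file is never held in memory."""
    for line in infile:
        if line.strip():
            record = add_earliest_release_date(json.loads(line))
            outfile.write(json.dumps(record) + "\n")
```

A wrapper could call process(sys.stdin, sys.stdout) so that the import job pipes the released data through the script before handing it to SILO preprocessing. How silo_import_job.sh would actually invoke it is left open here.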