loculus-project / loculus

An open-source software package to power microbial genomic databases
https://loculus.org
GNU Affero General Public License v3.0

Include extra metadata from GenBank that is in the full GenBank files but not (yet) parsed and emitted by NCBI Virus #2834

Open corneliusroemer opened 1 day ago

corneliusroemer commented 1 day ago

@theosanderson raised a good point:

Hmm does this mean we could be failing to ingest some other quite important data? (Not a criticism - just for understanding, I guess I previously thought we were capturing 100% of data) (and it's a genuine question - maybe this would be the only thing that we are not capturing)

in #2832

Ingest currently only knows about the fields that NCBI Virus emits. Individual GenBank files sometimes contain more data that NCBI Virus does not (yet) parse.

We could manually request those individual files in ingest and parse out the extra metadata. This is not urgent but would be nice to have, depending on how much extra metadata we could pull in this way.
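To make the idea concrete, here is a minimal sketch of what requesting and parsing an individual file could look like. It is not the existing ingest code. It assumes the records are fetched as GenBank flat files through the NCBI E-utilities `efetch` endpoint, and the names `fetch_genbank_record` and `parse_extra_metadata` are hypothetical. It only reads `AUTHORS` lines and `key :: value` pairs from structured comments, and it ignores structured-comment values that wrap onto a second line.

```python
import urllib.parse
import urllib.request

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fetch_genbank_record(accession: str) -> str:
    """Download one GenBank flat file as text via NCBI E-utilities efetch."""
    query = urllib.parse.urlencode(
        {"db": "nuccore", "id": accession, "rettype": "gb", "retmode": "text"}
    )
    with urllib.request.urlopen(f"{EFETCH_URL}?{query}", timeout=30) as response:
        return response.read().decode("utf-8")


def parse_extra_metadata(genbank_text: str) -> dict:
    """Hypothetical helper: collect author lists and structured-comment fields."""
    structured = {}
    authors = []
    current_keyword = None  # keyword or sub-keyword of the block being read
    for line in genbank_text.splitlines():
        # Everything of interest sits in the header, before the feature table.
        if line.startswith(("FEATURES", "ORIGIN")):
            break
        # Flat-file keywords occupy the first 12 columns; values follow.
        keyword = line[:12].strip()
        value = line[12:].strip()
        if keyword:
            current_keyword = keyword
            if keyword == "AUTHORS":
                authors.append(value)
        elif current_keyword == "AUTHORS" and value:
            # Continuation line of a wrapped author list.
            authors[-1] += " " + value
        if current_keyword == "COMMENT" and "::" in value and not value.startswith("##"):
            key, _, val = value.partition("::")
            structured[key.strip()] = val.strip()
    return {"structured_comment": structured, "authors": authors}
```

Calling `parse_extra_metadata(fetch_genbank_record(accession))` for one accession would return a dictionary with an `authors` list (one string per `REFERENCE`) and a `structured_comment` mapping, from which a field such as `Sequencing Technology` can be looked up when the submitter provided it.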

chaoran-chen commented 1 day ago

Agree that it would be amazing to have. In particular, sequencing technologies and full author names would be cool! But I also agree that it's not urgent.