Ingest: get nucleotide accessions to ignore from extended metadata table

loculus-project / loculus

An open-source software package to power microbial genomic databases

https://loculus.org

GNU Affero General Public License v3.0


Ingest: get nucleotide accessions to ignore from extended metadata table #2832

Closed corneliusroemer closed 1 month ago

corneliusroemer commented 1 month ago

We need to find out which nucleotide accessions correspond to the sequences we deposited in ENA so that ingest can ignore them (otherwise we end up with an infinite loop).

This might require some changes to ENA submission so that we can look up the nucleotide accessions from the GCA assembly accessions.

For reference:

@anna-parker (this is a test)
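A minimal sketch of the filtering step proposed above, assuming the extended metadata table can be read as a list of row dictionaries. The column name `insdc_accessions`, the record key `accession`, and all function names are hypothetical and only for illustration; they are not taken from the Loculus codebase.

```python
def strip_version(accession):
    """Drop a trailing version suffix, e.g. 'AB123456.1' -> 'AB123456'."""
    return accession.split(".", 1)[0]


def load_ignored_accessions(metadata_rows):
    """Collect nucleotide accessions recorded for our own ENA depositions.

    Assumes (hypothetically) that each row stores them as a comma-separated
    string in an 'insdc_accessions' column; missing or empty values are skipped.
    """
    ignored = set()
    for row in metadata_rows:
        raw = row.get("insdc_accessions", "") or ""
        for accession in raw.split(","):
            accession = accession.strip()
            if accession:
                ignored.add(strip_version(accession))
    return ignored


def filter_ingest(records, ignored):
    """Keep only records whose accession is not one of our own depositions."""
    return [
        record
        for record in records
        if strip_version(record["accession"]) not in ignored
    ]
```

Comparing accessions without their version suffix is a design choice in this sketch: it means a later revision of a record we deposited ourselves would still be skipped.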

theosanderson commented 1 month ago

We plan to annotate our ENA depositions with PP metadata, right (the PP accession, as some kind of cross-reference)? Can't we use that?

corneliusroemer commented 1 month ago

It probably doesn't end up in the NCBI Virus export, so we can't do that easily, AFAICT.

@anna-parker

theosanderson commented 1 month ago

Hmm does this mean we could be failing to ingest some other quite important data? (Not a criticism - just for understanding, I guess I previously thought we were capturing 100% of data)

(and it's a genuine question - maybe this would be the only thing that we are not capturing)

theosanderson commented 1 month ago

Also, we shouldn't submit ingested sequences, so it wouldn't be an infinite loop (but yeah, it would not be good!)

corneliusroemer commented 1 month ago

> Hmm does this mean we could be failing to ingest some other quite important data? (Not a criticism - just for understanding, I guess I previously thought we were capturing 100% of data)
>
> (and it's a genuine question - maybe this would be the only thing that we are not capturing)

@theosanderson Yeah, we only ingest whatever shows up in the NCBI Virus output. A bunch of stuff doesn't make it through; it all depends on what NCBI Virus parses from the GenBank records.

See the separate issue I just made: #2834