Thinking through behavior regarding transgenic and other problematic sequences

There are transgenic viruses in Genbank that have human virus taxids. For example,

https://www.ncbi.nlm.nih.gov/nuccore/MK733609.1 - "Human gammaherpesvirus 8 transgenic clone BAC16, complete genome"
https://www.ncbi.nlm.nih.gov/nuccore/MH144178.1 - "Measles virus genotype A transgenic strain vac2(GFP)H, complete genome"

These genomes have GFP in them, and @jeffkaufman found that blasting transgenic sequences from something with GFP was giving hits to these viruses.

My understanding is that a way the Kraken2 team tries to deal with this is having vector sequences in their database, so that in cases like this, the reads would not get assigned, or maybe get assigned to 'root' (not sure), because they'd match the transgenic viruses with the virus taxid but also the vectors in the vector category.

I think the current HV workflow, if kraken gave these sequences an 'unassigned' result and these transgenic genbank genomes were in the bowtie db, would by design assign them to the HV taxid. I'm wondering, is that logic correct, and if so, how might we best handle it?

A more general issue is that, by using a bowtie database that does not include vectors and contaminant genomes (like the human genome), we might some of the lose the competitive-mapping behavior that kraken with the standard kraken database provides (https://journals.asm.org/doi/full/10.1128/mbio.01607-23). I think how true this is depends on how sequences that map to both 1) a Genbank genome with an HV taxid and 2) human, bacteria, or vector sequence, end up getting assigned by the Kraken2 step. If these sequences get a Kraken assignment to something that isn't in the tax path of the HV, then these reads should be correctly filtered out. But if they get unassigned by Kraken due to being ambiguous across domains, then they would not get filtered and could still get an HV assignment.

Regarding transgenic sequences, this is handled (somewhat clunkily, but effectively in cases like the ones you list) by FILTER_GENOME_FASTA and the built-in list of exclusion patterns, which includes patterns that match the headers of most transgenic sequences.

But if a sequence made it through that filter, then this...

I think the current HV workflow, if kraken gave these sequences an 'unassigned' result and these transgenic genbank genomes were in the bowtie db, would by design assign them to the HV taxid. I'm wondering, is that logic correct, and if so, how might we best handle it?

...would be correct, yes.

A more general issue is that, by using a bowtie database that does not include vectors and contaminant genomes (like the human genome), we might some of the lose the competitive-mapping behavior that kraken with the standard kraken database provides (https://journals.asm.org/doi/full/10.1128/mbio.01607-23). I think how true this is depends on how sequences that map to both 1) a Genbank genome with an HV taxid and 2) human, bacteria, or vector sequence, end up getting assigned by the Kraken2 step. If these sequences get a Kraken assignment to something that isn't in the tax path of the HV, then these reads should be correctly filtered out. But if they get unassigned by Kraken due to being ambiguous across domains, then they would not get filtered and could still get an HV assignment.

This is also correct.

naobservatory / mgs-workflow

Thinking through behavior regarding transgenic and other problematic sequences #77