They're near-perfect matches, but they're not really from the genomes Bowtie is mapping them to. The root of the problem is that our Bowtie DB only has human-viral genomes and so isn't able to identify when a read has a pattern that appears in many non-human-viral genomes. Fully fixing this isn't within Bowtie's scope, but we can make a large dent in the problem by ignoring low-complexity sequences since those are much more likely to appear across genomes in a not very meaningful way.
This is what Kraken's DB building does, in mask_low_complexity.sh. Duplicate that approach and run the genomes through NCBI dustmasker before feeding them to bowtie2-build.
If you run the previous version of this script and filter to places where it disagrees with Kraken, it's finding tons of reads like:
They're near-perfect matches, but they're not really from the genomes Bowtie is mapping them to. The root of the problem is that our Bowtie DB only has human-viral genomes and so isn't able to identify when a read has a pattern that appears in many non-human-viral genomes. Fully fixing this isn't within Bowtie's scope, but we can make a large dent in the problem by ignoring low-complexity sequences since those are much more likely to appear across genomes in a not very meaningful way.
This is what Kraken's DB building does, in
mask_low_complexity.sh
. Duplicate that approach and run the genomes through NCBI dustmasker before feeding them to bowtie2-build.This masks 3.3% of bases:
Testing on one Rothman HTP sample, SRR14530891, this removes 28% of alignments.