Bowtie: mask low-complexity sequences

If you run the previous version of this script and filter to places where it disagrees with Kraken, it's finding tons of reads like:

>SRR14530891.2510 2510/2
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
>SRR14530891.754706 754706/2
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGACCGGGGGGGGCGGTGTAGAGTAAAAAGGCAGGAGAGGAGAGAAACAAATAAGAAAAAATAGGGTG
>SRR14530891.1990101 1990101/2
CTGCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCGCGCCGCGCCGCGCCGCGGGGGCCCGCCGCGCCCCCCGCGGGGCCGGCCGCCCGCCC
>SRR14530891.4392518 4392518/2
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
>SRR14530891.4780448 4780448/2
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

They're near-perfect matches, but they're not really from the genomes Bowtie is mapping them to. The root of the problem is that our Bowtie DB only has human-viral genomes and so isn't able to identify when a read has a pattern that appears in many non-human-viral genomes. Fully fixing this isn't within Bowtie's scope, but we can make a large dent in the problem by ignoring low-complexity sequences since those are much more likely to appear across genomes in a not very meaningful way.

This is what Kraken's DB building does, in mask_low_complexity.sh. Duplicate that approach and run the genomes through NCBI dustmasker before feeding them to bowtie2-build.

This masks 3.3% of bases:

$ cat masked_genomes.fna | grep -v '^>' |  tr -cd ACTG  | wc -c
1598530039
$ cat masked_genomes.fna | grep -v '^>' | tr -cd x | wc -c
52391459

>>> print("%.1f%%" % (52391459 / 1598530039 * 100))
3.3%

Testing on one Rothman HTP sample, SRR14530891, this removes 28% of alignments.

naobservatory / mgs-pipeline

Bowtie: mask low-complexity sequences #46