Closed jeffkaufman closed 7 months ago
We won't use all of these: some are reads that Kraken can correctly tell are human etc., which Bowtie (with the HV-only DB we're using) has no insight into, and others have such low alignment scores that they may as well be junk (but we haven't yet decided which score to use as a cutoff).
@jeffkaufman can you say more about the part where Bowtie has no insight into some reads belonging to human-infecting viruses? What's the underlying reason for this?
Also let me know if you have time to explain the streaming approach, would be happy to better understand this.
Sure! Why don't you read https://www.jefftk.com/p/process-substitution-without-shell and then if you still have questions we can talk?
Some HV reads are not identified by Kraken, so cast a wider net by running Bowtie across all cleaned reads.
We won't use all of these: some are reads that Kraken can correctly tell are human etc., which Bowtie (with the HV-only DB we're using) has no insight into, and others have such low alignment scores that they may as well be junk (but we haven't yet decided which score to use as a cutoff). They're small enough, though, that there's no harm in keeping them.
Note that this uses the newer streaming pattern where we don't write any large files to disk (so we can run more things at once).
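For readers unfamiliar with the streaming pattern, here is a minimal sketch of the general idea, not the code in this PR: read a child process's stdout line by line through a pipe instead of having it write a large intermediate file. The helper name `stream_lines` and the stand-in command are hypothetical; in the real pipeline the command would be the aligner invocation.

```python
import subprocess
import sys


def stream_lines(cmd):
    """Yield stdout lines from cmd as they are produced, with no temp file."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    try:
        for line in proc.stdout:
            yield line.rstrip("\n")
    finally:
        # Always release the pipe and reap the child, even if the caller
        # stops iterating early.
        proc.stdout.close()
        returncode = proc.wait()
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, cmd)


if __name__ == "__main__":
    # Stand-in child process (an assumption for the demo); the pipeline
    # would pass the aligner command here instead.
    demo_cmd = [sys.executable, "-c", "print('read1'); print('read2')"]
    for line in stream_lines(demo_cmd):
        print(line)
```

Because only one line is held in memory at a time, disk usage does not grow with the size of the aligner output, which is what makes it practical to run several copies in parallel.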
Instead of using `pysam` I now manually parse the `.sam` file so I can run Bowtie with `--no-sq`. The useless `@SQ` header lines make the `.sam` files enormous, enough that I'm nervous about running many copies of this at once.
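As a rough illustration of what manual parsing can look like (a sketch under stated assumptions, not the parser in this PR): SAM alignment lines are tab-separated with eleven mandatory fields followed by optional `TAG:TYPE:VALUE` fields, header lines start with `@`, and Bowtie 2 reports the alignment score in the `AS:i:` tag. The names `SamRecord` and `parse_sam` are hypothetical, and skipping unmapped reads is an assumption about what the caller wants.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class SamRecord(NamedTuple):
    read_id: str
    flag: int
    ref_name: str
    pos: int
    alignment_score: Optional[int]


def parse_sam(lines: Iterable[str]) -> Iterator[SamRecord]:
    """Yield one record per mapped alignment line, skipping headers."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("@"):
            continue  # header line (@HD, @PG, and @SQ if present)
        fields = line.split("\t")
        flag = int(fields[1])
        if flag & 0x4:
            continue  # assumption: unmapped reads are not of interest
        score = None
        for tag in fields[11:]:  # optional fields follow the 11 mandatory ones
            if tag.startswith("AS:i:"):
                score = int(tag[len("AS:i:"):])
                break
        yield SamRecord(fields[0], flag, fields[2], int(fields[3]), score)


if __name__ == "__main__":
    # Made-up example lines; the reference name is only a placeholder.
    sample = [
        "@PG\tID:bowtie2",
        "r1\t0\tNC_001802.1\t100\t42\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tAS:i:-3",
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\tYT:Z:UU",
    ]
    for record in parse_sam(sample):
        print(record)
```

Since the parser takes any iterable of lines, it can consume the aligner's stdout directly, so the `.sam` output never needs to exist on disk in full.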