naobservatory / mgs-pipeline

MIT License

alignments: align all reads, not just hvreads #47

jeffkaufman closed 7 months ago

jeffkaufman commented 7 months ago

Some HV reads are not identified by Kraken, so cast a wider net by running Bowtie across all cleaned reads.

We won't use all of these: some are reads that Kraken can correctly identify as human etc., which Bowtie (with the HV-only DB we're using) has no insight into, and others have such low alignment scores that they may as well be junk (but we haven't yet decided which score to use as a cutoff). They're small enough, though, that there's no harm in keeping them.

Note that this uses the newer streaming pattern where we don't write any large files to disk (so we can run more things at once).
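For readers unfamiliar with the idea, here is a minimal sketch of consuming a tool's output as a stream instead of writing it to disk first. It is not the pipeline's actual code: `stream_lines` is a hypothetical helper name, and the demo command is a stand-in Python one-liner rather than a real Bowtie invocation.

```python
import subprocess
import sys


def stream_lines(cmd):
    """Yield stdout lines from cmd as they are produced, without a temp file.

    Hypothetical helper for illustration; cmd is an argv-style list.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    try:
        # Iterating over the pipe reads one line at a time, so memory and
        # disk use stay small no matter how much output the tool writes.
        for line in proc.stdout:
            yield line.rstrip("\n")
    finally:
        proc.stdout.close()
        returncode = proc.wait()
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, cmd)


if __name__ == "__main__":
    # Stand-in for an aligner: a child process that prints three lines.
    demo_cmd = [sys.executable, "-c", "for i in range(3): print('read', i)"]
    for line in stream_lines(demo_cmd):
        print(line)
```

Because each line is handled as it arrives, the only thing that ever reaches disk is whatever the consumer chooses to keep.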

Instead of using pysam I now manually parse the .sam file so I can run Bowtie with --no-sq. The useless @SQ header lines make the .sam files enormous, enough that I'm nervous about running many copies of this at once.
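As a rough illustration of what manual parsing involves, the sketch below reads one SAM line at a time using only the layout defined by the SAM format: header lines start with `@`, alignment lines have 11 mandatory tab-separated fields, flag bit `0x4` marks an unmapped read, and optional tags such as the `AS:i:` alignment score follow. The function name `parse_sam_line`, the returned dict layout, and the sample reads are assumptions made for this example, not the code in this PR.

```python
def parse_sam_line(line):
    """Parse one SAM line into a dict; return None for headers and unmapped reads.

    Hypothetical helper for illustration only.
    """
    if line.startswith("@"):
        return None  # header line (@HD, @SQ, @PG, ...)

    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-separated SAM fields")

    flag = int(fields[1])
    if flag & 0x4:
        return None  # flag bit 0x4: read is unmapped

    # Optional tags come after the 11 mandatory fields, e.g. "AS:i:-3".
    score = None
    for tag in fields[11:]:
        if tag.startswith("AS:i:"):
            score = int(tag[len("AS:i:"):])

    return {
        "read_id": fields[0],
        "flag": flag,
        "ref": fields[2],
        "pos": int(fields[3]),  # 1-based leftmost mapping position
        "cigar": fields[5],
        "score": score,  # None if the aligner emitted no AS tag
    }


if __name__ == "__main__":
    # Made-up alignment line; "ref1" is a placeholder reference name.
    sample = "read1\t0\tref1\t101\t42\t5M\t*\t0\t0\tACGTA\tIIIII\tAS:i:-3\tXN:i:0"
    print(parse_sam_line(sample))
```

Nothing here needs the `@SQ` reference-length records, which is why dropping them does not affect this style of parsing.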

simonleandergrimm commented 7 months ago

> We won't use all of these: some are reads that Kraken can correctly identify as human etc., which Bowtie (with the HV-only DB we're using) has no insight into, and others have such low alignment scores that they may as well be junk (but we haven't yet decided which score to use as a cutoff).

@jeffkaufman can you say more about the part where Bowtie has no insight into some reads belonging to human-infecting viruses? What's the underlying reason for this?

simonleandergrimm commented 7 months ago

Also let me know if you have time to explain the streaming approach, would be happy to better understand this.

jeffkaufman commented 7 months ago

> Also let me know if you have time to explain the streaming approach, would be happy to better understand this.

Sure! Why don't you read https://www.jefftk.com/p/process-substitution-without-shell and then if you still have questions we can talk?