Open miraep8 opened 1 month ago
Related to #70
Could also try another tool to see if it does any better. fastp seems to be getting popular.
Tyler
On Mon, Sep 16, 2024 at 9:49 PM Mirae Baichoo @.***> wrote:
Related to #70 https://github.com/vdblab/vdblab-shotgun/issues/70
👍 Good suggestion! Worth looking into at least
Following up on a discussion I had with @nickp60 earlier on whether or not we should retune the `bbduk` parameters during trimming (given that we have some reads that look like adapter/empty sequence (GGGG's) making it through our current rounds). Just collecting some notes/suggestions here:

Current Status/Notes:

During the preprocessing stage we currently run `bbduk` with the following parameters to trim/qc reads: `minlen=51 qtrim=rl trimq=10 ktrim=r k=31 mink=9 hdist=1 hdist2=1 tpe tbo`
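To make it easier to try retuned values side by side, here is a minimal sketch of a helper that builds a `bbduk.sh` command line from the parameter string above. The file names, the `build_bbduk_command` helper and the `hdist=2` override are hypothetical and only for illustration; the pipeline may invoke `bbduk` differently, and nothing is executed here.

```python
import shlex

# Trimming parameters quoted in this issue.
TRIM_PARAMS = "minlen=51 qtrim=rl trimq=10 ktrim=r k=31 mink=9 hdist=1 hdist2=1 tpe tbo"


def build_bbduk_command(r1, r2, out1, out2, adapters,
                        params=TRIM_PARAMS, overrides=None):
    """Return an argv list for bbduk.sh (hypothetical helper; runs nothing)."""
    settings = {}
    for token in params.split():
        key, _, value = token.partition("=")
        settings[key] = value  # bare flags such as tpe/tbo keep an empty value
    settings.update(overrides or {})

    argv = ["bbduk.sh", f"in1={r1}", f"in2={r2}",
            f"out1={out1}", f"out2={out2}", f"ref={adapters}"]
    for key, value in settings.items():
        argv.append(f"{key}={value}" if value != "" else key)
    return argv


# Example: same settings, but with a (hypothetical) looser hdist to compare.
cmd = build_bbduk_command(
    "R1.fastq.gz", "R2.fastq.gz",
    "trimmed_R1.fastq.gz", "trimmed_R2.fastq.gz",
    "adapters.fa",
    overrides={"hdist": "2"},
)
print(shlex.join(cmd))
```

Keeping the settings in one string and applying overrides on top means each test run only differs in the parameters being compared.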
Some notes from the docs on what each of these means:

- `minlen`: after any trimming operations, reads must be at least this long.
- `qtrim`: can be set to `r`, `l` or `rl`. `r` will only quality trim the right side, `l` the left, `rl` both. Happens after all k-mer based operations.
- `trimq`: quality threshold to trim to using the Phred algorithm (i.e. `trimq=10` will trim to Q10).
- `ktrim`: trims a read where it matches a reference kmer: the matching kmer and everything to the right (`r`) or left (`l`) of it is removed, rather than the whole read being discarded.
- `k`: what length of "mer" of the reference to store (i.e. in this case we store all 31-mers of the reference and try to match against 31-mers in the sequencing file).
- `mink`: useful for adapters in particular, for trimming partial matches of the adapter at the front. Will look for kmers down to length `mink` from one end, i.e. when `ktrim=r` and `mink=8`, it will look to see if any full k-mers match, then if the first k-1 bases match, k-2 bases ... first `mink` bases match.
- `hdist`: the Hamming distance within which kmers are considered a match.
- `hdist2`: the Hamming distance for the short kmers specified by `mink` (because these are shorter, one may want to set a lower Hamming distance to control false positives).
- `tpe`: flag to trim both reads to the same length (even if, for example, an adapter was only detected in one of them).
- `tbo`: trim adapters based on pair overlap detection using `bbmerge` (meaning that adapters can be trimmed even without known adapter sequences).

Potential Solutions/Followups:
For some of the reads that still seem to contain adapters I think it could be worth looking into the following:

- `ktrim=r` and `hdist=1`: these mismatches mean that we don't properly catch these as adapters? The solution to this might be similar to the below, or maybe we could consider doing a first-pass quality trim on the left and right and then call `bbduk` to do the kmer-based trimming if the first few bases are of low quality.
- `mink` to offset false positives?
- Could also see if using entropy filtering would help in this situation (i.e. in catching the degenerate stretches of Gs).
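To get a feel for why entropy filtering might catch the poly-G reads, here is a rough sketch that scores a read by the normalized Shannon entropy of its k-mers. This is only an illustration: the function names, `k=5` and the `0.5` threshold are assumptions, and BBDuk's own entropy filter works on sliding windows and may differ in detail.

```python
import math
from collections import Counter


def kmer_entropy(seq, k=5):
    """Normalized Shannon entropy (0 to 1) of the k-mers in seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if len(kmers) < 2:
        return 0.0
    counts = Counter(kmers)
    total = len(kmers)
    # H = sum p * log2(1/p) over the observed k-mer frequencies.
    h = sum((c / total) * math.log2(total / c) for c in counts.values())
    # Entropy is maximal when every k-mer in the read is distinct.
    max_h = math.log2(min(total, 4 ** k))
    return h / max_h


def is_low_complexity(seq, threshold=0.5, k=5):
    """Flag reads whose k-mer entropy falls below a (hypothetical) threshold."""
    return kmer_entropy(seq, k) < threshold


poly_g = "G" * 60
mixed = "ACGTTGCAAGCTTAGGCATCGATCGGATTACAGCCTAGGTCAATGCCGTAACGTTAGC"

print(round(kmer_entropy(poly_g), 3), is_low_complexity(poly_g))
print(round(kmer_entropy(mixed), 3), is_low_complexity(mixed))
```

A read made entirely of Gs contains a single distinct k-mer, so its entropy is zero and it is flagged, while a read with mixed sequence scores close to 1. A read that is only partly poly-G would land somewhere in between, so the threshold would need checking against real examples from our data.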