Trimming: parameter tuning?

miraep8 commented 1 month ago

Following up on a discussion I had with @nickp60 earlier on whether or not we should retune the bbduk parameters during trimming (given that we have some reads that look like adapter/empty sequence (GGGG's) making it through our current rounds). Just collecting some notes/suggestions here:

Current Status/Notes:

During the preprocessing stage - we currently run bbduk with the following parameters to trim/qc reads:

minlen=51 qtrim=rl trimq=10 ktrim=r k=31 mink=9 hdist=1 hdist2=1 tpe tbo

Some notes from the docs on what each of these mean:

minlen: after any trimming operations, reads must be at least this long to be kept
qtrim: can be set to r l or rl. r will only quality trim the right side, l the left, rl for both. Happens after all k-mer based operations.
trimq: quality threshold to trim to using Phred algorithm. (ie trimq = 10 will trim to Q10)
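ktrim: trims reads at bases matching a reference k-mer, rather than discarding the whole read. The value sets the direction: with r, the matching k-mer and everything to its right are removed; with l, everything to its left.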
k: what length of "mer" of the reference to store (i.e. in this case we store all 31-mers of the reference and try to match them against 31-mers in the sequencing file)
mink: useful for adapters in particular, for trimming partial matches of the front of the adapter at the tip of a read. Will look for shorter k-mers at the read end, down to length mink. I.e. when ktrim = r and mink = 9 (as in our settings), it will look to see if any full k-mers match, then whether the last k-1 bases of the read match, then k-2 bases, ... down to the last mink bases.
hdist: the maximum Hamming distance (number of mismatches) at which a k-mer is still considered a match
hdist2: the Hamming distance control for the short k-mers enabled by mink (because these are shorter, one may want to set a lower Hamming distance to control false positives).
tpe: flag to trim both reads of a pair to the same length (even if, for example, adapter was only detected in one of them).
tbo: trim adapters based on pair overlap detection using bbmerge (meaning that adapters can be trimmed even without known adapter sequences).
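
To make the interaction between k, mink, hdist and hdist2 a bit more concrete, here is a toy sketch of right-side k-mer trimming. This is only my reading of the docs, not bbduk's actual implementation: the function name `toy_ktrim_right`, the short adapter, and the small k/mink values are all made up for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def toy_ktrim_right(read, adapter, k=31, mink=9, hdist=1, hdist2=1):
    """Toy model of ktrim=r (hypothetical helper, not bbduk's algorithm)."""
    k = min(k, len(adapter))
    # Store every k-mer of the reference (here: a single adapter).
    adapter_kmers = {adapter[i:i + k] for i in range(len(adapter) - k + 1)}

    # Pass 1: full-length k-mers. The leftmost match wins; the match and
    # everything to its right are removed.
    for start in range(len(read) - k + 1):
        window = read[start:start + k]
        if any(hamming(window, ref) <= hdist for ref in adapter_kmers):
            return read[:start]

    # Pass 2: short k-mers at the right tip of the read (k-1 down to mink),
    # compared against the adapter prefix of the same length using hdist2.
    for length in range(min(k - 1, len(read)), mink - 1, -1):
        if hamming(read[-length:], adapter[:length]) <= hdist2:
            return read[:-length]

    return read


ADAPTER = "AGATCGGAAGAGC"   # illustrative adapter prefix only
INSERT = "TTTTTTTTTT"       # stand-in for the real insert

print(toy_ktrim_right(INSERT + ADAPTER, ADAPTER, k=8, mink=4))      # TTTTTTTTTT
print(toy_ktrim_right(INSERT + ADAPTER[:5], ADAPTER, k=8, mink=4))  # TTTTTTTTTT
print(toy_ktrim_right(INSERT + ADAPTER[:3], ADAPTER, k=8, mink=4))  # unchanged: 3 bases < mink
```

The last call shows why adapter remnants shorter than mink survive k-mer trimming: no short k-mer is long enough to be tested against them.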

Potential Solutions/Followups:

For some of the reads that still seem to contain adapters, I think it could be worth looking into the following:
- Map the adapter sequence to these reads and see where the regions of overlap fall. If, for example, we have some bases at the beginning of the sequence that don't match the adapter, it is possible that with ktrim = r and hdist = 1 these mismatches mean that we don't properly catch these as adapters? The solution to this might be similar to the below, or, if the first few bases are of low quality, we could consider doing a first-pass quality trim on the left and right and then calling bbduk to do the k-mer based trimming.
- If not, count what the Hamming distance actually is for all k-mers starting from the right-most position, and consider using a lower hdist2 value? (potentially combined with a higher mink to offset false positives?)
- If it is a consistent error in the adapter, maybe this is a real artifact we should include in the reference? (Or, if there is a consistent error in a particular region, maybe we could just create some fake adapters that sample sequence space, for example at the front of the read if that is where the issue is.)
Could also see if using entropy filtering would help in this situation (i.e. in catching the degenerate stretches of Gs).
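
A rough sketch of how the checks in the bullets above could be done: slide the adapter along each read, record the mismatches at every offset (allowing the adapter to hang off the right end), and tally which adapter positions mismatch across the best hits. The helper names, the `min_overlap` and `max_rate` cutoffs, and the short adapter are assumptions for illustration; this uses ungapped comparison only, so indels would not be picked up.

```python
from collections import Counter


def adapter_hits(read, adapter, min_overlap=9):
    """For each start offset, list mismatching adapter positions (ungapped)."""
    hits = []
    for start in range(len(read) - min_overlap + 1):
        # The adapter may run past the right end of the read.
        overlap = min(len(adapter), len(read) - start)
        mismatches = [i for i in range(overlap) if read[start + i] != adapter[i]]
        hits.append((start, overlap, mismatches))
    return hits


def best_hit(read, adapter, min_overlap=9):
    """Return the (start, overlap, mismatches) hit with the lowest mismatch rate."""
    hits = adapter_hits(read, adapter, min_overlap)
    if not hits:
        return None
    # Ties on mismatch rate are broken by the leftmost start.
    return min(hits, key=lambda h: (len(h[2]) / h[1], h[0]))


def mismatch_profile(reads, adapter, min_overlap=9, max_rate=0.2):
    """Count how often each adapter position mismatches across plausible hits."""
    profile = Counter()
    for read in reads:
        hit = best_hit(read, adapter, min_overlap)
        if hit is not None and len(hit[2]) / hit[1] <= max_rate:
            profile.update(hit[2])
    return profile


ADAPTER = "AGATCGGAAGAGC"                 # illustrative adapter prefix only
read = "TTTTTTTTTT" + "AGATCGGTAGAGC"     # adapter with one substitution
print(best_hit(read, ADAPTER))
print(mismatch_profile([read, read], ADAPTER))
```

If the profile piles up on a few adapter positions, that would point towards a consistent artifact worth adding to the reference; a flat profile would point more towards tuning hdist/hdist2.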
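
For a quick feel of what entropy filtering would catch, here is a minimal sketch that scores a read by the normalised Shannon entropy of its k-mers. It is not the calculation bbduk performs internally; the function names, k = 5, and the 0.5 cutoff are assumptions picked for illustration.

```python
import math
from collections import Counter


def kmer_entropy(seq, k=5):
    """Normalised Shannon entropy of the k-mers in seq (0 = one repeated k-mer)."""
    n = len(seq) - k + 1
    if n <= 1:
        return 0.0
    counts = Counter(seq[i:i + k] for i in range(n))
    # Shannon entropy in bits over the observed k-mer frequencies.
    h = sum((c / n) * math.log2(n / c) for c in counts.values())
    # Normalise by the maximum achievable with n k-mers over a 4-letter alphabet.
    return h / math.log2(min(n, 4 ** k))


def is_low_complexity(seq, k=5, min_entropy=0.5):
    """Flag reads whose k-mer entropy falls below an (assumed) cutoff."""
    return kmer_entropy(seq, k) < min_entropy


print(kmer_entropy("G" * 50))          # poly-G read scores 0.0
print(is_low_complexity("G" * 50))     # True
print(is_low_complexity("ACGTTGCA"))   # False: every 5-mer is distinct
```

Running something like this over the reads that currently escape trimming would show whether the poly-G ones separate cleanly from real sequence before we commit to a threshold.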

miraep8 commented 1 month ago

Related to #70

funnell commented 1 month ago

Could also try another tool to see if it does any better. fastp seems to be getting popular.

 Tyler

miraep8 commented 1 month ago

👍 Good suggestion! Worth looking into at least

vdblab / vdblab-shotgun

Trimming: parameter tuning? #92
