knights-lab / BURST

An ultrafast optimal aligner for mapping large NGS data to large genome databases.
GNU Affero General Public License v3.0

Complexity filter #23

Closed theo-allnutt-bioinformatics closed 4 years ago

theo-allnutt-bioinformatics commented 4 years ago

Is it possible to add a low complexity filter to Burst?

Thanks,

Theo

GabeAl commented 4 years ago

This is a good idea, Theo.

I would probably add it to SHI7 rather than BURST, as it is a read filtering / QC step rather than part of alignment (even though some tools, such as BLAST, integrate it into the aligner).

I'm just not sure complexity masking is a good idea for all (or even most) semi-global cases, though. For instance, in end-to-end query alignment, one often wants to aggregate alignments after a run to calculate coverage of the reference genome(s). If low-complexity reads were dropped, this might not be possible, because gaps would be introduced in low-complexity regions. If filtering were also applied to the reference genomes, their length and class distribution would be biased, as some families are naturally more complex throughout their genome than others. Also, for taxonomy assignment, low-complexity reads may still be of (limited) use, as they can contribute to LCA and Bayesian redistribution (even if only at less informative, broad taxonomic levels).

What was the use case you had in mind?

Thanks, Gabe

On Tue, Feb 25, 2020, 11:12 PM Theo Allnutt Bioinformatics < notifications@github.com> wrote:

Is it possible to add a low complexity filter to Burst?

Thanks,

Theo


theo-allnutt-bioinformatics commented 4 years ago

Hi Gabe,

What I had in mind was filtering out grossly simple sequences. Despite normal QC, I still see reads such as 150 bp of 'A's interspersed with very few other bases. Long SSRs (simple sequence repeats) may also cover complete 150 bp reads.

Thanks,

Theo
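
For illustration, the two read types described above (near-homopolymer reads and SSRs spanning a whole read) could be flagged before alignment with a small standalone check. This is only a minimal sketch and is not part of BURST or SHI7: the function names, the choice of k = 3, and both thresholds are assumptions picked for the example and would need tuning on real data.

```python
from collections import Counter


def max_base_fraction(seq):
    """Fraction of the read taken up by its single most common base."""
    seq = seq.upper()
    if not seq:
        return 1.0  # treat an empty read as maximally simple
    return max(Counter(seq).values()) / len(seq)


def kmer_diversity(seq, k=3):
    """Distinct k-mers observed, relative to the most that could be observed.

    The denominator is min(number of windows, 4**k), so a long read is not
    penalised merely for exceeding the number of possible k-mers.
    """
    seq = seq.upper()
    windows = len(seq) - k + 1
    if windows <= 0:
        return 0.0  # read shorter than k: treat as low complexity
    distinct = len({seq[i:i + k] for i in range(windows)})
    return distinct / min(windows, 4 ** k)


def is_low_complexity(seq, max_base=0.9, min_diversity=0.1, k=3):
    """Flag near-homopolymers (max_base) and short tandem repeats (min_diversity).

    Both thresholds are illustrative assumptions, not recommended values.
    """
    return (max_base_fraction(seq) >= max_base
            or kmer_diversity(seq, k) < min_diversity)


if __name__ == "__main__":
    examples = {
        "poly-A with a few other bases": "A" * 70 + "C" + "A" * 40 + "GT" + "A" * 37,
        "dinucleotide SSR": "AC" * 75,
        "ordinary read": "ATGGCGTACGTTAGCCTAGGATCCGATTACGCAGTCAAGGCTTAACCGGTATGCATCGAT",
    }
    for name, read in examples.items():
        print(name, is_low_complexity(read))
```

The base-fraction test catches reads dominated by one nucleotide, while the k-mer test catches repeats such as (AC)n that have a balanced base composition but very few distinct k-mers. Applied to reads only, as a QC step, this leaves the reference database untouched.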