ibest / ARC

Assembly by Reduced Complexity (ARC)
Apache License 2.0

ARC sometimes incorporates a lot of off-target and/or repetitive reads #32

Closed samhunter closed 10 years ago

samhunter commented 11 years ago

Because of the "sloppy mapping" approach, targets sometimes pull in reads from a few repetitive regions, which then pull in a few more, and so on. This badly slows assembly and, with it, the whole process. Currently this is partially handled by repeat detection and removal based on the % difference in read incorporation from iteration to iteration. Some alternative, smarter approaches to dealing with this might include:

1) More stringent mapping parameters which go into effect after the first iteration. There isn't really any need for "sloppy" mapping once a set of initial contigs has been established.
2) Some sort of contig composition filtering step to screen out low-complexity contigs. This might be as simple as a 2-mer frequency table followed by some outlier detection, or something like a "Dusty score" calculation might work better.
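To make option 2 a bit more concrete, here is a rough sketch of what a 2-mer frequency table plus outlier detection could look like. This is not code from ARC; the function names, the median/MAD outlier rule, and the `n_mads` and `min_fraction` thresholds are all assumptions picked for illustration and would need tuning on real contigs.

```python
from collections import Counter
from itertools import product
from statistics import median

# The 16 possible dinucleotides over A, C, G, T.
DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]


def dinuc_freqs(seq):
    """Return the 16 overlapping 2-mer frequencies of seq (N's are ignored)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts[d] for d in DINUCS)
    if total == 0:
        return [0.0] * len(DINUCS)
    return [counts[d] / total for d in DINUCS]


def max_dinuc_fraction(seq):
    """Fraction of the contig taken up by its single most common 2-mer."""
    return max(dinuc_freqs(seq))


def flag_low_complexity(contigs, n_mads=5.0, min_fraction=0.3):
    """Flag contigs whose dominant 2-mer fraction is an outlier.

    contigs: dict mapping contig name -> sequence.
    A contig is flagged when its score exceeds both the hypothetical absolute
    floor (min_fraction) and median + n_mads * MAD over all contigs.
    """
    scores = {name: max_dinuc_fraction(seq) for name, seq in contigs.items()}
    med = median(scores.values())
    mad = median(abs(s - med) for s in scores.values())
    cutoff = max(med + n_mads * mad, min_fraction)
    return sorted(name for name, s in scores.items() if s > cutoff)
```

Scoring by the dominant 2-mer only catches the simplest cases (homopolymers and dinucleotide repeats). An entropy-style score over all 16 frequencies, or a DUST-like score on 3-mers, would probably be more sensitive.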

samhunter commented 10 years ago

Simple (tandem) repeat masking in combination with more stringent mapping parameters seems to be helping a lot with this issue (note that the added stringency only works with BLAT, not BOWTIE2). I'd still like to improve this to include some sort of coverage tracking, but the recent improvements in the develop branch are a big step forward.
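For anyone wondering what simple tandem repeat masking amounts to, here is a minimal sketch. It is an illustration only and not the code in the develop branch: the function name, the regex approach, and the `max_unit` and `min_copies` defaults are assumptions. It soft-masks (lowercases) any run where a short unit is repeated back to back.

```python
import re


def mask_tandem_repeats(seq, max_unit=4, min_copies=5):
    """Lowercase runs where a 1..max_unit bp unit occurs >= min_copies times in a row.

    The input is uppercased first, so any existing soft-masking is discarded.
    """
    # Lazy unit capture followed by at least (min_copies - 1) further copies.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    return pattern.sub(lambda m: m.group(0).lower(), seq.upper())


masked = mask_tandem_repeats("GGCC" + "AT" * 6 + "CCGG")
print(masked)
```

Lowercasing rather than replacing with N keeps the bases available to the assembler while letting a mapper that honors soft-masking skip them when seeding alignments.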