samhunter closed 10 years ago
Simple (tandem) repeat masking in combination with more stringent mapping parameters seems to be helping a lot with this issue (note that the added stringency only works with BLAT, not BOWTIE2). I'd still like to improve this to include some sort of coverage tracking, but the recent improvements in the develop branch are a big step forward.
Because of the "sloppy mapping" approach, targets sometimes pull in a few repetitive regions, which then pull in a few more, and so on. This slows assembly badly and drags down the whole process. Currently this is partially handled by repeat detection and removal based on the % difference in read incorporation from iteration to iteration. Some alternative, smarter approaches to dealing with this might include:
1) More stringent mapping parameters which go into effect after the first iteration. There isn't really any need for "sloppy" mapping once a set of initial contigs has been established.
2) Some sort of contig composition filtering step to screen out low-complexity contigs. This might be as simple as a 2-mer frequency table followed by some outlier detection, or something like a "Dusty score" calculation might work better.
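To make option 2 a bit more concrete, here is a minimal sketch of the 2-mer frequency idea. It is not code from the project: the function names, the use of the largest 2-mer fraction as the score, the median/MAD outlier rule, and the `n_mads` and `min_fraction` defaults are all assumptions made for illustration, and the thresholds would need tuning on real contigs.

```python
from collections import Counter
from statistics import median


def dinucleotide_freqs(seq):
    """Return a dict mapping each 2-mer in seq to its relative frequency."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def max_dinucleotide_fraction(seq):
    """Score a contig by the share of its most common 2-mer (hypothetical score)."""
    freqs = dinucleotide_freqs(seq)
    return max(freqs.values()) if freqs else 0.0


def flag_low_complexity(contigs, n_mads=3.0, min_fraction=0.25):
    """Return names of contigs whose score is an outlier within the set.

    contigs: dict of name -> sequence.
    The cutoff is median + n_mads * MAD, but never below min_fraction,
    so a uniformly "normal" set does not get flagged by accident.
    Both parameters are illustrative assumptions, not tuned values.
    """
    scores = {name: max_dinucleotide_fraction(s) for name, s in contigs.items()}
    if not scores:
        return []
    med = median(scores.values())
    mad = median(abs(v - med) for v in scores.values())
    cutoff = max(med + n_mads * mad, min_fraction)
    return [name for name, score in scores.items() if score > cutoff]


if __name__ == "__main__":
    example = {
        "contig_a": "ACGTTGCAAGCTAGGCTTACGATCCGTAGCTA",
        "contig_b": "GATTACAGGCTCAATCGGTACCTGAAGTCCAT",
        "contig_c": "TTGACCGATAGCATGCAACTGGATCCTAGGCA",
        "repeat": "AT" * 50,
    }
    print(flag_low_complexity(example))
```

A single max-fraction score is the simplest thing that catches dinucleotide repeats like (AT)n. A windowed score along the lines of DUST would probably be needed to catch contigs that are only partly repetitive.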