ruanjue / wtdbg2

Redbean: A fuzzy Bruijn graph approach to long noisy reads assembly
GNU General Public License v3.0

Looking for explanation of a few assembly parameters #215

Closed: yeban closed this issue 4 years ago

yeban commented 4 years ago

wtdbg2 produced a good assembly on my PacBio dataset (50x) using default parameters (-g 450g -x sequel). I was wondering whether I could get a bit more out of the dataset by adjusting a few parameters, but I don't understand what some of them mean:

  1. Is there a way to change the 256 base pair bin size? Would it be the --aln-kmer-sampling option?
  2. What does penalty for bin deviation (--dp-max-var) mean? Is it similar to what is conventionally the penalty for gap extension?
  3. If I wanted to separate overlaps between duplicated regions of the genome, should I look at increasing the minimum similarity (-s) or increasing the gap penalty (--dp-penalty-gap)? Should I also consider increasing the gap variation penalty (--dp-penalty-var)?
  4. Lastly, is the minimum alignment length (-l) defined in base pairs, k-mers or K-bins? And how does it differ from the -m option?

Thanks in advance for your thoughts.

ruanjue commented 4 years ago

  1. The bin size cannot be changed.
  2. --dp-max-var restricts the maximum deviation off the diagonal within a gap.
  3. -s is easier to control.
  4. -l is in bases. -m is also in bases: it sums the lengths of all k-mers, but counts bases where k-mers overlap only once.

Jue
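
To illustrate the description of -m above (summing k-mer lengths with overlaps removed), here is a minimal Python sketch. It is not wtdbg2's code: the function name `matched_bases`, the 0-based start positions, and the k-mer length of 15 are assumptions chosen only for the example.

```python
def matched_bases(kmer_starts, k):
    """Count bases covered by at least one k-mer of length k.

    kmer_starts: non-negative, 0-based start positions of matched k-mers
                 (hypothetical input, for illustration only).
    """
    covered = 0
    covered_end = 0  # exclusive end of the region already counted
    for start in sorted(kmer_starts):
        end = start + k
        if end <= covered_end:
            continue  # this k-mer adds no new bases
        # Count only the part that extends past what is already covered.
        covered += end - max(start, covered_end)
        covered_end = end
    return covered


# Three 15-mers; the first two overlap by 10 bases.
starts = [0, 5, 30]
naive_sum = len(starts) * 15            # counts the overlap twice
merged = matched_bases(starts, 15)      # counts each base once
print(naive_sum, merged)
```

In this example the naive sum is 45 bases, while the overlap-aware count is 35, which is the kind of quantity a threshold expressed "in bases" would be compared against.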

yeban commented 4 years ago

Thanks!