rrwick / Unicycler

hybrid assembly pipeline for bacterial genomes
GNU General Public License v3.0

Default values for optional parameters #5

Closed nekrut closed 7 years ago

nekrut commented 7 years ago

What are the recommended defaults for the following options:

--min_kmer_frac MIN_KMER_FRAC
--max_kmer_frac MAX_KMER_FRAC  
--kmer_count KMER_COUNT 

--start_gene_id START_GENE_ID
--start_gene_cov START_GENE_COV

--min_component_size MIN_COMPONENT_SIZE
--min_dead_end_size MIN_DEAD_END_SIZE

--scores SCORES
--low_score LOW_SCORE
--min_len MIN_LEN
--allowed_overlap ALLOWED_OVERLAP 
--kmer KMER

rrwick commented 7 years ago

Anton,

Thanks for reminding me that I had to clean up some of the command line options! I've removed a couple that aren't used anymore and updated the help output such that more defaults are shown. If you pull from GitHub now, you should get these recent fixes.

I tried to make all of the options have intelligent defaults, so I don't think you'll need to change them. But if you're interested, here they are:

--min_kmer_frac MIN_KMER_FRAC         Lowest k-mer size for SPAdes assembly, expressed as a fraction of the read length (default: 0.2)
--max_kmer_frac MAX_KMER_FRAC         Highest k-mer size for SPAdes assembly, expressed as a fraction of the read length (default: 0.95)
--kmer_count KMER_COUNT               Number of k-mer steps to use in SPAdes assembly (default: 10)

--start_gene_id START_GENE_ID         The minimum required BLAST percent identity for a start gene search (default: 90.0)
--start_gene_cov START_GENE_COV       The minimum required BLAST percent coverage for a start gene search (default: 95.0)

--min_component_size MIN_COMPONENT_SIZE  Unbridged graph components smaller than this size (bp) will be removed from the final graph (default: 1000)
--min_dead_end_size MIN_DEAD_END_SIZE    Graph dead ends smaller than this size (bp) will be removed from the final graph (default: 1000)

--scores SCORES                       Comma-delimited string of alignment scores: match, mismatch, gap open, gap extend (default: 3,-6,-5,-2)
--low_score LOW_SCORE                 Score threshold - alignments below this are considered poor (default: set threshold automatically)

Regarding the SPAdes k-mer options: the defaults are 0.2, 0.95 and 10, which means Unicycler will try SPAdes k-mers from 20% of the read length to 95% of the read length, in 10 steps. So if the read length is 125 bp, it will use these k-mers: 25, 43, 59, 73, 83, 93, 101, 107, 113, 119. Some caveats: all k-mers must be odd, the maximum SPAdes k-mer is 127 so the range can't go above that, and I limited the low end to k=11. In most cases the ideal k-mer is near the top of the range, so the k-mers aren't evenly spaced. Instead, the lower k-mers are farther apart (e.g. 25, 43) and the higher k-mers are closer together (e.g. 113, 119).
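To make the range arithmetic concrete, here is a minimal Python sketch of how the lowest and highest k-mer could be derived from the read length. It is not Unicycler's actual code: the function names and the rule for rounding to an odd number are assumptions for illustration, and it only computes the two ends of the range, not the unevenly spaced steps in between.

```python
MIN_SPADES_KMER = 11   # low-end limit mentioned above
MAX_SPADES_KMER = 127  # SPAdes maximum mentioned above


def nearest_odd(value):
    """Round to an integer, bumping even results up by one (assumed rule)."""
    k = int(round(value))
    return k if k % 2 == 1 else k + 1


def clamp_kmer(k):
    """Keep a k-mer inside the allowed [11, 127] range."""
    return max(MIN_SPADES_KMER, min(MAX_SPADES_KMER, k))


def kmer_bounds(read_length, min_kmer_frac=0.2, max_kmer_frac=0.95):
    """Return the (lowest, highest) k-mer for a given read length."""
    low = nearest_odd(read_length * min_kmer_frac)
    high = nearest_odd(read_length * max_kmer_frac)
    return clamp_kmer(low), clamp_kmer(high)


print(kmer_bounds(125))  # (25, 119)
```

With 125 bp reads this gives 25 and 119, matching the ends of the list above. With 250 bp reads the top of the range would be capped at 127.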

Regarding the --low_score default: Unicycler will determine what the distribution of scores looks like for totally random sequence alignments using the current scoring scheme (as given by --scores). It then sets the threshold to be 5 standard deviations above the mean random score. For example, if aligning random sequences gives a mean score of 59.7 and a stdev of 1.7, then the threshold will be set to 59.7 + 1.7 * 5 = 68.2. Note that this only depends on the scoring scheme, not on your input reads.
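The threshold arithmetic can be sketched in a few lines of Python. This only reproduces the "mean plus 5 standard deviations" rule described above. The function name and parameters are hypothetical, and how Unicycler actually generates and scores the random alignments is not shown here.

```python
def low_score_threshold(mean_random_score, stdev_random_score, num_stdevs=5):
    """Threshold = mean random-alignment score + num_stdevs * stdev."""
    return mean_random_score + num_stdevs * stdev_random_score


# The worked example from above: mean 59.7, stdev 1.7
print(round(low_score_threshold(59.7, 1.7), 1))  # 68.2
```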

Ryan

nekrut commented 7 years ago

Many thanks. Unicycler should be on Galaxy main (http://usegalaxy.org) in a week or so.