Closed nekrut closed 7 years ago
Anton,
Thanks for reminding me that I had to clean up some of the command line options! I've removed a couple that aren't used anymore and updated the help output such that more defaults are shown. If you pull from GitHub now, you should get these recent fixes.
I tried to make all of the options have intelligent defaults, so I don't think you'll need to change them. But if you're interested, here they are:
--min_kmer_frac MIN_KMER_FRAC Lowest k-mer size for SPAdes assembly, expressed as a fraction of the read length (default: 0.2)
--max_kmer_frac MAX_KMER_FRAC Highest k-mer size for SPAdes assembly, expressed as a fraction of the read length (default: 0.95)
--kmer_count KMER_COUNT Number of k-mer steps to use in SPAdes assembly (default: 10)
--start_gene_id START_GENE_ID The minimum required BLAST percent identity for a start gene search (default: 90.0)
--start_gene_cov START_GENE_COV The minimum required BLAST percent coverage for a start gene search (default: 95.0)
--min_component_size MIN_COMPONENT_SIZE Unbridged graph components smaller than this size (bp) will be removed from the final graph (default: 1000)
--min_dead_end_size MIN_DEAD_END_SIZE Graph dead ends smaller than this size (bp) will be removed from the final graph (default: 1000)
--scores SCORES Comma-delimited string of alignment scores: match, mismatch, gap open, gap extend (default: 3,-6,-5,-2)
--low_score LOW_SCORE Score threshold - alignments below this are considered poor (default: set threshold automatically)
Regarding the SPAdes k-mer stuff: The defaults here are 0.2, 0.95 and 10, which means that it will try SPAdes kmers from 20% of the read length to 95% of the read length, using 10 steps. So if the read length was 125 bp, then it will use these k-mers: 25, 43, 59, 73, 83, 93, 101, 107, 113, 119. Some caveats: all k-mers must be odd, the maximum SPAdes k-mer is 127 so the range can't get above that, and I limited the low end to k=11. In most cases the ideal k-mer is near the top of the range, so the k-mer range isn't evenly spaced. Instead, the lower k-mers are more distant (e.g. 25, 43) and the high k-mers are closer together (e.g. 113, 119).
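As a rough illustration of the caveats above, here is a minimal Python sketch that computes only the lowest and highest k-mer for a given read length. It is not Unicycler's actual code: the function name spades_kmer_bounds and the round-to-odd rule are assumptions made for this example, and it does not attempt to reproduce the uneven spacing of the intermediate k-mers.

```python
def spades_kmer_bounds(read_length, min_kmer_frac=0.2, max_kmer_frac=0.95):
    """Return (lowest, highest) k-mer size for a read length.

    Hypothetical helper: only the fraction defaults, the odd-k rule and the
    11..127 limits come from the description above.
    """
    def nearest_odd(value):
        # Assumed rounding rule: round to an integer, bump even results up by one.
        k = round(value)
        return k if k % 2 == 1 else k + 1

    low = max(11, nearest_odd(read_length * min_kmer_frac))    # floor at k=11
    high = min(127, nearest_odd(read_length * max_kmer_frac))  # SPAdes maximum
    return low, high


print(spades_kmer_bounds(125))  # (25, 119), matching the 125 bp example
```

For longer reads the upper bound is capped at 127, and for very short reads the lower bound is held at 11.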
Regarding the --low_score default: Unicycler will determine what the distribution of scores looks like for totally random sequence alignments using the current scoring scheme (as given by --scores). It then sets the threshold to be 5 standard deviations above the mean random score. For example, if aligning random sequences gives a mean score of 59.7 and a stdev of 1.7, then the threshold will be set to 59.7 + 1.7 * 5 = 68.2. Note that this only depends on the scoring scheme, not on your input reads.
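The threshold arithmetic can be sketched in a few lines of Python. This is only an illustration of the "mean plus 5 standard deviations" rule described above, not Unicycler's implementation: the function names are hypothetical, and the use of the sample standard deviation (statistics.stdev) is an assumption, since the thread does not say which estimator is used.

```python
import statistics


def threshold_from_stats(mean_score, stdev_score, num_stdevs=5):
    """Threshold = mean random score + num_stdevs standard deviations."""
    return mean_score + num_stdevs * stdev_score


def low_score_threshold(random_scores, num_stdevs=5):
    """Derive the threshold from scores of random-sequence alignments.

    random_scores is a hypothetical list of alignment scores obtained under
    the current scoring scheme; sample stdev is an assumption.
    """
    mean_score = statistics.mean(random_scores)
    stdev_score = statistics.stdev(random_scores)
    return threshold_from_stats(mean_score, stdev_score, num_stdevs)


# The worked example from above: 59.7 + 1.7 * 5
print(round(threshold_from_stats(59.7, 1.7), 1))  # 68.2
```

Because the random scores depend only on the scoring scheme, the same --scores value should always lead to the same threshold, whatever reads are supplied.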
Ryan
Many thanks. Unicycler should be on Galaxy main (http://usegalaxy.org) in a week or so.
What are the recommended defaults for the following options: