marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

errorRate help? #173

Closed: jasonsydes closed this issue 8 years ago

jasonsydes commented 8 years ago

Hi Canu! I've run canu v1.2 to completion once with a ~30X coverage dataset, partially run canu v1.2 on a ~20X coverage dataset (finished correction/trimming stages), and after the v1.3 release, I'm now running canu v1.3 on my ~20X dataset.

I'm trying to figure out whether or not to set errorRate (or some other *errorRate parameter, e.g. utgGraphErrorRate) for my ~20X and ~30X datasets. Various places in the readthedocs documentation suggest that setting errorRate=0.035 might be good for my datasets. However, running 'canu' shows the following message:

The errorRate is not used correctly (we're working on it).  Don't set it
If you want to change the defaults, use the various utg*ErrorRate options.

And I don't recall exactly, but I seem to remember seeing some advice sprinkled throughout the github issues suggesting that you should (or should not?) leave errorRate untouched and let canu figure it out for you?

canu v1.3 also appears to drop several of the *ErrorRate parameters from its startup report:

canu v1.2, errorRate set to 0.035

-- Final error rates before starting pipeline:
--
--   genomeSize          -- 300000000
--   errorRate           -- 0.035
--
--   corOvlErrorRate     -- 0.105
--   obtOvlErrorRate     -- 0.105
--   utgOvlErrorRate     -- 0.105
--
--   obtErrorRate        -- 0.105
--
--   utgGraphErrorRate   -- 0.07
--   utgBubbleErrorRate  -- 0.0875
--   utgMergeErrorRate   -- 0.0525
--   utgRepeatErrorRate  -- 0.07
--
--   cnsErrorRate        -- 0.0875

canu v1.3, errorRate set to 0.035

-- Final error rates before starting pipeline:
--
--   genomeSize          -- 300000000
--   errorRate           -- 0.035
--
--   corOvlErrorRate     -- 0.105
--   obtOvlErrorRate     -- 0.105
--   utgOvlErrorRate     -- 0.105
--
--   obtErrorRate        -- 0.105
--
--   cnsErrorRate        -- 0.105

canu v1.3, errorRate default (0.025)

-- Final error rates before starting pipeline:
--
--   genomeSize          -- 300000000
--   errorRate           -- 0.025
--
--   corOvlErrorRate     -- 0.075
--   obtOvlErrorRate     -- 0.075
--   utgOvlErrorRate     -- 0.075
--
--   obtErrorRate        -- 0.075
--
--   cnsErrorRate        -- 0.075

So the various *ErrorRate parameters (if left at their defaults) are set to 3 times the value of errorRate?
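A quick way to check that guess is to divide each reported value by errorRate. The sketch below does this for the three log excerpts above; the numbers are copied from those excerpts, and the helper name `multipliers` is made up for illustration. It suggests that the 3x relationship holds for every value shown by v1.3, while v1.2 reported smaller multiples for the utg*ErrorRate and cnsErrorRate values.

```python
# Values copied from the three "Final error rates" excerpts in this thread.
# `multipliers` is a hypothetical helper, not part of canu.
LOGS = {
    "v1.2, errorRate=0.035": (0.035, {
        "corOvlErrorRate": 0.105, "obtOvlErrorRate": 0.105,
        "utgOvlErrorRate": 0.105, "obtErrorRate": 0.105,
        "utgGraphErrorRate": 0.07, "utgBubbleErrorRate": 0.0875,
        "utgMergeErrorRate": 0.0525, "utgRepeatErrorRate": 0.07,
        "cnsErrorRate": 0.0875,
    }),
    "v1.3, errorRate=0.035": (0.035, {
        "corOvlErrorRate": 0.105, "obtOvlErrorRate": 0.105,
        "utgOvlErrorRate": 0.105, "obtErrorRate": 0.105,
        "cnsErrorRate": 0.105,
    }),
    "v1.3, errorRate=0.025": (0.025, {
        "corOvlErrorRate": 0.075, "obtOvlErrorRate": 0.075,
        "utgOvlErrorRate": 0.075, "obtErrorRate": 0.075,
        "cnsErrorRate": 0.075,
    }),
}


def multipliers(error_rate, derived):
    """Return each derived rate as a multiple of errorRate, rounded to 2 places."""
    return {name: round(value / error_rate, 2) for name, value in derived.items()}


if __name__ == "__main__":
    for label, (rate, derived) in LOGS.items():
        print(label)
        for name, mult in multipliers(rate, derived).items():
            print(f"  {name:<20} {mult}x errorRate")
```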

Finally, the v1.3 Changes refer to things like "auto-set error rate" and "Auto-set MHAP and other parameters based on genome coverage".

In the end, I'm just trying to figure out how best to set the errorRate/*ErrorRate parameters (if at all) for my two datasets (and hopefully understand these parameters better), and was hoping you might be able to point me in the right direction. There's a bit of contradictory information out there, and it seems that canu is undergoing rapid development. For the moment, I'm running canu on a single machine w/ 512MB RAM and 96 threads available.

Thank you for your time!

skoren commented 8 years ago

I would follow the quick-start for low coverage datasets and set the error rate to 0.035.
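As a rough sketch of what that invocation might look like, the snippet below assembles a canu command line with errorRate=0.035 and prints it without running anything. The genome size comes from the log excerpts in the question (300000000, written here as 300m); the prefix, output directory, and reads file name are placeholders, and `-pacbio-raw` is an assumption because the thread does not say which read type is being assembled.

```python
import shlex


def canu_command(prefix, directory, genome_size, error_rate, reads,
                 read_flag="-pacbio-raw"):
    """Build (but do not run) a canu argument list.

    read_flag is an assumption: swap in the flag matching your read type.
    """
    return [
        "canu",
        "-p", prefix,                  # assembly prefix (placeholder)
        "-d", directory,               # output directory (placeholder)
        f"genomeSize={genome_size}",
        f"errorRate={error_rate}",
        read_flag, reads,
    ]


argv = canu_command("asm", "asm-20x", "300m", 0.035, "reads.fastq")

if __name__ == "__main__":
    # Print a shell-safe version of the command for copy and paste.
    print(" ".join(shlex.quote(a) for a in argv))
```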

There used to be more parameters you had to set for low-coverage datasets, but 1.3 auto-sets all of those now. The error rate is also an upper bound in 1.3: it will compute overlaps out to 3*errorRate, then compute the distribution of error rates in your data and auto-set the parameters used for constructing unitigs. This is why the other error rates disappeared from the report in 1.3.
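To make the "upper bound plus measured distribution" idea concrete, here is a toy sketch. It is not canu's code: the function name `auto_threshold`, the median-plus-deviations rule, and the default of 6 deviations are all assumptions chosen for illustration. It only shows how a cutoff could be derived from the observed overlap error rates while never exceeding 3*errorRate.

```python
from statistics import median


def auto_threshold(overlap_error_rates, error_rate, deviations=6.0):
    """Pick an error-rate cutoff from observed overlaps (illustrative only).

    Overlaps above 3*error_rate are ignored, mirroring the upper bound
    described above. The median + deviations*MAD rule is an assumption.
    """
    bound = 3 * error_rate
    kept = [e for e in overlap_error_rates if e <= bound]
    if not kept:
        raise ValueError("no overlaps at or below 3*errorRate")
    med = median(kept)
    # Median absolute deviation: a robust measure of spread.
    mad = median([abs(e - med) for e in kept])
    # Never exceed the user-supplied upper bound.
    return min(bound, med + deviations * mad)


if __name__ == "__main__":
    # Made-up overlap error rates; the 0.12 overlap falls outside the bound.
    observed = [0.02, 0.03, 0.03, 0.04, 0.05, 0.12]
    print(auto_threshold(observed, error_rate=0.035))
```

With a tight distribution the cutoff lands well below the bound, and with a noisy one it is clipped at 3*errorRate, which matches the description of errorRate acting as a ceiling rather than a fixed setting.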

jasonsydes commented 8 years ago

Ok, this is great, thank you for the clarification. It's great to know that a) errorRate actually does work correctly now, and b) that 1.3 auto-sets all those other variables. Thank you very much, this helps a lot!!