marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Suggestion on how to obtain a less fragmented assembly #254

Closed zhen-fu closed 7 years ago

zhen-fu commented 7 years ago

First of all, thank you for developing such an awesome program; it is not very demanding on computational resources compared to, e.g., Celera or Falcon…

We work on a small organism where DNA has to be extracted from many individuals (~5000 nematodes; they are highly inbred). Our dataset mixes three different PacBio chemistries: XL, C2-P4, and C3-P5. Admittedly the data are old and the read lengths are not great, but in total we have over 120x coverage of an 80-90 Mb genome.

However, no matter how I assemble, testing error rates from 0.013 to 0.025 and all kinds of parameters (corMhapSensitivity, corMinCoverage, and corOutCoverage), the best assembly I have obtained so far, with error rate = 0.025 and otherwise default settings, has an N50 of 93 kb across ~5200 contigs. Should I be happy with this assembly? In most cases I see, the N50 is an order of magnitude higher than mine.

Are there recommended parameters for running on pooled data, or how can I fine-tune this? Thank you very much.
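(For context, an invocation of the kind described above would look roughly like the following; the prefix, output directory, genome size, and read paths are placeholders, the parameter values are illustrative only, and errorRate is the single global error-rate knob in the Canu 1.3/1.4-era releases.)

    canu -p asm -d asm_err0.025 \
      genomeSize=85m errorRate=0.025 \
      corMhapSensitivity=high corMinCoverage=4 corOutCoverage=40 \
      -pacbio-raw raw_reads/*.fastq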

brianwalenz commented 7 years ago

There are a bunch of outputs that we can look at. We only care about the 'unitigging' results for now. Some of these will be in the html outputs, but I suspect that is (still) incomplete.

unitigging/*.ovlStore.summary has a reasonable guess at how well the reads will span repeats. The repeat-dove and uniq-repeat-dove categories are bad, span-repeat is good, and unique is excellent. For comparison, Drosophila has 0.27%, 3.33%, and 68.73%, respectively.
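A quick way to pull those category lines out of the summary, assuming the category names appear verbatim in the file and the default directory layout (a sketch, not a guaranteed format):

    grep -E 'unique|span-repeat|repeat-dove' unitigging/*.ovlStore.summary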

The other outputs are in the 4-unitigger directory. Before we look in there, you'll need to be using the very latest code, not the 1.3 release. Look at test.001.filterOverlaps.thr000.num000.log for a hint on how much variation is in your reads. You want to see most reads with 'two best edges' in the 'FINAL EDGES'. For error rates, drosophila reported:

median 0.00270000 mad    0.00160000 -> 0.01693296 fraction error =   1.693296% error
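Both pieces of information can be pulled straight from that log; 'test' below stands in for your assembly prefix, and the path assumes the default layout:

    grep -A 8 'FINAL EDGES' unitigging/4-unitigger/test.001.filterOverlaps.thr000.num000.log
    grep 'fraction error'   unitigging/4-unitigger/test.001.filterOverlaps.thr000.num000.log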

Lastly, compare the various *.sizes reports, in particular 'mergeOrphans' vs 'breakRepeats'. If things get dramatically smaller, repeats are a problem.
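For a quick side-by-side view, the NG50 line can be pulled from every report at once (the directory is an assumption based on the default layout):

    grep 'ng050' unitigging/4-unitigger/*.sizes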

Issue #221 is worth reading.

zhen-fu commented 7 years ago

Brian, thanks a ton for the suggestions and for pointing me to the other links; that is very helpful. Based on your suggestions, I reran the assembly with the latest code. Below are some answers:

"There are a bunch of outputs that we can look at. We only care about the 'unitigging' results for now. Some of these will be in the html outputs, but I suspect that is (still) incomplete. unitigging/*.ovlStore.summary has a reasonable guess at how well the reads will span repeats. The repeat-dove and uniq-repeat-dove categories are bad, span-repeat is good. And unique is excellent. For comparison, drosophila has 0.27%, 3.33% and 68.73%, respectively."

For my case, repeat-dove, uniq-repeat-dove, and span-repeat are 0.07%, 2.55%, and 23.95%, respectively, so the majority of the reads were repeat-cont, uniq-repeat-cont, and unique.

"The other outputs are in the 4-unitigger directory. Before we look in there, you'll need to be using the very latest code, not the 1.3 release. Look at test.001.filterOverlaps.thr000.num000.log for a hint on how much variation is in your reads. You want to see most reads with 'two best edges' in the 'FINAL EDGES'. For error rates, drosophila reported: median 0.00270000 mad 0.00160000 -> 0.01693296 fraction error = 1.693296% error"

My case, from FINAL EDGES:

2757943 reads are contained
34165 reads have no best edges (singleton)
55714 reads have only one best edge (spur)
    14866 are mutual best
114075 reads have two best edges    ## small portion; I guess this suggests heterogeneity of the reads?
    4328 have one mutual best edge
    95727 have two mutual best edges

I do not know what proportion of reads have two best edges in the Drosophila dataset; it should be much higher, right?

Error rates from my samples:

mean   0.01362926 stddev 0.01038013 -> 0.07591003 fraction error = 7.591003% error
median 0.01100000 mad    0.00710000 -> 0.07415876 fraction error = 7.415876% error

"Lastly, compare the various *sizes reports. In particular, the 'mergeOrphans' vs 'breakRepeats'. If things get dramatically smaller, repeats are a problem. Issue #221 is worth reading."

Output from xx.005.mergeOrphans.sizes:

BUBBLE  (20 tigs)    (56841 length)     (2842 average)  (0.00x coverage)
CONTIGS (13055 tigs) (138369098 length) (10598 average) (1.38x coverage)
ng010 234022 lg010   30 sum  10111829 (CONTIGS)
ng020 152373 lg020   85 sum  20003508 (CONTIGS)
ng030 104216 lg030  166 sum  30046585 (CONTIGS)
ng040  76928 lg040  276 sum  40019496 (CONTIGS)
ng050  56124 lg050  430 sum  50038015 (CONTIGS)
ng060  42441 lg060  636 sum  60034398 (CONTIGS)
ng070  30116 lg070  918 sum  70020836 (CONTIGS)
ng080  20643 lg080 1314 sum  80004889 (CONTIGS)
ng090  13835 lg090 1912 sum  90009808 (CONTIGS)
ng100   9206 lg100 2804 sum 100007413 (CONTIGS)
ng110   6373 lg110 4118 sum 110000209 (CONTIGS)
ng120   4398 lg120 6015 sum 120001650 (CONTIGS)
ng130   2817 lg130 8871 sum 130000752 (CONTIGS)

Output from xx.007.breakRepeats.sizes:

BUBBLE  (19 tigs)    (54952 length)     (2892 average)  (0.00x coverage)
REPEAT  (1061 tigs)  (5702207 length)   (5374 average)  (0.06x coverage)
CONTIGS (13692 tigs) (137040436 length) (10008 average) (1.37x coverage)
ng010 188284 lg010   38 sum  10126422 (CONTIGS)
ng020 120548 lg020  104 sum  20028787 (CONTIGS)
ng030  90826 lg030  201 sum  30007131 (CONTIGS)
ng040  65101 lg040  333 sum  40028916 (CONTIGS)
ng050  47592 lg050  513 sum  50030585 (CONTIGS)
ng060  34382 lg060  760 sum  60018996 (CONTIGS)
ng070  25340 lg070 1100 sum  70010224 (CONTIGS)
ng080  17530 lg080 1575 sum  80010518 (CONTIGS)
ng090  11935 lg090 2275 sum  90010685 (CONTIGS)
ng100   8182 lg100 3292 sum 100001050 (CONTIGS)
ng110   5744 lg110 4758 sum 110000832 (CONTIGS)
ng120   3978 lg120 6847 sum 120001566 (CONTIGS)
ng130   2575 lg130 9966 sum 130002430 (CONTIGS)

I tried increasing -dg and -db in unitigger.sh following issue #221; when I increased them to 15, there was not much change in the NG50. I am a bit afraid to increase them too much since the error rate was high in my run. So I am guessing the dataset is very fragmented because my reads are from old PacBio chemistries.
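For reference, the current -dg/-db values can be checked directly in the script before editing (the path assumes the default layout):

    grep -oE '(-dg|-db) [0-9.]+' unitigging/4-unitigger/unitigger.sh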

Are there any other suggestions you would make here? Thanks again for your time!

skoren commented 7 years ago

It doesn't look like bogart is splitting the assembly due to repeats. Do you have the histogram outputs from before and after correction? They should be in stdout/stderr during the run, or in canu-scripts if you're running on a cluster. We want to confirm you are ending up with at least 25x coverage after correction and that the read lengths don't degrade too much. It will also give a better idea of your read-length distribution, given that the data are a mix of different chemistries.
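If the logs were captured, a sketch of how to pull the histograms out of them (the canu-scripts/*.out glob is an assumption about where stdout/stderr ended up on your cluster):

    grep -A 15 'Read length histogram' canu-scripts/*.out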

zhen-fu commented 7 years ago

Hi Sergey, thanks for looking at this. I tried assemblies with Canu 1.3 and also with the code updated two weeks ago; I found that the latter did not output the histogram before correction, so the histograms below are from Canu 1.3:

 -- In gatekeeper store   ##### before correction     '/data/cahnrs/entomology/daisy_fu/canu/work/canu_epn7_advanced/all_reads_advanced4/correction/stei.gkpStore':
--   Found 5496219 reads.
--   Found 11005480274 bases (110.05 times coverage).
--
--   Read length histogram (one '*' equals 51930.28 reads):
--        0    999      0 
--     1000   1999 3635120 **********************************************************************
--     2000   2999 1205484 ***********************
--     3000   3999 375633 *******
--     4000   4999 135385 **
--     5000   5999  63056 *
-- In gatekeeper store    #### before trimming '/data/cahnrs/entomology/daisy_fu/canu/work/canu_epn7_advanced/all_reads_advanced4/trimming/stei.gkpStore':
--   Found 5344796 reads.
--   Found 10398889566 bases (103.98 times coverage).
--
--   Read length histogram (one '*' equals 52012.45 reads):
--        0    999      0 
--     1000   1999 3640872 **********************************************************************
--     2000   2999 1120679 *********************
--     3000   3999 342088 ******
--     4000   4999 121927 **
--     5000   5999  55082 *
-- In gatekeeper store            ### before unitigging  '/data/cahnrs/entomology/daisy_fu/canu/work/canu_epn7_advanced/all_reads_advanced4/unitigging/stei.gkpStore':
--   Found 2961897 reads.
--   Found 5342472789 bases (53.42 times coverage).
--
--   Read length histogram (one '*' equals 30732.54 reads):
--        0    999      0 
--     1000   1999 2151278 **********************************************************************
--     2000   2999 594910 *******************
--     3000   3999 144301 ****
--     4000   4999  40122 *

So I think coverage-wise we are good; we did not see much change in the histogram before and after correction, right? But admittedly the chemistries are old and the read lengths were not great. I found that correction with Sprai (http://zombie.cb.k.u-tokyo.ac.jp/sprai/README.html), followed by trimming and assembly with Canu, has generated the best results so far, with the N50 improved by ~20% and fewer missing BUSCOs in the assembly. So I am happy with that.
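For anyone repeating this, a sketch of that hybrid workflow, handing externally corrected reads to Canu so that only trimming and assembly are run (file names and genome size are placeholders):

    canu -p asm -d sprai_asm \
      genomeSize=85m \
      -pacbio-corrected sprai_corrected_reads.fasta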

Last question: what should we do with contigs labeled "suggestRepeat=yes" in the final output? I found that most of the contigs with "suggestRepeat=yes" are relatively short. Should we just remove them from the assembly?
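For a quick look at how many contigs carry the label, something like the following works if the label appears in the FASTA headers of the final contigs file (asm.contigs.fasta is a placeholder for the actual output prefix):

    grep -c 'suggestRepeat=yes' asm.contigs.fasta
    grep 'suggestRepeat=yes' asm.contigs.fasta | head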

Thanks again for such a great program and for all the support you have provided.

skoren commented 7 years ago

Yeah, though I am surprised that you lose so much coverage between trimming and correction; I assume this is using corMinCoverage=0? Given that your longest reads are 5 kb, I think you are primarily limited by read length in your assembly. Most repeats over 1-2 kb won't be resolved with your data.

As for repeat contigs, you should keep them in the assembly. They have higher than expected coverage or were split out by the assembler because they could not be resolved by the reads. They're still part of your genome; they just can't be uniquely placed.

zhen-fu commented 7 years ago

I thought the default is corOutCoverage=40, and that we should set it higher for higher coverage? Maybe I am wrong?

skoren commented 7 years ago

Yes, corOutCoverage defaults to 40. Increasing it gives more coverage for correction, but for high-coverage genomes you typically shouldn't need to increase it.

Sorry, I meant corMinCoverage in my previous comment. Trimming is losing about 50% of your coverage, which usually happens when you use the low-coverage parameters. It would be concerning if you didn't change corMinCoverage and trimming still lost so much data.
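In other words, a rerun that leaves corMinCoverage at its default rather than the low-coverage setting of 0 would look roughly like this; the prefix, directory, genome size, and read paths are placeholders, and corOutCoverage=40 is listed only because it is the stated default:

    canu -p asm -d rerun_default_cormin \
      genomeSize=85m corOutCoverage=40 \
      -pacbio-raw raw_reads/*.fastq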

zhen-fu commented 7 years ago

I will try it again. Thanks for the clarification!