marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/
654 stars 179 forks source link

Long unitigging overlapper step #770

Closed jychoilab closed 6 years ago

jychoilab commented 6 years ago

Hello

Sorry to ask a non-issue related question but I wanted to ask a general question with using canu for plant genome assembly using nanopore data. We have a~100X coverage for our plant using nanopore and I used canu with the following default paramters:

canu -p plant -d plant_default_param genomeSize=380m maxMemory=400g maxThreads=48 corMaxEvidenceErate=0.15 -nanopore-raw plant.fastq.gz

I haven't had problems and right now its on the unitigging overlapper step and there's a couple of jobs that are basically hanging for close to 2 weeks. I'm hoping they will eventually finish at some point but I have another 100X nanopore data from a different individual and was wondering how to avoid the long hang of the overlapper step. I was thinking reads with low quality may be a problem, or non-sense reads those that are tandem repeats (close to 10kb in length) of short K-mers that are not biological but noise generated from the sequencing reaction may be causing the hang? Are there any recommended pre quality control steps one should do on the nanopore reads before using canu to assemble them?

Thank you

skoren commented 6 years ago

This is the same as the trimming jobs running slowly, see issue #521 for suggestions. Your best option is probably to run the assembly with overlapper=mhap utgReAlign=true. I'd let this assembly finish and then run a new one with the above option and compare the results. If the assemblies are similar contiguity and quality, switch to the faster option for your subsequent individuals.

brianwalenz commented 6 years ago

Did it run fine through the trimming step?

overlapper=mhap is essentially the correct solution (and we're going to switch to that at some point), but I'm curious why it's only getting stuck here.

Can you share the *.histogram files in the trimming/0-mercounts and unitigging/0-mercounts directories?

What is the number in *.ms22.estMerThresh.out? Kmers with count above this are not used for seeding overlaps. It could be slow because it's aligning a loooooong overlap, or it could be slow because it's looking too hard for repeat overlaps.

And if not, just close the issue.

jychoilab commented 6 years ago

Hi Brian The trimming step was ok and it was the overlapping step that was the problem. I'll try with the overlapper=mhap option as well to see if it makes a difference.

The number I got for the trimming/0-mercounts/basmati334.ms22.estMerThresh.out: 1950 And unitigging/0-mercounts/basmati334.ms22.estMerThresh.out: 1850

The histogram files are: trimming.ms22.histogram.txt unitigging.ms22.histogram.txt

jychoilab commented 6 years ago

Its been a while but I thought I would share my results with the community on using the overlapper=mhap option that was suggested. So when I use overlapper=mhap option for my plant genome, the assembly time is very quick taking about ~4days to assemble a plant genome thats expected to be 340Mbp. But the assembly stats looked Ok-ish

--   contigs:      3592 sequences, total length 296475883 bp (including 357 repeats of total length 25356880 bp).
--   bubbles:      0 sequences, total length 0 bp.
--   unassembled:  357230 sequences, total length 2607305874 bp.
--
-- Contig sizes based on genome size --
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10      325853            85    38309683
--     20      212771           233    76200279
--     30      150598           447   114134174
--     40      110947           744   152083070
--     50       82361          1144   190001421
--     60       59019          1689   228017353
--     70       39440          2477   266038087

But when I ran canu with default params it look almost 10times as longer ie. about a month but the assembly stats looks as following:

--   contigs:      3057 sequences, total length 356564119 bp (including 301 repeats of total length 5405095 bp).
--   bubbles:      0 sequences, total length 0 bp.
--   unassembled:  77924 sequences, total length 577972789 bp.
--
-- Contig sizes based on genome size --
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10      893221            33    38192837
--     20      657741            83    76573300
--     30      492260           150   114008458
--     40      375851           238   152024830
--     50      292354           352   190032017
--     60      218501           504   228054553
--     70      146478           716   266069700
--     80       83034          1057   304071404
--     90       23487          1893   342007095

Seems like for my genome I can't cut corners and will have to just do it the default way.

skoren commented 6 years ago

Thanks for the update, when you ran the fast option did you use just overlapper=mhap or overlapper=mhap utgReAlign=true? The latter would work better if you didn't already try it.