Closed jychoilab closed 6 years ago
This is the same as the trimming jobs running slowly, see issue #521 for suggestions. Your best option is probably to run the assembly with overlapper=mhap utgReAlign=true
. I'd let this assembly finish and then run a new one with the above option and compare the results. If the assemblies are similar contiguity and quality, switch to the faster option for your subsequent individuals.
Did it run fine through the trimming step?
overlapper=mhap
is essentially the correct solution (and we're going to switch to that at some point), but I'm curious why it's only getting stuck here.
Can you share the *.histogram files in the trimming/0-mercounts and unitigging/0-mercounts directories?
What is the number in *.ms22.estMerThresh.out? Kmers with count above this are not used for seeding overlaps. It could be slow because it's aligning a loooooong overlap, or it could be slow because it's looking too hard for repeat overlaps.
And if not, just close the issue.
Hi Brian The trimming step was ok and it was the overlapping step that was the problem. I'll try with the overlapper=mhap option as well to see if it makes a difference.
The number I got for the trimming/0-mercounts/basmati334.ms22.estMerThresh.out: 1950 And unitigging/0-mercounts/basmati334.ms22.estMerThresh.out: 1850
The histogram files are: trimming.ms22.histogram.txt unitigging.ms22.histogram.txt
Its been a while but I thought I would share my results with the community on using the overlapper=mhap
option that was suggested.
So when I use overlapper=mhap
option for my plant genome, the assembly time is very quick taking about ~4days to assemble a plant genome thats expected to be 340Mbp. But the assembly stats looked Ok-ish
-- contigs: 3592 sequences, total length 296475883 bp (including 357 repeats of total length 25356880 bp).
-- bubbles: 0 sequences, total length 0 bp.
-- unassembled: 357230 sequences, total length 2607305874 bp.
--
-- Contig sizes based on genome size --
-- NG (bp) LG (contigs) sum (bp)
-- ---------- ------------ ----------
-- 10 325853 85 38309683
-- 20 212771 233 76200279
-- 30 150598 447 114134174
-- 40 110947 744 152083070
-- 50 82361 1144 190001421
-- 60 59019 1689 228017353
-- 70 39440 2477 266038087
But when I ran canu with default params it look almost 10times as longer ie. about a month but the assembly stats looks as following:
-- contigs: 3057 sequences, total length 356564119 bp (including 301 repeats of total length 5405095 bp).
-- bubbles: 0 sequences, total length 0 bp.
-- unassembled: 77924 sequences, total length 577972789 bp.
--
-- Contig sizes based on genome size --
-- NG (bp) LG (contigs) sum (bp)
-- ---------- ------------ ----------
-- 10 893221 33 38192837
-- 20 657741 83 76573300
-- 30 492260 150 114008458
-- 40 375851 238 152024830
-- 50 292354 352 190032017
-- 60 218501 504 228054553
-- 70 146478 716 266069700
-- 80 83034 1057 304071404
-- 90 23487 1893 342007095
Seems like for my genome I can't cut corners and will have to just do it the default way.
Thanks for the update, when you ran the fast option did you use just overlapper=mhap
or overlapper=mhap utgReAlign=true
? The latter would work better if you didn't already try it.
Hello
Sorry to ask a non-issue related question but I wanted to ask a general question with using canu for plant genome assembly using nanopore data. We have a~100X coverage for our plant using nanopore and I used canu with the following default paramters:
canu -p plant -d plant_default_param genomeSize=380m maxMemory=400g maxThreads=48 corMaxEvidenceErate=0.15 -nanopore-raw plant.fastq.gz
I haven't had problems and right now its on the unitigging overlapper step and there's a couple of jobs that are basically hanging for close to 2 weeks. I'm hoping they will eventually finish at some point but I have another 100X nanopore data from a different individual and was wondering how to avoid the long hang of the overlapper step. I was thinking reads with low quality may be a problem, or non-sense reads those that are tandem repeats (close to 10kb in length) of short K-mers that are not biological but noise generated from the sequencing reaction may be causing the hang? Are there any recommended pre quality control steps one should do on the nanopore reads before using canu to assemble them?
Thank you