Dear Canu Developer,
My run using the above-mentioned command just finished, and the assembly stats are as follows-
-- Finished on Sun Jun 14 19:17:20 2020 (94 seconds) with 14899.957 GB free disk space
----------------------------------------
-- Found, in version 2, after consensus generation:
-- contigs: 9750 sequences, total length 373106650 bp (including 362 repeats of total length 5629021 bp).
-- bubbles: 822 sequences, total length 18200116 bp.
-- unassembled: 89580 sequences, total length 539711020 bp.
--
-- Contig sizes based on genome size 410mbp:
--
-- NG (bp) LG (contigs) sum (bp)
-- ---------- ------------ ----------
-- 10 173394 171 41141083
-- 20 121423 457 82073523
-- 30 90671 849 123064508
-- 40 70068 1366 164043253
-- 50 53390 2036 205016015
-- 60 39996 2924 246019688
-- 70 29127 4122 287010287
-- 80 19556 5833 328019086
-- 90 7320 8804 369002202
--
-- Report changed.
-- Finished stage 'consensusLoad', reset canuIteration.
-- No change in report.
Clearly, with 112x coverage, a contig NG50 of 53,390 bp does not seem reasonable. I would be really grateful if you could advise me on how to proceed. I am attaching the run report for this assembly attempt.
thanks and regards, Amit
Canu_Run_06062020_complete_run_report.txt
First, NG50 isn't really related to coverage per se; it's more related to read lengths and genome repetitiveness. However, your corrected and trimmed reads do not look good: they have no clear k-mer peak and, as you said, the coverage is quite low.
I think corMaxEvidenceErate is the main issue; you're eliminating most of the overlaps that can be used to correct the reads. The FAQ only recommends that option for genomes with significantly skewed AT/GC content, not for plants in general. Unless this plant is <25% GC or >75% GC, it should not be used. You would need to restart the assembly from scratch without that option. You can probably also use corMhapSensitivity=normal given the high coverage, which will make the run faster.
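For reference, a minimal sketch of such a restart, based on the command quoted later in this thread (the install path, read location, genome size, and memory limit are the user's own; the new prefix and output directory are hypothetical, and all correction options other than corMhapSensitivity are left at Canu 2.0 defaults):

# corMaxEvidenceErate, corMinCoverage=0 and corOutCoverage=200 from the
# original command are dropped so read correction runs with default settings.
/mnt/md0/canu-2.0/Linux-amd64/bin/canu \
  -p GI_rerun_defaults -d ./GI_rerun_defaults \
  genomeSize=410m \
  useGrid=0 \
  maxMemory=196g \
  corMhapSensitivity=normal \
  -pacbio-raw ./pacbio_raw/*.fastq.gz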
Thanks, Koren. I really appreciate your help. I reran the assembly with just the default parameters, and the results are far better for sure. Now my entire assembly resulted in under 1000 contigs with an N50 over 6 Mb. For Sequel 1 raw PacBio reads from a plant, this is still a decent assembly, I think.
I think the next step would be to optimize parameters. Based on the manual, one straightforward way would be to run the entire assembly process once and then use the trimmed, corrected reads for assembly with different corrected error rate parameters. I have a question about that.
For the final assembly step, the assembler first identifies overlaps between the corrected, trimmed reads and then filters them based on the given corrected error rate parameter (that's my assumption; I'm not sure). So, for each parameter value, the assembler would first create an overlap database and then proceed. Is it possible to reuse a subset of the first run to test different parameters? I thought that if this is possible, it would save a lot of time and computational resources.
thanks and regards Amit
I have another question, if you could help me please. Towards the end of the assembly process, I saw these outputs-
-- Starting command on Fri Jun 26 10:22:50 2020 with 1719694.288 GB free disk space
cd unitigging
/lustre7/home/lustre3/amit-rai8chiba/canu-2.0/Linux-amd64/bin/tgStoreDump \
-S ../GI_24thJune2020_0.045_unitig.seqStore \
-T ./GI_24thJune2020_0.045_unitig.ctgStore 2 \
-sizes -s 410000000 \
> ./GI_24thJune2020_0.045_unitig.ctgStore/seqDB.v002.sizes.txt
sqStore_loadMetadata()-- Using 'corrected-trimmed' 0x10 reads.
ZERO length suplied to intervalList::add(), ignoring interval.
ZERO length suplied to intervalList::add(), ignoring interval.
ZERO length suplied to intervalList::add(), ignoring interval.
ZERO length suplied to intervalList::add(), ignoring interval.
ZERO length suplied to intervalList::add(), ignoring interval.
Is this some kind of error, or do we need to provide some length? I previously used Canu 1.6 but never saw this message, so I was not sure what it means.
You can safely ignore those warnings.
You can try varying parameters without re-running correction and trimming, especially if you set a high value for corOutCoverage. If you want to get more corrected reads (e.g. 80x instead of the default 40x), you may need to go back and re-run the correction with increased corMhapSensitivity and corOutCoverage.
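To make that concrete, here is a hedged sketch of both approaches, assuming Canu 2.0's read-status flags (-corrected, -trimmed), the -correct run mode, and the standard <prefix>.trimmedReads.fasta.gz output file; the prefixes, directories, and the 0.045/80x values below are illustrative only, not taken from this thread:

# Option A (sketch): assemble-only re-runs on the existing corrected, trimmed
# reads, one output directory per correctedErrorRate value being tested.
canu -p GI_cer045 -d ./GI_cer045 \
  genomeSize=410m \
  correctedErrorRate=0.045 \
  -corrected -trimmed \
  -pacbio ./previous_run/previous_prefix.trimmedReads.fasta.gz

# Option B (sketch): redo correction from the raw reads while keeping more
# corrected coverage, then trim and assemble from that output.
canu -correct -p GI_cor80 -d ./GI_cor80 \
  genomeSize=410m \
  corMhapSensitivity=high \
  corOutCoverage=80 \
  -pacbio-raw ./pacbio_raw/*.fastq.gz

Each assemble-only run in Option A still builds its own overlap store for unitigging, but it skips correction and trimming, which are typically the most expensive stages.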
The Canu command I used was as follows-
/mnt/md0/canu-2.0/Linux-amd64/bin/canu \
  -p GI_5thJune2020 -d ./GI_5thJune2020 \
  genomeSize=410m \
  -pacbio-raw ./pacbio_raw/*.fastq.gz \
  useGrid=0 \
  corMhapSensitivity=high \
  corMinCoverage=0 \
  corOutCoverage=200 \
  correctedErrorRate=0.085 \
  corMaxEvidenceErate=0.15 \
  gridOptionsJobName=GI \
  maxMemory=196g
I used Canu 2.0 for this run and ran it locally on my Ubuntu server, which has 36 cores (72 CPUs) and 512 GB of RAM.
I am attaching the run report and progress for this issue: Canu_Run_05062020.txt
The objective is to achieve a high-quality plant genome. The species I am working on is heterozygous, and its relatives have been shown to have over 55% repeat content. For this project, I acquired PacBio Sequel 1 data at 112x genome coverage (genome size 410 Mb). I followed the Canu 2.0 manual and decided to first correct as many reads as possible, so that I could try different correctedErrorRate values and the other options mentioned in the manual to fine-tune parameters for my plant. As recommended by the Canu 2.0 manual, I used correctedErrorRate=0.085 (for Sequel 1), and corMaxEvidenceErate=0.15, since the manual says that for plants it may be a good idea to set this option to remove many repetitive reads and speed up the whole process. Other than that, I opted for corMhapSensitivity=high and corMinCoverage=0, as I wanted almost all of the reads to be corrected.
Running this command, starting with 112x genome coverage, resulted in about 77x corrected reads (which according to the manual is fine, although I was expecting more since I used corMinCoverage=0), but after trimming only 17.22x coverage was left for assembly. Of course, the expectation is to get around 40x of the best reads. Looking at the overlap store statistics-
I have only 32.57% unique reads. While this run is still in progress, I am fairly sure the assembly will not be very good. So I wanted your advice on where and what I could have done differently to get a better result. Previously, I used 44x genome coverage with Sequel 1 and got a decent assembly, and hence decided to obtain more coverage for this plant.
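(For a rough sense of scale, a back-of-the-envelope conversion of those coverage figures into bases, assuming the 410 Mb genome size used throughout this thread:

coverage ≈ total read bases / genome size
  raw input         ~112x  →  112   × 410 Mb ≈ 45.9 Gb
  after correction   ~77x  →   77   × 410 Mb ≈ 31.6 Gb
  after trimming   ~17.2x  →  17.22 × 410 Mb ≈  7.1 Gb

so roughly three quarters of the corrected bases are being discarded at the trimming stage.)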
These are a few questions I had-
I am not sure if I was able to provide all the details, so kindly let me know if you need any further information.
thank you so much,
with best regards Amit