Closed harsh-shukla closed 4 years ago
Both those assemblies and parameters seem reasonable, I'll just note that 175mb assemblies of ISO1 aren't that much larger than CLR. We typically get a 160mb assembly for the old ISO1 data (e.g.: https://github.com/marbl/canu-regression/blob/master/results/2018-10-19-1728/drosophila_melanogaster.pacbio.p5c3.report).
That doesn't make much sense, it seems almost like you're assembling both haplotypes together. That would explain both the assembly size and the continuity drop. I do see lots of reads with low coverage not present in the other reports which again is consistent with both haplotypes being mixed together (at low coverage). However, the histogram and the coverage don't support this, I'd expect to see a second peak in the histogram. Did you clean up everything from the old run when you re-tried the assembly? Can you share the ISO1 binned reads only to run locally.
The hifi option is still experimental and has some limitations in 1.9 (for example the gfa is not properly updated for the uncompressed space). However, it should produce better assemblies in terms of repeat resolution than pacbio-corrected, at least on complex genomes. I'm not sure there is enough complexity in drosophila for it to matter. The error rate for the hifi is not actually 2.5%, we just compute overlap out to that limit in case they can be corrected. By default it will only use overlaps below 0.03% error, much lower than used with pacbio-corrected. The 24kb library is much noisier than the 19kb library, in my experience, so you may be able to run the 19kb library with an even lower error rate (0%) by adding batOptions="-eg 0.0 eM 0.0 -dg 3 -db 3 -dr 1 -ca 50 -cp 5"
, that might also get rid of the mis-assembly (but I expect it won't work on the 24kb library).
Hi Sergey,
Thank you for your inputs.
For 19 kb library I think I did clear everything before running again. Using the latest release(Canu 1.9) I also binned the reads again. (the statistics were excatly same ; So I am hoping binning was not the issue). I am just worried that there might be some issue with the 19kb library that I am not able to figure out (But again the _-pacbiohifi options worked atleast for the 19kb ISO1 haplotype). I have not tried assembling the 19kb library with -pacbio-corrected options using the Canu 1.9 official release. Should I try that.
Meanwhile I have uploaded the ISO1 haplotype reads on the google drive. Here is the link https://drive.google.com/open?id=17kKvYmX8LUAak4YwI6wzqTmGgA28rGov
Please do let me know if you have any trouble accessing it.
I will run the 19kb library with the -pacbio-hifi and new batOptions
P.S : The illumina data (for trio binning) we have is really really old (2010-2012) and short (50-75bp). Though I have trimmed it aggressively. That should not be an issue.Right?
Best, Harsh
The illumina data is just used to find short (19 mers with drosophila I expect) so as long as you can get reliable 19mers then yes, the data shouldn't be an issue.
I tried your binned reads you shared (md5sum: 34ffcb9c90f5c965ae4f8c9576989ad6) and ran the 1.9 release as:
canu -d asm -p 'asm' 'genomeSize=150m' -pacbio-corrected '/gpfs/gsfs10/users/Phillippy/projects/hifi/dmel_24k_trio/haplotype/haplotype-19k-ISO.fasta.gz'
, I don't see the weird contig sizes you did:
[UNITIGGING/OVERLAPS]
-- category reads % read length feature size or coverage analysis
-- ---------------- ------- ------- ---------------------- ------------------------ --------------------
-- middle-missing 254 0.03 16618.95 +- 4918.58 3396.67 +- 3872.15 (bad trimming)
-- middle-hump 46 0.01 11932.20 +- 5946.59 849.22 +- 1598.61 (bad trimming)
-- no-5-prime 213 0.03 16948.42 +- 5328.72 573.05 +- 1312.85 (bad trimming)
-- no-3-prime 212 0.03 16766.92 +- 5054.80 943.50 +- 2913.30 (bad trimming)
--
-- low-coverage 3409 0.44 16823.09 +- 4911.97 9.32 +- 7.48 (easy to assemble, potential for lower quality consensus)
-- unique 660953 86.23 19523.75 +- 2382.01 107.87 +- 17.84 (easy to assemble, perfect, yay)
-- repeat-cont 19605 2.56 19125.52 +- 2369.08 1036.62 +- 793.83 (potential for consensus errors, no impact on assembly)
-- repeat-dove 691 0.09 24447.12 +- 2293.23 807.28 +- 602.90 (hard to assemble, likely won't assemble correctly or even at all)
--
-- span-repeat 9645 1.26 19640.06 +- 2513.95 7437.71 +- 6060.99 (read spans a large repeat, usually easy to assemble)
-- uniq-repeat-cont 51724 6.75 18967.94 +- 1980.88 (should be uniquely placed, low potential for consensus errors, no impact on assembly)
-- uniq-repeat-dove 9890 1.29 22312.44 +- 2704.27 (will end contigs, potential to misassemble)
-- uniq-anchor 9801 1.28 19753.76 +- 2401.71 9273.98 +- 5800.00 (repeat read, with unique section, probable bad read)
[UNITIGGING/ADJUSTMENT]
-- No report available.
[UNITIGGING/CONTIGS]
-- Found, in version 1, after unitig construction:
-- contigs: 588 sequences, total length 169225988 bp (including 55 repeats of total length 3418978 bp).
-- bubbles: 0 sequences, total length 0 bp.
-- unassembled: 5561 sequences, total length 97026688 bp.
--
-- Contig sizes based on genome size 150mbp:
--
-- NG (bp) LG (contigs) sum (bp)
-- ---------- ------------ ----------
-- 10 25691339 1 25691339
-- 20 23806050 2 49497389
-- 30 23806050 2 49497389
-- 40 11874210 3 61371599
-- 50 7624419 5 79707659
-- 60 5785509 7 92428693
-- 70 4290099 10 106474616
-- 80 1403683 17 121285253
-- 90 650386 31 135166572
-- 100 104436 95 150002937
-- 110 30296 397 165015465
--
[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
-- contigs: 588 sequences, total length 168900360 bp (including 55 repeats of total length 3425305 bp).
-- bubbles: 0 sequences, total length 0 bp.
-- unassembled: 5561 sequences, total length 97006905 bp.
--
-- Contig sizes based on genome size 150mbp:
--
-- NG (bp) LG (contigs) sum (bp)
-- ---------- ------------ ----------
-- 10 25644098 1 25644098
-- 20 23763191 2 49407289
-- 30 23763191 2 49407289
-- 40 11851751 3 61259040
-- 50 7610464 5 79558771
-- 60 5772787 7 92254467
-- 70 4280387 10 106273365
-- 80 1396434 17 121054319
-- 90 644150 32 135543297
-- 100 97827 99 150087695
-- 110 29273 407 165022497
--
I would guess the issue is ovlMerThreshold=75
option. You have over 100x coverage but this will ignore any k-mer that occurs more than 75 times in the histogram. You can see from the report, this will eliminate the true k-mers in the main peak:
[UNITIGGING/MERS]
--
-- 22-mers Fraction
-- Occurrences NumMers Unique Total
-- 1- 1 0 0.0000 0.0000
-- 2- 2 49595674 ********************************************************************** 0.2371 0.0068
-- 3- 4 27145805 ************************************** 0.3257 0.0106
-- 5- 7 9116624 ************ 0.3891 0.0145
-- 8- 11 2881814 **** 0.4158 0.0171
-- 12- 16 1045551 * 0.4256 0.0185
-- 17- 22 589042 0.4298 0.0194
-- 23- 29 619651 0.4324 0.0201
-- 30- 37 882546 * 0.4354 0.0213
-- 38- 46 1186291 * 0.4398 0.0235
-- 47- 56 1622309 ** 0.4456 0.0270
-- 57- 67 2377744 *** 0.4534 0.0329
-- 68- 79 5305119 ******* 0.4654 0.0438
-- 80- 92 15729302 ********************** 0.4928 0.0733
-- 93- 106 36972416 **************************************************** 0.5737 0.1750
-- 107- 121 37509231 **************************************************** 0.7559 0.4376
-- 122- 137 12596582 ***************** 0.9281 0.7183
-- 138- 154 1349814 * 0.9818 0.8166
-- 155- 172 275195 0.9874 0.8280
-- 173- 191 269613 0.9886 0.8310
-- 192- 211 301182 0.9899 0.8344
-- 212- 232 302795 0.9914 0.8386
-- 233- 254 195943 0.9928 0.8431
-- 255- 277 115304 0.9937 0.8463
-- 278- 301 86041 0.9943 0.8483
-- 302- 326 81351 0.9947 0.8500
-- 327- 352 79899 0.9951 0.8518
-- 353- 379 60519 0.9954 0.8536
-- 380- 407 53985 0.9957 0.8551
-- 408- 436 48699 0.9960 0.8566
-- 437- 466 45755 0.9962 0.8580
-- 467- 497 36056 0.9964 0.8594
-- 498- 529 32446 0.9966 0.8605
-- 530- 562 30803 0.9967 0.8617
-- 563- 596 33441 0.9969 0.8628
-- 597- 631 28681 0.9971 0.8641
-- 632- 667 27122 0.9972 0.8654
-- 668- 704 27014 0.9973 0.8666
-- 705- 742 23261 0.9974 0.8678
-- 743- 781 23480 0.9976 0.8690
-- 782- 821 19975 0.9977 0.8702
That was set for the human dataset in the paper which only had 22x coverage total. In the case here, I'd set it to 500 (or let canu auto-select it, it auto-sets the value to 1500). So I don't think there is any issue with the 19kb or 24kb libraries, both should work.
Hi Sergey,
Thank you so much for trouble shooting this. I will rerun the 19 and 24 kb libraries again.
Also the assembly with -pacbio-hifi is running with the new paramaters batOptions="-eg 0.0 -eM 0.0 -dg 3 -db 3 -dr 1 -ca 50 -cp 5".
I will post an update as soon as I finish all these assemblies.
Best, Harsh
The other libraries ran and gave me descent assemblies
The -pacbio-hifi also worked well. I am in process of comparing these assemblies w.r.t to various metrics.
Thanks for all the help Sergey. I am closing this issue
Best, Harsh
Hi Sergey,
I am trying to assemble these different HiFi libraries (https://www.ncbi.nlm.nih.gov/sra/?term=pacbio+HiFi+A4+X+ISO1) from A4 X ISO1 F1 females (drosophila).
We had some illumina data for A4 and ISO1 strains with which we did trio binning.The trio binning worked quite well.
For 24 kb library the initial depth was around : 184X. And the binning procedure binned A4 - 90.93X and ISO1 - 91.44X For 19 kb library the initial depth was around : 219X . And the binning procedure binned A4 - 107.2 X and ISO1 - 107.07X
I had started the work few weeks back and I used the branch Canu branch v1.9 +366 changes (r9576 7f9aa6485e4ecbed6fa12fa593aad12303650acc) . As per the nature biotech paper I used the following parameters .(I did trimming as well assembly)
1) The 24 kb library did fantastic for both haplotypes (barring two or three mis-assemblies) but it gave us an unexpected genome size of ~174 Mb. Usually for drosphila we get a genome size of 145 Mb or so with the Pacbio CLR or Nanopore. The preliminary results (for now) show that the extra 25-30 Mb are pericentromeric sequences (which the previous assemblies used to collapse). That is great news.I am attaching the report for the 24 kb library for both haplotypes.
2) The 19 kb library was showing some very strange behavior. Though the binning worked well the assemblies were off ( NG50 ~ 1.12 Mb and genome size of ~268 Mb - this can never happen probably).Also when I looked at [UNITIGGING/OVERLAPS] the unique overlaps were only ~ 41 %. I am attaching reports for both haplotypes.
( I repeated the assembly again assuming I did some mistake while running it initially. But I got the same results . I also assembled the 11 kb library but since it had around ~19X per haplotype, I was not expecting great results. But it gave me quite a decent assembly given the low depth (NG50 of ~5Mb). The error rate of 19 kb library is somewhere in between 11kb & 24 kb and the assembly worked well for both 11kb and 24 kb with correctedErrorRate=0.015 and is not working for 19kb library. Any reason why?. Also I mapped the haplotype seperated reads of 19kb library to the the A4 and ISO1 current reference and I saw no bias in coverage (uniform coverage across all major chromosomal arms) )
Also since the official release of Canu 1.9 (a week back) the -pacbio-hifi option started getting supported (Previously I did not try it since I read in one of the issues that it was still experimental). So I decided to give it a try on both 19kb and 24kb libraries. I only did the ISO1 haplotype for testing. The binning statistics were same as the previous assemblies
Version : Canu 1.9 release
For 24 kb library the stats are as follows
The report : 24kb_hifi_par.ISO1A4-haplotype_ISO1.report.txt
For 19kb libraray the stats are as follows
The report : 19kb_hifi_par_ISO1A4-haplotype_ISO1.report.txt
Now it seems the 19kb library is working and it gave us a slightly better assembly as compared to 24 kb one.(Though the NG50 looks better as compared to --pacbio-corrected one . When I looked at the biggest contig ((40 Mb and 31 Mb) ) in both of them , it had joined two big chunks of different chromosomal arms. Also the genome size we are getting is a little less than the previous --pacbio-corrected assemblies). The coverage shown in the canu report is less than the input coverage but I am guessing that is because the reads are homo-polymer compressed. The error rate used for --pacbio-hifi assembly was correctedErrorRate=0.025 . Any particular reason for increasing it from 0.015.