marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/
657 stars 179 forks source link

HiFi (CCS) assemblies with different libraries #1559

Closed harsh-shukla closed 4 years ago

harsh-shukla commented 4 years ago

Hi Sergey,

I am trying to assemble these different HiFi libraries (https://www.ncbi.nlm.nih.gov/sra/?term=pacbio+HiFi+A4+X+ISO1) from A4 X ISO1 F1 females (drosophila).

We had some illumina data for A4 and ISO1 strains with which we did trio binning.The trio binning worked quite well.

For 24 kb library the initial depth was around : 184X. And the binning procedure binned A4 - 90.93X and ISO1 - 91.44X For 19 kb library the initial depth was around : 219X . And the binning procedure binned A4 - 107.2 X and ISO1 - 107.07X

I had started the work few weeks back and I used the branch Canu branch v1.9 +366 changes (r9576 7f9aa6485e4ecbed6fa12fa593aad12303650acc) . As per the nature biotech paper I used the following parameters .(I did trimming as well assembly)

correctedErrorRate=0.015 ovlMerThreshold=75 batOptions="-eg 0.01 -eM 0.01 -dg 6 -db 6 -dr 1 -ca 50 -cp 5" -pacbio-corrected *.fasta

1) The 24 kb library did fantastic for both haplotypes (barring two or three mis-assemblies) but it gave us an unexpected genome size of ~174 Mb. Usually for drosphila we get a genome size of 145 Mb or so with the Pacbio CLR or Nanopore. The preliminary results (for now) show that the extra 25-30 Mb are pericentromeric sequences (which the previous assemblies used to collapse). That is great news.I am attaching the report for the 24 kb library for both haplotypes.

NG50 for A4     -  ~ 15 Mb -   [24kb_ISOA4-haplotype_A4.report.txt](https://github.com/marbl/canu/files/3848422/24kb_ISOA4-haplotype_A4.report.txt)

NG50 for ISO1 -  ~ 21 Mb -   [24kb_ISOA4-haplotypeISO1.report.txt](https://github.com/marbl/canu/files/3848424/24kb_ISOA4-haplotypeISO1.report.txt)

2) The 19 kb library was showing some very strange behavior. Though the binning worked well the assemblies were off ( NG50 ~ 1.12 Mb and genome size of ~268 Mb - this can never happen probably).Also when I looked at [UNITIGGING/OVERLAPS] the unique overlaps were only ~ 41 %. I am attaching reports for both haplotypes.

 [19_kb_ISOA4-haplotype_A4.report.txt](https://github.com/marbl/canu/files/3848468/19_kb_ISOA4-haplotype_A4.report.txt)
 [19_kb_ISOA4-haplotype_ISO1.report.txt](https://github.com/marbl/canu/files/3848469/19_kb_ISOA4-haplotype_ISO1.report.txt)

( I repeated the assembly again assuming I did some mistake while running it initially. But I got the same results . I also assembled the 11 kb library but since it had around ~19X per haplotype, I was not expecting great results. But it gave me quite a decent assembly given the low depth (NG50 of ~5Mb). The error rate of 19 kb library is somewhere in between 11kb & 24 kb and the assembly worked well for both 11kb and 24 kb with correctedErrorRate=0.015 and is not working for 19kb library. Any reason why?. Also I mapped the haplotype seperated reads of 19kb library to the the A4 and ISO1 current reference and I saw no bias in coverage (uniform coverage across all major chromosomal arms) )

Also since the official release of Canu 1.9 (a week back) the -pacbio-hifi option started getting supported (Previously I did not try it since I read in one of the issues that it was still experimental). So I decided to give it a try on both 19kb and 24kb libraries. I only did the ISO1 haplotype for testing. The binning statistics were same as the previous assemblies

Version : Canu 1.9 release

For 24 kb library the stats are as follows

[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
--   contigs:      408 sequences, total length 169481407 bp (including 2 repeats of total length 100002 bp).
--   bubbles:      0 sequences, total length 0 bp.
--   unassembled:  44544 sequences, total length 1066145766 bp.
--
-- Contig sizes based on genome size 140mbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10    31247174             1    31247174
--     20    31247174             1    31247174
--     30    25813104             2    57060278
--     40    25813104             2    57060278
--     50    22941006             3    80001284
--     60    17635616             4    97636900
--     70     8741949             5   106378849
--     80     3314500             7   113530428
--     90     2011703            13   127324900
--    100      945773            22   140343560
--    110      103675            71   154080291
--    120       29669           352   168013156

The report : 24kb_hifi_par.ISO1A4-haplotype_ISO1.report.txt

For 19kb libraray the stats are as follows

[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
--   contigs:      549 sequences, total length 173117432 bp (including 11 repeats of total length 556707 bp).
--   bubbles:      0 sequences, total length 0 bp.
--   unassembled:  128745 sequences, total length 2468852972 bp.
--
-- Contig sizes based on genome size 140mbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10    40318474             1    40318474
--     20    40318474             1    40318474
--     30    25644769             2    65963243
--     40    25644769             2    65963243
--     50    20883269             3    86846512
--     60    20883269             3    86846512
--     70    11794874             4    98641386
--     80     2734839             7   112837657
--     90     1822340            13   126292109
--    100      797076            25   140756361
--    110      110273            79   154014719
--    120       32605           346   168007141 

The report : 19kb_hifi_par_ISO1A4-haplotype_ISO1.report.txt

Now it seems the 19kb library is working and it gave us a slightly better assembly as compared to 24 kb one.(Though the NG50 looks better as compared to --pacbio-corrected one . When I looked at the biggest contig ((40 Mb and 31 Mb) ) in both of them , it had joined two big chunks of different chromosomal arms. Also the genome size we are getting is a little less than the previous --pacbio-corrected assemblies). The coverage shown in the canu report is less than the input coverage but I am guessing that is because the reads are homo-polymer compressed. The error rate used for --pacbio-hifi assembly was correctedErrorRate=0.025 . Any particular reason for increasing it from 0.015.

skoren commented 4 years ago
  1. Both those assemblies and parameters seem reasonable, I'll just note that 175mb assemblies of ISO1 aren't that much larger than CLR. We typically get a 160mb assembly for the old ISO1 data (e.g.: https://github.com/marbl/canu-regression/blob/master/results/2018-10-19-1728/drosophila_melanogaster.pacbio.p5c3.report).

  2. That doesn't make much sense, it seems almost like you're assembling both haplotypes together. That would explain both the assembly size and the continuity drop. I do see lots of reads with low coverage not present in the other reports which again is consistent with both haplotypes being mixed together (at low coverage). However, the histogram and the coverage don't support this, I'd expect to see a second peak in the histogram. Did you clean up everything from the old run when you re-tried the assembly? Can you share the ISO1 binned reads only to run locally.

  3. The hifi option is still experimental and has some limitations in 1.9 (for example the gfa is not properly updated for the uncompressed space). However, it should produce better assemblies in terms of repeat resolution than pacbio-corrected, at least on complex genomes. I'm not sure there is enough complexity in drosophila for it to matter. The error rate for the hifi is not actually 2.5%, we just compute overlap out to that limit in case they can be corrected. By default it will only use overlaps below 0.03% error, much lower than used with pacbio-corrected. The 24kb library is much noisier than the 19kb library, in my experience, so you may be able to run the 19kb library with an even lower error rate (0%) by adding batOptions="-eg 0.0 eM 0.0 -dg 3 -db 3 -dr 1 -ca 50 -cp 5", that might also get rid of the mis-assembly (but I expect it won't work on the 24kb library).

harsh-shukla commented 4 years ago

Hi Sergey,

Thank you for your inputs.

  1. For 19 kb library I think I did clear everything before running again. Using the latest release(Canu 1.9) I also binned the reads again. (the statistics were excatly same ; So I am hoping binning was not the issue). I am just worried that there might be some issue with the 19kb library that I am not able to figure out (But again the _-pacbiohifi options worked atleast for the 19kb ISO1 haplotype). I have not tried assembling the 19kb library with -pacbio-corrected options using the Canu 1.9 official release. Should I try that.

    Meanwhile I have uploaded the ISO1 haplotype reads on the google drive. Here is the link https://drive.google.com/open?id=17kKvYmX8LUAak4YwI6wzqTmGgA28rGov

    Please do let me know if you have any trouble accessing it.

  2. I will run the 19kb library with the -pacbio-hifi and new batOptions

P.S : The illumina data (for trio binning) we have is really really old (2010-2012) and short (50-75bp). Though I have trimmed it aggressively. That should not be an issue.Right?

Best, Harsh

skoren commented 4 years ago
  1. I wouldn't expect much different from the 1.9 official release but it's worth a test to make sure it's not some kind of bug. I'll see what I get from the ISO reads you uploaded as well.

The illumina data is just used to find short (19 mers with drosophila I expect) so as long as you can get reliable 19mers then yes, the data shouldn't be an issue.

skoren commented 4 years ago

I tried your binned reads you shared (md5sum: 34ffcb9c90f5c965ae4f8c9576989ad6) and ran the 1.9 release as: canu -d asm -p 'asm' 'genomeSize=150m' -pacbio-corrected '/gpfs/gsfs10/users/Phillippy/projects/hifi/dmel_24k_trio/haplotype/haplotype-19k-ISO.fasta.gz', I don't see the weird contig sizes you did:

[UNITIGGING/OVERLAPS]
--   category            reads     %          read length        feature size or coverage  analysis
--   ----------------  -------  -------  ----------------------  ------------------------  --------------------
--   middle-missing        254    0.03    16618.95 +- 4918.58       3396.67 +- 3872.15    (bad trimming)
--   middle-hump            46    0.01    11932.20 +- 5946.59        849.22 +- 1598.61    (bad trimming)
--   no-5-prime            213    0.03    16948.42 +- 5328.72        573.05 +- 1312.85    (bad trimming)
--   no-3-prime            212    0.03    16766.92 +- 5054.80        943.50 +- 2913.30    (bad trimming)
--   
--   low-coverage         3409    0.44    16823.09 +- 4911.97          9.32 +- 7.48       (easy to assemble, potential for lower quality consensus)
--   unique             660953   86.23    19523.75 +- 2382.01        107.87 +- 17.84      (easy to assemble, perfect, yay)
--   repeat-cont         19605    2.56    19125.52 +- 2369.08       1036.62 +- 793.83     (potential for consensus errors, no impact on assembly)
--   repeat-dove           691    0.09    24447.12 +- 2293.23        807.28 +- 602.90     (hard to assemble, likely won't assemble correctly or even at all)
--   
--   span-repeat          9645    1.26    19640.06 +- 2513.95       7437.71 +- 6060.99    (read spans a large repeat, usually easy to assemble)
--   uniq-repeat-cont    51724    6.75    18967.94 +- 1980.88                             (should be uniquely placed, low potential for consensus errors, no impact on assembly)
--   uniq-repeat-dove     9890    1.29    22312.44 +- 2704.27                             (will end contigs, potential to misassemble)
--   uniq-anchor          9801    1.28    19753.76 +- 2401.71       9273.98 +- 5800.00    (repeat read, with unique section, probable bad read)

[UNITIGGING/ADJUSTMENT]
-- No report available.

[UNITIGGING/CONTIGS]
-- Found, in version 1, after unitig construction:
--   contigs:      588 sequences, total length 169225988 bp (including 55 repeats of total length 3418978 bp).
--   bubbles:      0 sequences, total length 0 bp.
--   unassembled:  5561 sequences, total length 97026688 bp.
--
-- Contig sizes based on genome size 150mbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10    25691339             1    25691339
--     20    23806050             2    49497389
--     30    23806050             2    49497389
--     40    11874210             3    61371599
--     50     7624419             5    79707659
--     60     5785509             7    92428693
--     70     4290099            10   106474616
--     80     1403683            17   121285253
--     90      650386            31   135166572
--    100      104436            95   150002937
--    110       30296           397   165015465
--

[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
--   contigs:      588 sequences, total length 168900360 bp (including 55 repeats of total length 3425305 bp).
--   bubbles:      0 sequences, total length 0 bp.
--   unassembled:  5561 sequences, total length 97006905 bp.
--
-- Contig sizes based on genome size 150mbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10    25644098             1    25644098
--     20    23763191             2    49407289
--     30    23763191             2    49407289
--     40    11851751             3    61259040
--     50     7610464             5    79558771
--     60     5772787             7    92254467
--     70     4280387            10   106273365
--     80     1396434            17   121054319
--     90      644150            32   135543297
--    100       97827            99   150087695
--    110       29273           407   165022497
--

I would guess the issue is ovlMerThreshold=75 option. You have over 100x coverage but this will ignore any k-mer that occurs more than 75 times in the histogram. You can see from the report, this will eliminate the true k-mers in the main peak:

[UNITIGGING/MERS]
--
--  22-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1         0                                                                        0.0000 0.0000
--       2-     2  49595674 ********************************************************************** 0.2371 0.0068
--       3-     4  27145805 **************************************                                 0.3257 0.0106
--       5-     7   9116624 ************                                                           0.3891 0.0145
--       8-    11   2881814 ****                                                                   0.4158 0.0171
--      12-    16   1045551 *                                                                      0.4256 0.0185
--      17-    22    589042                                                                        0.4298 0.0194
--      23-    29    619651                                                                        0.4324 0.0201
--      30-    37    882546 *                                                                      0.4354 0.0213
--      38-    46   1186291 *                                                                      0.4398 0.0235
--      47-    56   1622309 **                                                                     0.4456 0.0270
--      57-    67   2377744 ***                                                                    0.4534 0.0329
--      68-    79   5305119 *******                                                                0.4654 0.0438
--      80-    92  15729302 **********************                                                 0.4928 0.0733
--      93-   106  36972416 ****************************************************                   0.5737 0.1750
--     107-   121  37509231 ****************************************************                   0.7559 0.4376
--     122-   137  12596582 *****************                                                      0.9281 0.7183
--     138-   154   1349814 *                                                                      0.9818 0.8166
--     155-   172    275195                                                                        0.9874 0.8280
--     173-   191    269613                                                                        0.9886 0.8310
--     192-   211    301182                                                                        0.9899 0.8344
--     212-   232    302795                                                                        0.9914 0.8386
--     233-   254    195943                                                                        0.9928 0.8431
--     255-   277    115304                                                                        0.9937 0.8463
--     278-   301     86041                                                                        0.9943 0.8483
--     302-   326     81351                                                                        0.9947 0.8500
--     327-   352     79899                                                                        0.9951 0.8518
--     353-   379     60519                                                                        0.9954 0.8536
--     380-   407     53985                                                                        0.9957 0.8551
--     408-   436     48699                                                                        0.9960 0.8566
--     437-   466     45755                                                                        0.9962 0.8580
--     467-   497     36056                                                                        0.9964 0.8594
--     498-   529     32446                                                                        0.9966 0.8605
--     530-   562     30803                                                                        0.9967 0.8617
--     563-   596     33441                                                                        0.9969 0.8628
--     597-   631     28681                                                                        0.9971 0.8641
--     632-   667     27122                                                                        0.9972 0.8654
--     668-   704     27014                                                                        0.9973 0.8666
--     705-   742     23261                                                                        0.9974 0.8678
--     743-   781     23480                                                                        0.9976 0.8690
--     782-   821     19975                                                                        0.9977 0.8702

That was set for the human dataset in the paper which only had 22x coverage total. In the case here, I'd set it to 500 (or let canu auto-select it, it auto-sets the value to 1500). So I don't think there is any issue with the 19kb or 24kb libraries, both should work.

harsh-shukla commented 4 years ago

Hi Sergey,

Thank you so much for trouble shooting this. I will rerun the 19 and 24 kb libraries again.

Also the assembly with -pacbio-hifi is running with the new paramaters batOptions="-eg 0.0 -eM 0.0 -dg 3 -db 3 -dr 1 -ca 50 -cp 5".

I will post an update as soon as I finish all these assemblies.

Best, Harsh

harsh-shukla commented 4 years ago

The other libraries ran and gave me descent assemblies

The -pacbio-hifi also worked well. I am in process of comparing these assemblies w.r.t to various metrics.

Thanks for all the help Sergey. I am closing this issue

Best, Harsh