Closed: janvanoeveren closed this issue 6 years ago.
--
-- In gatekeeper store 'correction/S245.gkpStore':
-- Found 38047697 reads.
-- Found 403,673,836,468 bases (25.22 times coverage).
--
-- Read length histogram (one '*' equals 152629.08 reads):
-- 0 4999 10684036 **********************************************************************
-- 5000 9999 10039231 *****************************************************************
-- 10000 14999 6893564 *********************************************
-- 15000 19999 5896892 **************************************
-- 20000 24999 2988107 *******************
-- 25000 29999 1039064 ******
-- 30000 34999 344323 **
-- 35000 39999 114081
-- 40000 44999 34098
-- 45000 49999 10092
-- 50000 54999 2948
-- 55000 59999 794
-- 60000 64999 228
-- 65000 69999 98
-- 70000 74999 53
-- 75000 79999 34
-- 80000 84999 18
-- 85000 89999 10
-- 90000 94999 9
-- 95000 99999 3
-- 100000 104999 4
-- 105000 109999 4
-- 110000 114999 1
-- 115000 119999 0
-- 120000 124999 0
-- 125000 129999 2
-- 130000 134999 1
-- 135000 139999 0
-- 140000 144999 0
-- 145000 149999 1
-- 150000 154999 0
-- 155000 159999 0
-- 160000 164999 0
-- 165000 169999 0
-- 170000 174999 0
-- 175000 179999 0
-- 180000 184999 0
-- 185000 189999 0
-- 190000 194999 1
[CORRECTION/MERS]
--
-- 16-mers Fraction
-- Occurrences NumMers Unique Total
-- 1- 1 13502 0.0000 0.0000
-- 2- 2 58981 0.0000 0.0000
-- 3- 4 612639 0.0001 0.0000
-- 5- 7 4916517 ** 0.0007 0.0001
-- 8- 11 23369926 ********* 0.0043 0.0002
-- 12- 16 63849014 ************************** 0.0180 0.0009
-- 17- 22 112864916 ********************************************** 0.0510 0.0036
-- 23- 29 150070237 ************************************************************* 0.1055 0.0096
-- 30- 37 168375992 ********************************************************************* 0.1757 0.0197
-- 38- 46 170742609 ********************************************************************** 0.2534 0.0340
-- 47- 56 162947190 ****************************************************************** 0.3317 0.0519
-- 57- 67 149860816 ************************************************************* 0.4064 0.0727
-- 68- 79 134602847 ******************************************************* 0.4750 0.0956
-- 80- 92 119042845 ************************************************ 0.5366 0.1200
-- 93- 106 104304966 ****************************************** 0.5912 0.1453
-- 107- 121 90937395 ************************************* 0.6390 0.1708
-- 122- 137 79114549 ******************************** 0.6808 0.1964
-- 138- 154 68754925 **************************** 0.7172 0.2216
-- 155- 172 59746951 ************************ 0.7488 0.2463
-- 173- 191 51966684 ********************* 0.7763 0.2704
-- 192- 211 45239359 ****************** 0.8003 0.2937
-- 212- 232 39485057 **************** 0.8212 0.3162
-- 233- 254 34565616 ************** 0.8394 0.3378
-- 255- 277 30344262 ************ 0.8553 0.3586
-- 278- 301 26733489 ********** 0.8694 0.3785
-- 302- 326 23638409 ********* 0.8817 0.3976
-- 327- 352 20967790 ******** 0.8927 0.4160
-- 353- 379 18630927 ******* 0.9024 0.4335
-- 380- 407 16595182 ****** 0.9110 0.4504
-- 408- 436 14817086 ****** 0.9187 0.4665
-- 437- 466 13263561 ***** 0.9256 0.4820
-- 467- 497 11896250 **** 0.9317 0.4968
-- 498- 529 10698137 **** 0.9372 0.5110
-- 530- 562 9639556 *** 0.9422 0.5246
-- 563- 596 8710691 *** 0.9466 0.5376
-- 597- 631 7880595 *** 0.9507 0.5501
-- 632- 667 7145741 ** 0.9543 0.5620
-- 668- 704 6487045 ** 0.9577 0.5735
-- 705- 742 5904586 ** 0.9607 0.5845
-- 743- 781 5384402 ** 0.9634 0.5951
-- 782- 821 4915830 ** 0.9659 0.6053
--
-- 184027607 (max occurrences)
-- 403103107511 (total mers, non-unique)
-- 2147501366 (distinct mers, non-unique)
-- 13502 (unique mers)
[CORRECTION/CORRECTIONS]
--
-- Reads to be corrected:
-- 37535364 reads longer than 0 bp
-- 397623746990 bp
-- Expected corrected reads:
-- 37535364 reads
-- 132550121930 bp
-- 0 bp minimum length
-- 3531 bp mean length
-- 28841 bp n50 length
--
-- In gatekeeper store 'trimming/S245.gkpStore':
-- Found 5837622 reads.
-- Found 50384133956 bases (3.14 times coverage).
--
-- Read length histogram (one '*' equals 7793.08 reads):
-- 0 999 0
-- 1000 1999 545516 **********************************************************************
-- 2000 2999 492982 ***************************************************************
-- 3000 3999 465366 ***********************************************************
-- 4000 4999 445631 *********************************************************
-- 5000 5999 428239 ******************************************************
-- 6000 6999 408010 ****************************************************
-- 7000 7999 379438 ************************************************
-- 8000 8999 345839 ********************************************
-- 9000 9999 308919 ***************************************
-- 10000 10999 272736 **********************************
-- 11000 11999 239700 ******************************
-- 12000 12999 212267 ***************************
-- 13000 13999 190739 ************************
-- 14000 14999 174986 **********************
-- 15000 15999 162431 ********************
-- 16000 16999 150965 *******************
-- 17000 17999 133458 *****************
-- 18000 18999 112792 **************
-- 19000 19999 91562 ***********
-- 20000 20999 71094 *********
-- 21000 21999 53768 ******
-- 22000 22999 40438 *****
-- 23000 23999 29682 ***
-- 24000 24999 21979 **
-- 25000 25999 16258 **
-- 26000 26999 11685 *
-- 27000 27999 8455 *
-- 28000 28999 6124
-- 29000 29999 4477
-- 30000 30999 3306
-- 31000 31999 2404
-- 32000 32999 1730
-- 33000 33999 1289
-- 34000 34999 982
-- 35000 35999 657
-- 36000 36999 517
-- 37000 37999 359
-- 38000 38999 245
-- 39000 39999 165
-- 40000 40999 125
-- 41000 41999 98
-- 42000 42999 52
-- 43000 43999 51
-- 44000 44999 31
-- 45000 45999 20
-- 46000 46999 23
-- 47000 47999 11
-- 48000 48999 8
-- 49000 49999 1
-- 50000 50999 3
-- 51000 51999 3
-- 52000 52999 2
-- 53000 53999 0
-- 54000 54999 2
-- 55000 55999 0
-- 56000 56999 0
-- 57000 57999 0
-- 58000 58999 0
-- 59000 59999 1
-- 60000 60999 0
-- 61000 61999 0
-- 62000 62999 0
-- 63000 63999 0
-- 64000 64999 0
-- 65000 65999 0
-- 66000 66999 0
-- 67000 67999 0
-- 68000 68999 1
[TRIMMING/MERS]
--
-- 22-mers Fraction
-- Occurrences NumMers Unique Total
-- 1- 1 6924284039 *******************************************************************--> 0.5866 0.1378
-- 2- 2 1249562859 **************************************************************** 0.6925 0.1875
-- 3- 4 1363190979 ********************************************************************** 0.7575 0.2333
-- 5- 7 1160121650 *********************************************************** 0.8487 0.3285
-- 8- 11 612974142 ******************************* 0.9255 0.4516
-- 12- 16 218190666 *********** 0.9640 0.5434
-- 17- 22 91916905 **** 0.9785 0.5929
-- 23- 29 52253673 ** 0.9853 0.6250
-- 30- 37 33406215 * 0.9894 0.6504
-- 38- 46 22667880 * 0.9920 0.6716
-- 47- 56 16004912 0.9938 0.6899
-- 57- 67 11675011 0.9951 0.7058
-- 68- 79 8719891 0.9961 0.7198
-- 80- 92 6612294 0.9968 0.7323
-- 93- 106 5072105 0.9973 0.7434
-- 107- 121 3956379 0.9978 0.7533
-- 122- 137 3144889 0.9981 0.7621
-- 138- 154 2557903 0.9983 0.7701
-- 155- 172 2093773 0.9986 0.7775
-- 173- 191 1731618 0.9987 0.7842
-- 192- 211 1452283 0.9989 0.7904
-- 212- 232 1223406 0.9990 0.7962
-- 233- 254 1047301 0.9991 0.8016
-- 255- 277 905209 0.9992 0.8066
-- 278- 301 785632 0.9993 0.8114
-- 302- 326 684016 0.9993 0.8159
-- 327- 352 597028 0.9994 0.8201
-- 353- 379 525739 0.9994 0.8241
-- 380- 407 465789 0.9995 0.8279
-- 408- 436 413197 0.9995 0.8316
-- 437- 466 370323 0.9996 0.8350
-- 467- 497 333767 0.9996 0.8383
-- 498- 529 300401 0.9996 0.8415
-- 530- 562 277376 0.9996 0.8446
-- 563- 596 255442 0.9997 0.8476
-- 597- 631 236696 0.9997 0.8505
-- 632- 667 217767 0.9997 0.8534
-- 668- 704 200751 0.9997 0.8562
-- 705- 742 192727 0.9997 0.8590
-- 743- 781 187656 0.9998 0.8617
-- 782- 821 177915 0.9998 0.8646
--
-- 4489071 (max occurrences)
-- 43337259840 (total mers, non-unique)
-- 4879144506 (distinct mers, non-unique)
-- 6924284039 (unique mers)
[TRIMMING/TRIMMING]
-- PARAMETERS:
-- ----------
-- 1000 (reads trimmed below this many bases are deleted)
-- 0.0300 (use overlaps at or below this fraction error)
-- 1 (break region if overlap is less than this long, for 'largest covered' algorithm)
-- 1 (break region if overlap coverage is less than this many read, for 'largest covered' algorithm)
--
-- INPUT READS:
-- -----------
-- 5837622 reads 50384133956 bases (reads processed)
-- 0 reads 0 bases (reads not processed, previously deleted)
-- 0 reads 0 bases (reads not processed, in a library where trimming isn't allowed)
--
-- OUTPUT READS:
-- ------------
-- 4726414 reads 33664301899 bases (trimmed reads output)
-- 228287 reads 1810560017 bases (reads with no change, kept as is)
-- 418017 reads 1544988989 bases (reads with no overlaps, deleted)
-- 464904 reads 2372378883 bases (reads with short trimmed length, deleted)
--
-- TRIMMING DETAILS:
-- ----------------
-- 4151395 reads 5873957073 bases (bases trimmed from the 5' end of a read)
-- 4346958 reads 5117947095 bases (bases trimmed from the 3' end of a read)
[TRIMMING/SPLITTING]
-- PARAMETERS:
-- ----------
-- 1000 (reads trimmed below this many bases are deleted)
-- 0.0300 (use overlaps at or below this fraction error)
-- INPUT READS:
-- -----------
-- 4954701 reads 46466766084 bases (reads processed)
-- 882921 reads 3917367872 bases (reads not processed, previously deleted)
-- 0 reads 0 bases (reads not processed, in a library where trimming isn't allowed)
--
-- PROCESSED:
-- --------
-- 0 reads 0 bases (no overlaps)
-- 80265 reads 556668360 bases (no coverage after adjusting for trimming done already)
-- 0 reads 0 bases (processed for chimera)
-- 0 reads 0 bases (processed for spur)
-- 4874436 reads 45910097724 bases (processed for subreads)
--
-- READS WITH SIGNALS:
-- ------------------
-- 0 reads 0 signals (number of 5' spur signal)
-- 0 reads 0 signals (number of 3' spur signal)
-- 0 reads 0 signals (number of chimera signal)
-- 1097 reads 1106 signals (number of subread signal)
--
-- SIGNALS:
-- -------
-- 0 reads 0 bases (size of 5' spur signal)
-- 0 reads 0 bases (size of 3' spur signal)
-- 0 reads 0 bases (size of chimera signal)
-- 1106 reads 430611 bases (size of subread signal)
--
-- TRIMMING:
-- --------
-- 611 reads 2287734 bases (trimmed from the 5' end of the read)
-- 487 reads 1831761 bases (trimmed from the 3' end of the read)
[UNITIGGING/READS]
--
-- In gatekeeper store 'unitigging/S245.gkpStore':
-- Found 4954696 reads.
-- Found 35470738030 bases (2.21 times coverage).
--
-- Read length histogram (one '*' equals 11499.15 reads):
-- 0 999 0
-- 1000 1999 804941 **********************************************************************
-- 2000 2999 571379 *************************************************
-- 3000 3999 469253 ****************************************
-- 4000 4999 412759 ***********************************
-- 5000 5999 367490 *******************************
-- 6000 6999 326007 ****************************
-- 7000 7999 283755 ************************
-- 8000 8999 246145 *********************
-- 9000 9999 213307 ******************
-- 10000 10999 184946 ****************
-- 11000 11999 159770 *************
-- 12000 12999 139026 ************
-- 13000 13999 121889 **********
-- 14000 14999 109927 *********
-- 15000 15999 100375 ********
-- 16000 16999 90883 *******
-- 17000 17999 79396 ******
-- 18000 18999 65755 *****
-- 19000 19999 52790 ****
-- 20000 20999 40967 ***
-- 21000 21999 30431 **
-- 22000 22999 23025 **
-- 23000 23999 16618 *
-- 24000 24999 12340 *
-- 25000 25999 9021
-- 26000 26999 6331
-- 27000 27999 4607
-- 28000 28999 3223
-- 29000 29999 2312
-- 30000 30999 1662
-- 31000 31999 1232
-- 32000 32999 889
-- 33000 33999 669
-- 34000 34999 477
-- 35000 35999 326
-- 36000 36999 250
-- 37000 37999 165
-- 38000 38999 102
-- 39000 39999 75
-- 40000 40999 60
-- 41000 41999 39
-- 42000 42999 28
-- 43000 43999 13
-- 44000 44999 11
-- 45000 45999 13
-- 46000 46999 5
-- 47000 47999 6
-- 48000 48999 2
-- 49000 49999 1
-- 50000 50999 1
-- 51000 51999 2
[UNITIGGING/MERS]
--
-- 22-mers Fraction
-- Occurrences NumMers Unique Total
-- 1- 1 3264009111 *******************************************************************--> 0.4672 0.0923
-- 2- 2 828616052 ******************************************************* 0.5858 0.1391
-- 3- 4 1054097255 ********************************************************************** 0.6685 0.1881
-- 5- 7 960518507 *************************************************************** 0.7932 0.2978
-- 8- 11 507767805 ********************************* 0.9013 0.4436
-- 12- 16 172219029 *********** 0.9548 0.5506
-- 17- 22 68782627 **** 0.9739 0.6054
-- 23- 29 38248189 ** 0.9824 0.6394
-- 30- 37 24091289 * 0.9874 0.6657
-- 38- 46 16202868 * 0.9906 0.6874
-- 47- 56 11313797 0.9928 0.7059
-- 57- 67 8203087 0.9944 0.7219
-- 68- 79 6067241 0.9955 0.7359
-- 80- 92 4576255 0.9963 0.7482
-- 93- 106 3480468 0.9970 0.7591
-- 107- 121 2683573 0.9975 0.7687
-- 122- 137 2142267 0.9978 0.7773
-- 138- 154 1726548 0.9981 0.7850
-- 155- 172 1405949 0.9984 0.7921
-- 173- 191 1167840 0.9986 0.7985
-- 192- 211 975937 0.9988 0.8044
-- 212- 232 821724 0.9989 0.8100
-- 233- 254 704823 0.9990 0.8151
-- 255- 277 606566 0.9991 0.8199
-- 278- 301 524276 0.9992 0.8244
-- 302- 326 457700 0.9993 0.8287
-- 327- 352 396956 0.9993 0.8327
-- 353- 379 346413 0.9994 0.8365
-- 380- 407 305412 0.9994 0.8401
-- 408- 436 271362 0.9995 0.8435
-- 437- 466 244477 0.9995 0.8467
-- 467- 497 217335 0.9996 0.8498
-- 498- 529 194029 0.9996 0.8528
-- 530- 562 179652 0.9996 0.8556
-- 563- 596 163727 0.9996 0.8584
-- 597- 631 153704 0.9997 0.8610
-- 632- 667 141263 0.9997 0.8637
-- 668- 704 134764 0.9997 0.8663
-- 705- 742 132363 0.9997 0.8689
-- 743- 781 134921 0.9997 0.8716
-- 782- 821 127945 0.9998 0.8745
--
-- 3189190 (max occurrences)
-- 32102680291 (total mers, non-unique)
-- 3722071131 (distinct mers, non-unique)
-- 3264009111 (unique mers)
There should be more information in the Canu logs (# of reads without overlaps, etc.), something like this:
-- READS:
-- -----
--
-- 109013 (no overlaps)
-- 4070150 (no overlaps filtered)
-- 2393499 (< 50% overlaps filtered)
-- 5094030 (< 80% overlaps filtered)
-- 9712974 (< 95% overlaps filtered)
-- 28492431 (< 100% overlaps filtered)
That would give an idea if the sensitivity parameters need to be adjusted and if corMinCoverage=0
would help. There should also be info on what parameters mhap used:
-- OVERLAPPER (mhap) (correction)
--
--
-- PARAMETERS: hashes=512, minMatches=3, threshold=0.78
The reason the coverage is reduced from 25x to 8x is that most of the reads don't have 4-fold coverage (the default minimum) across their full length, so they are being trimmed. corMinCoverage would help with this, but if the reads are very low quality they could still be trimmed during the trimming step.
Another observation from your logs: the k-mer distribution looks suspicious. You have 25x coverage, yet the peak of the distribution looks to be around 40. I wouldn't expect this to be higher than your coverage, which makes me suspect some contaminant in the data.
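If you want to check exactly where that peak sits, here is a minimal sketch, assuming the per-count meryl histogram written during correction is available (e.g. something under correction/0-mercounts/ with one 'count numDistinctMers ...' line per occurrence count; the exact path and filename depend on the Canu version):

# heuristic: skip the lowest counts (usually dominated by sequencing errors) and
# report the occurrence count with the largest number of distinct k-mers
awk '$1 > 2 && $2 > max { max = $2; peak = $1 }
     END { print "k-mer peak at ~" peak "x (" max " distinct mers)" }' \
    correction/0-mercounts/*.histogram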
Thanks
-- OVERLAPS:
-- --------
--
-- IGNORED:
--
-- 0 (< 0.0000 fraction error)
-- 0 (> 0.4095 fraction error)
-- 0 (< 0 bases long)
-- 0 (> 2097151 bases long)
--
-- FILTERED:
--
-- 836777765067 (too many overlaps, discard these shortest ones)
--
-- EVIDENCE:
--
-- 1206913429 (longest overlaps)
--
-- TOTAL:
--
-- 837984678496 (all overlaps)
--
-- READS:
-- -----
--
-- 512333 (no overlaps)
-- 11541183 (no overlaps filtered)
-- 3543081 (< 50% overlaps filtered)
-- 8287523 (< 80% overlaps filtered)
-- 12783144 (< 95% overlaps filtered)
-- 25994181 (< 100% overlaps filtered)
and
-- OVERLAPPER (mhap) (correction)
--
-- Set corMhapSensitivity=high based on read coverage of 25.
--
-- PARAMETERS: hashes=768, minMatches=2, threshold=0.73
About the 16-mer distribution: I've plotted the data using bins of 1, and this shows that the peak is actually at 28x (see attached plot).
The overlap stats look OK, so corMinCoverage=0 should help. You can do the same as before: remove the 2-correction folder and asm.correctedReads.fasta.gz (along with the trimming/unitigging folders) and see how that changes the corrected reads.
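(For reference, that restart boils down to something like the following, run from the top of the assembly directory; the folder and file names are the ones used in this run:)

# remove the second correction stage and its output so Canu recomputes them
rm -rf correction/2-correction
rm -f  asm.correctedReads.fasta.gz
# remove the downstream stages so trimming and unitigging are redone as well
rm -rf trimming unitigging
# then rerun the original canu command with corMinCoverage=0 added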
A peak at 28 is still suspicious for a 25x dataset given the high error rate in the reads and the low chance of k-mer survival. I'd expect something like 1/4 of your coverage (e.g. I've seen peaks in the low 30s for 100+x coverage datasets).
Can you confirm that this 16-mer analysis is done on the raw data (i.e. no filtering)? We would expect more unique and low-abundance k-mers ...
Lowering corMinCoverage helps a lot in gaining more corrected reads (increase from 3x to 9x) - but now we face a big drop in the trimming step:
-- INPUT READS:
-- -----------
-- 13444996 reads 144872263784 bases (reads processed)
-- 0 reads 0 bases (reads not processed, previously deleted)
-- 0 reads 0 bases (reads not processed, in a library where trimming isn't allowed)
--
-- OUTPUT READS:
-- ------------
-- 8641707 reads 57,876,361,867 bases (trimmed reads output)
-- 414450 reads 3,155,975,421 bases (reads with no change, kept as is)
-- 3325894 reads 27,383,155,690 bases (reads with no overlaps, deleted)
-- 1062945 reads 15,276,328,504 bases (reads with short trimmed length, deleted)
--
-- TRIMMING DETAILS:
-- ----------------
-- 7680471 reads 23,425,668,345 bases (bases trimmed from the 5' end of a read)
-- 8121168 reads 17,754,773,957 bases (bases trimmed from the 3' end of a read)
Any way to set additional trim parameters to cope with this (although we understand that low overlap will also mean a small chance of contigging ...)?
Yep, the 16-mer counts are done with raw data. I think this is pointing to low quality data with a small high coverage contaminant (like a plastid). This would be consistent with the trimming eliminating a lot of the data.
You could increase correctedErrorRate to 0.10; this will slow trimming down significantly but will keep more data. Is your corMinCoverage set to 0 or higher? With corMinCoverage=0, I would expect all your data to be corrected, not just 10x. You could also try multiple correction rounds, as seemed to work in issue #420.
Thanks. Yes, I used both corMinCoverage=0 and correctedErrorRate=0.10.
That implies more than 50% of the reads don't have any overlaps with high sensitivity. What is the summary report for correction overlaps in the run?
Multiple correction rounds will likely result in less corrected coverage, I would say. So how do I get more coverage?
Regarding "I would expect all your data to be corrected, not just 10x": how do I achieve this? I could start another analysis from scratch, lower some parameters (minOverlap and minReadLength), and maybe increase rawErrorRate? It's likely that repeats are abundant; I would say that contamination would show an additional peak in the k-mer plot, which I don't see ...
Multiple correction rounds would recover very low quality data that's being trimmed now, but won't exceed the coverage you get from the first round (e.g. 10x). Given you're already running with high sensitivity and min coverage 0, the only option would be to drop the mer size to maybe 14, but that will increase runtime. That's why I was asking about the overlapping report. Repeats won't matter; they would generate more, not fewer, overlaps. Essentially it seems like the only real signal in the data is the contaminant (that's the only peak); the majority of the target input data looks more like noise.
So to run multiple rounds: run correction with your current parameters, take all the corrected reads plus any input reads that didn't get corrected, feed them back into correction as raw PacBio reads with the same parameters, and repeat this 4-5 times.
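A rough sketch of that loop is below; it is illustrative only. collect_uncorrected.sh is a hypothetical helper that extracts raw reads whose IDs do not appear in the corrected output (Canu does not ship such a script, so the read-ID bookkeeping is up to you), and the genome size and file names are placeholders:

input=raw_reads.fasta.gz                      # placeholder for the raw PacBio reads
for round in 1 2 3 4 5; do
    canu -correct -p asm -d round${round} genomeSize=10g \
         corMinCoverage=0 corMhapSensitivity=high \
         -pacbio-raw ${input}

    # hypothetical helper: raw reads that produced no corrected pieces this round
    ./collect_uncorrected.sh ${input} round${round}/asm.correctedReads.fasta.gz \
        > round${round}/uncorrected.fasta

    # next input = corrected reads plus still-uncorrected raw reads, fed back in as 'raw'
    { zcat round${round}/asm.correctedReads.fasta.gz; cat round${round}/uncorrected.fasta; } \
        | gzip > round${round}/next_input.fasta.gz
    input=round${round}/next_input.fasta.gz
done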
CORRECTION INPUTS:
-----------------
37535364 (input reads)
397623746990 (input bases)
CORRECTION OUTPUTS:
-----------------
24090368 (reads that failed to generate any corrected bases)
13444996 (corrected read pieces)
144872263784 (corrected bases)
PIECES PER READ:
---------------
0 pieces: 24090368
1 pieces: 13444996
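(For scale: 24,090,368 of the 37,535,364 input reads, roughly 64%, produced no corrected bases at all.)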
I meant these overlap stats not the post-correction stats:
----------------------------------------
-- PARAMETERS:
-- ----------
--
-- 40 (expected coverage)
-- 0 (don't use overlaps shorter than this)
-- 0.000 (don't use overlaps with erate less than this)
-- 1.000 (don't use overlaps with erate more than this)
--
-- OVERLAPS:
-- --------
--
-- IGNORED:
--
-- 0 (< 0.0000 fraction error)
-- 0 (> 0.4095 fraction error)
-- 0 (< 0 bases long)
-- 0 (> 2097151 bases long)
--
-- FILTERED:
--
-- 409492177815 (too many overlaps, discard these shortest ones)
--
-- EVIDENCE:
--
-- 644111355 (longest overlaps)
--
-- TOTAL:
--
-- 410136289170 (all overlaps)
--
-- READS:
-- -----
--
-- 223194 (no overlaps)
-- 1923883 (no overlaps filtered)
-- 1354637 (< 50% overlaps filtered)
-- 3060707 (< 80% overlaps filtered)
-- 5008242 (< 95% overlaps filtered)
-- 15059837 (< 100% overlaps filtered)
--
-- Computing expected corrected read lengths 'correction/2-correction/asm.estimate.log'.
The first bit of logging in here is reporting
-- Reads to be corrected:
-- 37535364 reads longer than 0 bp
-- 397623746990 bp
-- Expected corrected reads:
-- 37535364 reads
-- 132550121930 bp
which is saying 2/3 of the bases have no overlap at all.
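(That follows from the numbers above: 132,550,121,930 expected corrected bp / 397,623,746,990 input bp ≈ 0.33, so roughly two thirds of the input bases are not expected to produce corrected sequence.)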
This bit of logging
-- FILTERED:
-- 836777765067 (too many overlaps, discard these shortest ones)
--
-- EVIDENCE:
-- 1206913429 (longest overlaps)
--
-- TOTAL:
-- 837984678496 (all overlaps)
says it threw out the vast majority of overlaps because they're too repetitive.
The last bit of logging hints that 2/3 of the reads had no (or very few) overlaps.
Instead of what Serge is asking for, generate a histogram from:
overlapStore -G *gkpStore -O *ovlStore -d -counts
This will dump a table of the number of overlaps per read: 'readID numberOfOverlaps'.
Thanks guys. Here's the log Sergey asked for:
----------
40 (expected coverage)
0 (don't use overlaps shorter than this)
0.000 (don't use overlaps with erate less than this)
1.000 (don't use overlaps with erate more than this)
OVERLAPS:
--------
IGNORED:
0 (< 0.0000 fraction error)
0 (> 0.4095 fraction error)
0 (< 0 bases long)
0 (> 2097151 bases long)
FILTERED:
836777765067 (too many overlaps, discard these shortest ones)
EVIDENCE:
1206913429 (longest overlaps)
TOTAL:
837984678496 (all overlaps)
READS:
-----
512333 (no overlaps)
11541183 (no overlaps filtered)
3543081 (< 50% overlaps filtered)
8287523 (< 80% overlaps filtered)
12783144 (< 95% overlaps filtered)
25994181 (< 100% overlaps filtered)
I cannot find the executable overlapStore (not in canu-1.5, nor in 1.6). And in which directory should I run it? I guess /correction?
I do see ovStoreStats (but that did not work) and OverlapStore.pm in lib/canu.
Whoops, sorry, make that ovStoreDump (overlapStore was the old old old name for the command).
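(For reference, a minimal sketch of turning that dump into a histogram, using the store names from this run; the exact ovStoreDump options may differ slightly between Canu versions:)

# dump 'readID numberOfOverlaps' for the correction overlap store,
# then bin into 'numberOfOverlaps numberOfReads'
ovStoreDump -G correction/S245.gkpStore -O correction/S245.ovlStore -d -counts \
    | awk '{ hist[$2]++ } END { for (n in hist) print n, hist[n] }' \
    | sort -n > overlaps_per_read.hist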
[Attached plot: number of occurrences for 0 to 300 overlaps per read; max # overlaps = 1951797 for a single read.]
So the mode is at 2 overlaps, with a very long tail (up to almost 2 M overlaps for a single read). What does this tell us? I guess very repeat-rich content ...
Ah, thanks for the log axis. I was squinting at the other one yesterday to see this.
With 25x coverage, the mode should be somewhere between 25 and 50 or so, not 8. This could mean sequencing failed, the source DNA was incredibly diverged, the genome is waaay larger than you think, ...
Looking at the 22-mer histogram for 'unitigging' there seems to be about 5x coverage in those reads. The max count is very high, but only a few different mers, implying there are some super-simple repeats or homopolymer junk causing the enormous number of overlaps for a few reads.
Have you completed an assembly for the corrected reads you have? Do any of the raw reads map to these? I'm attempting to find a way to tell if the reads are random sequence or real sequence but noisy and/or diverged.
For assembly, I can't think of anything to do except drop the kmer size as Sergey suggested. As he said, this will increase run time, and if it works, the size of overlap output will be much bigger.
Yes, I second what Brian said. I think it is consistent with some very repetitive or high-coverage contaminant sequence, but the majority of the data has no overlaps, indicating it is noisy, very diverged, or the genome is about 4-fold larger than your estimate.
However, I still expect more reads to have been output from correction with corMinCoverage=0; all the reads with >=1 overlap should have some bases corrected. You can see in the logs that it is trying to correct 397,623,746,990 of your total bases, which is about 97%, but then all those reads are lost in correctReads.sh (most end up with 0 pieces). Can you check the shell script and confirm it is passing 0 as the coverage threshold?
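(A quick way to check, assuming the standard layout of the correction stage referenced earlier in this thread:)

# the coverage threshold passed to falcon_sense; with corMinCoverage=0 this should show '--min_cov 0'
grep -n -- '--min_cov' correction/2-correction/correctReads.sh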
So if the correctReads.sh shell script is specifying coverage 0 and the count of reads in asm.readsToCorrect is the full dataset, it is possible there is a short circuit in the consensus caller that skips a read when it has no alignments, even with corMinCov=0. If that is the case, we should fix it.
My earlier suggestion, while a bit cumbersome, should work to determine whether the read quality is the issue or whether it is something else. You would run correction, take all reads that were corrected along with all raw reads that were not output by correction (to get back to full coverage), and run correction again on that dataset. Repeat that 4-5 times and try to run assembly and trimming on the final output.
Yes, I have an assembly with this set of corrected reads, yielding approx 100,000 unitigs totaling 4 Gb with an N50 of 40 kb. I haven't tried mapping raw reads yet. Another idea to check for contamination could be to look at GC content. The mode of 2 for overlaps is actually the 'raw' number of overlaps (so not on a log scale); I guess we would expect a Poisson-like distribution with a mode around 25 indeed, so this puzzles me too. I'm pretty sure it's using the corMinCoverage, as I got a major increase from 50 Gb corrected reads (default corMinCoverage=40) to 144 Gb corrected. This is part of the correctReads.sh:
$bin/falcon_sense \
--min_idt 0.7 \
--min_len 1000 \
--max_read_len 389014 \
--min_ovl_len 500 \
--min_cov 0 \
--n_core 4 \
> ./correction_outputs/$jobid.fasta.WORKING \
2> ./correction_outputs/$jobid.err \
&& \
mv ./correction_outputs/$jobid.fasta.WORKING ./correction_outputs/$jobid.fasta \
I appreciate your suggestion for the iterative approach, but as it will take a lot of time and space I'm a little hesitant to start it. Also, will it prevent the assembly of any contamination? The contaminant reads will also be confirmed by repeated rounds of correction, right?
Multiple rounds shouldn't take any more space than one, since you can erase all the output of a previous round (except for the fasta.gz file), but it will take 5 times longer (or however many rounds of correction you run). It won't eliminate the contaminant, but I'm not so much worried about that as about losing >50% of your data. Canu can deal with repeats in the assembly; the contaminant isn't going to affect your other data beyond increasing disk usage during correction. I think the reason it's getting assembled is that it is at such high coverage that you have some higher-identity reads by chance, and those are getting corrected well. One other option is to try minimap2 instead of mhap for the correction overlaps, which will filter repeats more aggressively: try corOverlapper=minimap and symlink the minimap2 executable from the canu/Linux-amd64/bin directory. I don't expect this to make much difference in terms of the corrected reads, but it's worth a try.
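Roughly, as a sketch (the minimap2 binary location and the rest of the canu command line are placeholders; only corOverlapper=minimap and the symlink idea come from the suggestion above):

# make minimap2 visible where canu expects its helper binaries
ln -s /path/to/minimap2 ./canu-1.5/Linux-amd64/bin/minimap2
# rerun correction using minimap for the correction overlaps only
canu -correct -p asm -d run_minimap genomeSize=10g \
     corOverlapper=minimap corMinCoverage=0 \
     -pacbio-raw raw_reads.fasta.gz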
So you mean try minimap for the read correction step only (and not for the obt and utg steps)? My latest assembly (with correctedError=0.1, minOverlapLength=300, minReadLength=500), starting from the same set of corrected reads (144 Gb corrected --> 41 Gb trimmed), resulted in double the size (7.7 Gb) but still small pieces: 245 K unitigs, N50 31 K ...
Yep, minimap just for correction.
Interesting that the genome size doubled; that could indicate high heterozygosity. Have you run GenomeScope (http://qb.cshl.edu/genomescope/) on the corrected reads? You should be able to give it the unitigging/0-*/*.hist file as input (just the first two columns, space separated). You could also run BUSCO to see if you have gene duplication leading to the larger assembly size.
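(Getting those two columns is a one-liner, e.g.:)

# keep 'count numDistinctKmers', space separated, as GenomeScope input
awk '{ print $1, $2 }' unitigging/0-*/*.hist > genomescope_input.hist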
Yep, ran BUSCO: got approx half of the genes, but very few duplicated.
Idle, reopen if needed.
I've run canu-1.5 for a huge genome with (only) 25x PacBio data on a grid:
./canu-1.5/Linux-amd64/bin/canu -correct genomeSize=huge(>10g) corErrorRate=0.02 batThreads=64 batMemory=250 maxThreads=128 maxMemory=1024 gridEngineThreadsOption="-pe smp THREADS" gridOptions="-r y -V -p -500" gridEngineMemoryOption="-l mem_free=MEMORY" -pacbio-raw
and initially got no corrected reads because of the corErrorRate=0.02 setting. I then changed the options to rawErrorRate=0.3 correctedErrorRate=0.03, removed the 2-correction folder and asm.correctedReads.fasta.gz, and restarted the pipeline.
It now finished, but yielded only a 2x corrected read set. The input 25x is expected to reduce to ~8x (why?), but actually reduces to 3x (corrected) and 2x (trimmed). The report is given below. Which parameters should I tune to get more corrected reads for the assembly step? Maybe corMinCoverage=0? Or should I redo the pipeline from scratch, as some steps could have been messed up by the corErrorRate setting? (The correction step is taking a lot of time ... maybe I should also adjust maxCount??)
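For reference, the restart with the corrected parameter names (rawErrorRate/correctedErrorRate instead of corErrorRate) plus the corMinCoverage=0 suggestion from the discussion above would look roughly like this; the genome size and input file are placeholders, and the grid options are carried over unchanged from the original command:

./canu-1.5/Linux-amd64/bin/canu -correct \
    genomeSize=10g \
    rawErrorRate=0.3 correctedErrorRate=0.03 corMinCoverage=0 \
    batThreads=64 batMemory=250 maxThreads=128 maxMemory=1024 \
    gridEngineThreadsOption="-pe smp THREADS" \
    gridOptions="-r y -V -p -500" \
    gridEngineMemoryOption="-l mem_free=MEMORY" \
    -pacbio-raw raw_reads.fasta.gz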