Closed bioprojects closed 7 years ago
I suspect the contig is too small to be assembled because a read has been found that is at least 75% of the contig.
http://canu.readthedocs.io/en/latest/faq.html#my-asm-contigs-fasta-is-empty-why
If you are assembling amplified or viral data, it is possible your assembly will be flagged as unassembled. Turn off filtering with the parameters
contigFilter="2 1000 1.0 1.0 2"
.
Right, @gringer is correct. The unassembled filter defaults will remove contigs where everything (or almost everything) is contained in a single read. If the AMR gene is present on a short plasmid rather than the main chromosome, the plasmid could get filtered by this. The header lines in the asm.unassembled.fasta should give information on how many reads and how much coverage the tig with the AMR gene has, something like:
>tig00000351 len=158412 reads=934 covStat-165.71 gappedBases=no class=unassm suggestRepeat=no suggestCircular=no
This unassembled tig contains 934 reads but was filtered because the longest read is 130kb (you can find this info in the asm.contigs.layout.readToTig
). The negative covStat indicates this tig has high coverage as well.
Thank you very much for your answers.
In one of the two datasets, the contigFilter="2 1000 1.0 1.0 2" option enabled assembly of a plasmid carrying the AMR gene: "tig00000003 len=170173 reads=4 covStat=4.92".
But in another dataset, the contigFilter="2 1000 1.0 1.0 2" option did not change the result: http://yahara.hustle.ne.jp/projects/lftp_tmp/MM_PM_00000.png It seems that there are only 2 reads with approximately 90kb that have the AMR gene. Is there any other way to assemble the plasmid carrying the AMR gene? If there is no way, we will do the sequencing again so that we could get more reads on the plasmid.
Thanks a lot in advance.
Best wishes,
Koji
How much coverage total do you have for this dataset? By default Canu will only correct 40x longest data and if you have ultra-long reads and high enough coverage, this could be quite large, removing most of the plasmid reads. You can tell Canu to correct more of the data using corMinCoverage=100
.
In the .report file, it is written that
[UNITIGGING/READS]
Found 1063 reads.
Found 151426165 bases (39.84 times coverage).
so the final coverage is approximately 40x, as expected.
I think your suggestion is corOutCoverage=100 rather than corMinCoverage=100, right? I'll try. Thanks a lot again.
Best wishes,
Koji
That's the post-correction coverage, I'm curious about the total input coverage (the first histogram output by Canu). Yes, I meant corOutCoverage.
Thanks a lot again. The total input coverage is 7817076068 bases (2057.12 times coverage)
according to the report file.
Yeah, in that case it's very likely the default 40x cutoff removed any reads which could have come from the plasmids. By default it only corrects the longest reads. In your case, your average corrected reads is 140kbp (151426165 / 1063) which is why the plasmid has no reads. The histogram output should confirm this, there should be a histogram on the input and a histogram of the post-correction reads that look like this:
-- In gatekeeper store 'correction/asm.gkpStore':
-- Found 15326084 reads.
-- Found 202297373428 bases (65.25 times coverage).
--
-- Read length histogram (one '*' equals 15873.34 reads):
-- 0 999 0
-- 1000 1999 1111134 **********************************************************************
-- 2000 2999 999286 **************************************************************
-- 3000 3999 902404 ********************************************************
-- 4000 4999 829355 ****************************************************
-- 5000 5999 776099 ************************************************
-- 6000 6999 728449 *********************************************
-- 7000 7999 685063 *******************************************
-- 8000 8999 646037 ****************************************
-- 9000 9999 611854 **************************************
-- 10000 10999 579891 ************************************
-- 11000 11999 545977 **********************************
-- 12000 12999 515892 ********************************
-- 13000 13999 484771 ******************************
-- 14000 14999 456053 ****************************
-- 15000 15999 429368 ***************************
-- 16000 16999 404595 *************************
-- 17000 17999 379590 ***********************
-- 18000 18999 357064 **********************
-- 19000 19999 336519 *********************
-- 20000 20999 317118 *******************
-- 21000 21999 299393 ******************
-- 22000 22999 284225 *****************
-- 23000 23999 268192 ****************
-- 24000 24999 249867 ***************
-- 25000 25999 227466 **************
-- 26000 26999 205223 ************
-- 27000 27999 184659 ***********
-- 28000 28999 166400 **********
-- 29000 29999 149757 *********
-- 30000 30999 135546 ********
-- 31000 31999 124086 *******
-- 32000 32999 113894 *******
-- 33000 33999 105367 ******
-- 34000 34999 96042 ******
-- 35000 35999 88310 *****
-- 36000 36999 79800 *****
-- 37000 37999 70553 ****
-- 38000 38999 63035 ***
-- 39000 39999 54492 ***
-- 40000 40999 47861 ***
-- 41000 41999 40603 **
-- 42000 42999 34452 **
-- 43000 43999 28659 *
-- 44000 44999 23803 *
-- 45000 45999 19094 *
-- 46000 46999 15166
-- 47000 47999 11885
-- 48000 48999 9301
-- 49000 49999 7414
-- 50000 50999 5895
-- 51000 51999 4490
-- 52000 52999 3303
-- 53000 53999 2725
-- 54000 54999 2015
-- 55000 55999 1549
-- 56000 56999 1198
-- 57000 57999 891
-- 58000 58999 746
-- 59000 59999 519
-- 60000 60999 402
-- 61000 61999 305
-- 62000 62999 249
-- 63000 63999 193
-- 64000 64999 139
-- 65000 65999 118
-- 66000 66999 90
-- 67000 67999 47
-- 68000 68999 43
-- 69000 69999 23
-- 70000 70999 26
-- 71000 71999 10
-- 72000 72999 18
-- 73000 73999 9
-- 74000 74999 7
-- 75000 75999 2
-- 76000 76999 3
-- 77000 77999 2
-- 78000 78999 1
-- 79000 79999 1
-- 80000 80999 1
and post correction:
-- Found 6005267 reads.
-- Found 115151525620 bases (37.14 times coverage).
--
-- Read length histogram (one '*' equals 5911.52 reads):
-- 0 999 0
-- 1000 1999 25218 ****
-- 2000 2999 20073 ***
-- 3000 3999 18639 ***
-- 4000 4999 18735 ***
-- 5000 5999 19358 ***
-- 6000 6999 21206 ***
-- 7000 7999 24923 ****
-- 8000 8999 34229 *****
-- 9000 9999 137219 ***********************
-- 10000 10999 413807 **********************************************************************
-- 11000 11999 410031 *********************************************************************
-- 12000 12999 388347 *****************************************************************
-- 13000 13999 366481 *************************************************************
-- 14000 14999 346075 **********************************************************
-- 15000 15999 325536 *******************************************************
-- 16000 16999 304502 ***************************************************
-- 17000 17999 285213 ************************************************
-- 18000 18999 267161 *********************************************
-- 19000 19999 249353 ******************************************
-- 20000 20999 234270 ***************************************
-- 21000 21999 221620 *************************************
-- 22000 22999 207612 ***********************************
-- 23000 23999 193920 ********************************
-- 24000 24999 174525 *****************************
-- 25000 25999 154273 **************************
-- 26000 26999 136457 ***********************
-- 27000 27999 121088 ********************
-- 28000 28999 107683 ******************
-- 29000 29999 95358 ****************
-- 30000 30999 86798 **************
-- 31000 31999 78758 *************
-- 32000 32999 71971 ************
-- 33000 33999 66286 ***********
-- 34000 34999 59843 **********
-- 35000 35999 53668 *********
-- 36000 36999 47041 *******
-- 37000 37999 40741 ******
-- 38000 38999 34975 *****
-- 39000 39999 29913 *****
-- 40000 40999 24989 ****
-- 41000 41999 20527 ***
-- 42000 42999 16441 **
-- 43000 43999 12930 **
-- 44000 44999 9948 *
-- 45000 45999 7434 *
-- 46000 46999 5549
-- 47000 47999 4183
-- 48000 48999 3041
-- 49000 49999 2221
-- 50000 50999 1547
-- 51000 51999 1124
-- 52000 52999 771
-- 53000 53999 535
-- 54000 54999 357
-- 55000 55999 242
-- 56000 56999 187
-- 57000 57999 120
-- 58000 58999 76
-- 59000 59999 45
-- 60000 60999 34
-- 61000 61999 12
-- 62000 62999 16
-- 63000 63999 10
-- 64000 64999 5
-- 65000 65999 6
-- 66000 66999 5
-- 67000 67999 2
-- 68000 68999 1
-- 69000 69999 2
-- 70000 70999 0
-- 71000 71999 1
You can see the correction eliminates most short reads, in this case <9000. Setting corOutCoverage to 100 would probably be enough but you can also check what coverage you need to set to correct reads longer than about 50kb to make sure you get the plasmid. This is also mentioned on the Canu FAQ: http://canu.readthedocs.io/en/latest/faq.html#why-is-my-assembly-is-missing-my-favorite-short-plasmid.
I've confirmed your point from the following histgrams:
[total input ]
-- In gatekeeper store 'correction/170420_367_cat_MM_FAB49580.gkpStore':
-- Found 514126 reads.
-- Found 7817076068 bases (2057.12 times coverage).
--
-- Read length histogram (one '*' equals 4209.38 reads):
-- 0 9999 294657 **********************************************************************
-- 10000 19999 98238 ***********************
-- 20000 29999 47945 ***********
-- 30000 39999 27080 ******
-- 40000 49999 16230 ***
-- 50000 59999 10022 **
-- 60000 69999 7295 *
-- 70000 79999 4121
-- 80000 89999 2590
-- 90000 99999 1808
-- 100000 109999 1185
-- 110000 119999 790
-- 120000 129999 564
-- 130000 139999 432
-- 140000 149999 311
-- 150000 159999 222
-- 160000 169999 152
-- 170000 179999 107
-- 180000 189999 81
-- 190000 199999 69
-- 200000 209999 52
-- 210000 219999 45
-- 220000 229999 33
-- 230000 239999 23
-- 240000 249999 8
-- 250000 259999 15
-- 260000 269999 12
-- 270000 279999 7
-- 280000 289999 3
-- 290000 299999 7
-- 300000 309999 3
-- 310000 319999 4
-- 320000 329999 3
-- 330000 339999 2
-- 340000 349999 3
-- 350000 359999 0
-- 360000 369999 2
-- 370000 379999 1
-- 380000 389999 0
-- 390000 399999 1
-- 400000 409999 0
-- 410000 419999 0
-- 420000 429999 0
-- 430000 439999 0
-- 440000 449999 0
-- 450000 459999 0
-- 460000 469999 0
-- 470000 479999 1
-- 480000 489999 0
-- 490000 499999 0
-- 500000 509999 0
-- 510000 519999 0
-- 520000 529999 0
-- 530000 539999 0
-- 540000 549999 0
-- 550000 559999 0
-- 560000 569999 0
-- 570000 579999 0
-- 580000 589999 0
-- 590000 599999 0
-- 600000 609999 0
-- 610000 619999 0
-- 620000 629999 1
-- 630000 639999 0
-- 640000 649999 0
-- 650000 659999 0
-- 660000 669999 0
-- 670000 679999 0
-- 680000 689999 0
-- 690000 699999 0
-- 700000 709999 0
-- 710000 719999 0
-- 720000 729999 0
-- 730000 739999 0
-- 740000 749999 0
-- 750000 759999 0
-- 760000 769999 0
-- 770000 779999 0
-- 780000 789999 0
-- 790000 799999 0
-- 800000 809999 0
-- 810000 819999 0
-- 820000 829999 0
-- 830000 839999 0
-- 840000 849999 0
-- 850000 859999 0
-- 860000 869999 0
-- 870000 879999 0
-- 880000 889999 0
-- 890000 899999 0
-- 900000 909999 0
-- 910000 919999 0
-- 920000 929999 0
-- 930000 939999 0
-- 940000 949999 0
-- 950000 959999 0
-- 960000 969999 0
-- 970000 979999 0
-- 980000 989999 0
-- 990000 999999 0
-- 1000000 1009999 0
-- 1010000 1019999 0
-- 1020000 1029999 0
-- 1030000 1039999 0
-- 1040000 1049999 0
-- 1050000 1059999 0
-- 1060000 1069999 0
-- 1070000 1079999 0
-- 1080000 1089999 0
-- 1090000 1099999 0
-- 1100000 1109999 1
[after correction]
-- In gatekeeper store 'unitigging/170420_367_cat_MM_FAB49580.gkpStore':
-- Found 1063 reads.
-- Found 151426165 bases (39.84 times coverage).
--
-- Read length histogram (one '*' equals 2.32 reads):
-- 0 4999 0
-- 5000 9999 2
-- 10000 14999 2
-- 15000 19999 3 *
-- 20000 24999 0
-- 25000 29999 0
-- 30000 34999 1
-- 35000 39999 0
-- 40000 44999 0
-- 45000 49999 1
-- 50000 54999 0
-- 55000 59999 0
-- 60000 64999 1
-- 65000 69999 1
-- 70000 74999 1
-- 75000 79999 0
-- 80000 84999 0
-- 85000 89999 1
-- 90000 94999 2
-- 95000 99999 1
-- 100000 104999 2
-- 105000 109999 5 **
-- 110000 114999 79 *********************************
-- 115000 119999 163 **********************************************************************
-- 120000 124999 128 ******************************************************
-- 125000 129999 100 ******************************************
-- 130000 134999 102 *******************************************
-- 135000 139999 68 *****************************
-- 140000 144999 70 ******************************
-- 145000 149999 39 ****************
-- 150000 154999 33 **************
-- 155000 159999 24 **********
-- 160000 164999 27 ***********
-- 165000 169999 32 *************
-- 170000 174999 23 *********
-- 175000 179999 25 **********
-- 180000 184999 16 ******
-- 185000 189999 9 ***
-- 190000 194999 15 ******
-- 195000 199999 8 ***
-- 200000 204999 13 *****
-- 205000 209999 9 ***
-- 210000 214999 4 *
-- 215000 219999 8 ***
-- 220000 224999 4 *
-- 225000 229999 5 **
-- 230000 234999 6 **
-- 235000 239999 4 *
-- 240000 244999 1
-- 245000 249999 1
-- 250000 254999 0
-- 255000 259999 5 **
-- 260000 264999 2
-- 265000 269999 1
-- 270000 274999 4 *
-- 275000 279999 0
-- 280000 284999 3 *
-- 285000 289999 1
-- 290000 294999 2
-- 295000 299999 0
-- 300000 304999 0
-- 305000 309999 1
-- 310000 314999 1
-- 315000 319999 1
-- 320000 324999 0
-- 325000 329999 0
-- 330000 334999 0
-- 335000 339999 1
-- 340000 344999 0
-- 345000 349999 1
-- 350000 354999 0
-- 355000 359999 0
-- 360000 364999 1
Yes, so increase the corOutCoverage to 100 or 200 or take a random subset of 100-200x of your data and run with defaults, both should capture enough data for the plasmid to assemble.
Thank you, I'll try it!!
Reopen if needed.
Thank you very much again for your kind responses previously. The corOutCoverage=100 and contigFilter="2 1000 1.0 1.0 2" options succeeded in assembling the plasmids by Canu 1.5.
Does the contigFilter="2 1000 1.0 1.0 2" work for Canu 1.6 as well? The release note in Canu 1.6 writes in bold that "Note that the contigFilter option values have changed and old values run the risk of filtering incorrectly."
I tried to find how the contigFilter option changed by comparing the latest and previous manuals, but I was not sure.
I really appreciate your comments. Thanks a lot in advance.
Best wishes,
Koji
=== Koji Yahara Senior Research Fellow Antimicrobial Resistance Research Center National Institute of Infectious Diseases Aoba-cho 4-2-1, Higashimurayama-shi, Tokyo 189-0002 Japan Tel: +81-42-561-0771 (Ex. 3539) Fax: +81-42-561-7173
The values have the same meaning, but they're applied in the middle of contig construction instead of at the very end. The new parameters filter initial contigs that are below 5x coverage. In addition to plasmids, the old filter was catching repeats, which resulted in disconnected output graphs. The new filter seems to only catch reads that don't assemble at all, and reads that partially assembled (e.g., with one other read).
A bit more detail is at: http://canu.readthedocs.io/en/latest/faq.html#my-asm-contigs-fasta-is-empty-why
Thank you. In order not to miss an AMR gene in a plasmid, I've been using the following option: overlapper=mhap utgReAlign=true corOutCoverage=100 contigFilter="2 1000 1.0 1.0 2"
I've confirmed that the option works for both canu v1.5 and v1.6.
Thank you!
Dear developers,
Thank you for developing and maintaining this great software.
I found in two different datasets that an AMR gene was missing in canu assembly of a bacterial species, but present in the unassembled.fasta file or an alternative assembly created by miniasm.
I used the latest canu 1.5 with default parameters.
Below are results of BLAST of the AMR gene against the unassembled.fasta files of the two datasets.
http://yahara.hustle.ne.jp/projects/lftp_tmp/MM_PM_00000.png http://yahara.hustle.ne.jp/projects/lftp_tmp/MM_PM_00001.png
I guess the number of reads covering the AMR gene is small, but in the two datasets, approximately 3.1 and 7.2 Gb were sequenced by the nanopore R9.4. The coverage should be fine, and the number of assembled contig was one for each dataset as below
Do you have any idea how to improve this problem? Thanks a lot in advance.
Best wishes,
Koji
=== Koji Yahara Senior Research Fellow Antimicrobial Resistance Research Center National Institute of Infectious Diseases Aoba-cho 4-2-1, Higashimurayama-shi, Tokyo 189-0002 Japan Tel: +81-42-561-0771 (Ex. 3539) Fax: +81-42-561-7173