marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/
656 stars 179 forks source link

An AMR gene missing in canu assembly but present in unassembled.fasta #523

Closed bioprojects closed 7 years ago

bioprojects commented 7 years ago

Dear developers,

Thank you for developing and maintaining this great software.

I found in two different datasets that an AMR gene was missing in canu assembly of a bacterial species, but present in the unassembled.fasta file or an alternative assembly created by miniasm.

I used the latest canu 1.5 with default parameters.

Below are results of BLAST of the AMR gene against the unassembled.fasta files of the two datasets.

http://yahara.hustle.ne.jp/projects/lftp_tmp/MM_PM_00000.png http://yahara.hustle.ne.jp/projects/lftp_tmp/MM_PM_00001.png

I guess the number of reads covering the AMR gene is small, but in the two datasets, approximately 3.1 and 7.2 Gb were sequenced by the nanopore R9.4. The coverage should be fine, and the number of assembled contig was one for each dataset as below

>tig00000001 len=4224298 reads=1046 covStat=379.97 gappedBases=no class=contig suggestRepeat=no suggestCircular=no
>tig00000001 len=4208843 reads=852 covStat=407.77 gappedBases=no class=contig suggestRepeat=no suggestCircular=no

Do you have any idea how to improve this problem? Thanks a lot in advance.

Best wishes,

Koji

=== Koji Yahara Senior Research Fellow Antimicrobial Resistance Research Center National Institute of Infectious Diseases Aoba-cho 4-2-1, Higashimurayama-shi, Tokyo 189-0002 Japan Tel: +81-42-561-0771 (Ex. 3539) Fax: +81-42-561-7173

gringer commented 7 years ago

I suspect the contig is too small to be assembled because a read has been found that is at least 75% of the contig.

http://canu.readthedocs.io/en/latest/faq.html#my-asm-contigs-fasta-is-empty-why

If you are assembling amplified or viral data, it is possible your assembly will be flagged as unassembled. Turn off filtering with the parameters contigFilter="2 1000 1.0 1.0 2".

skoren commented 7 years ago

Right, @gringer is correct. The unassembled filter defaults will remove contigs where everything (or almost everything) is contained in a single read. If the AMR gene is present on a short plasmid rather than the main chromosome, the plasmid could get filtered by this. The header lines in the asm.unassembled.fasta should give information on how many reads and how much coverage the tig with the AMR gene has, something like:

>tig00000351 len=158412 reads=934 covStat-165.71 gappedBases=no class=unassm suggestRepeat=no suggestCircular=no

This unassembled tig contains 934 reads but was filtered because the longest read is 130kb (you can find this info in the asm.contigs.layout.readToTig). The negative covStat indicates this tig has high coverage as well.

bioprojects commented 7 years ago

Thank you very much for your answers.

In one of the two datasets, the contigFilter="2 1000 1.0 1.0 2" option enabled assembly of a plasmid carrying the AMR gene: "tig00000003 len=170173 reads=4 covStat=4.92".

But in another dataset, the contigFilter="2 1000 1.0 1.0 2" option did not change the result: http://yahara.hustle.ne.jp/projects/lftp_tmp/MM_PM_00000.png It seems that there are only 2 reads with approximately 90kb that have the AMR gene. Is there any other way to assemble the plasmid carrying the AMR gene? If there is no way, we will do the sequencing again so that we could get more reads on the plasmid.

Thanks a lot in advance.

Best wishes,

Koji

skoren commented 7 years ago

How much coverage total do you have for this dataset? By default Canu will only correct 40x longest data and if you have ultra-long reads and high enough coverage, this could be quite large, removing most of the plasmid reads. You can tell Canu to correct more of the data using corMinCoverage=100.

bioprojects commented 7 years ago

In the .report file, it is written that

[UNITIGGING/READS]
   Found 1063 reads.
   Found 151426165 bases (39.84 times coverage).

so the final coverage is approximately 40x, as expected.

I think your suggestion is corOutCoverage=100 rather than corMinCoverage=100, right? I'll try. Thanks a lot again.

Best wishes,

Koji

skoren commented 7 years ago

That's the post-correction coverage, I'm curious about the total input coverage (the first histogram output by Canu). Yes, I meant corOutCoverage.

bioprojects commented 7 years ago

Thanks a lot again. The total input coverage is 7817076068 bases (2057.12 times coverage) according to the report file.

skoren commented 7 years ago

Yeah, in that case it's very likely the default 40x cutoff removed any reads which could have come from the plasmids. By default it only corrects the longest reads. In your case, your average corrected reads is 140kbp (151426165 / 1063) which is why the plasmid has no reads. The histogram output should confirm this, there should be a histogram on the input and a histogram of the post-correction reads that look like this:

-- In gatekeeper store 'correction/asm.gkpStore':
--   Found 15326084 reads.
--   Found 202297373428 bases (65.25 times coverage).
--
--   Read length histogram (one '*' equals 15873.34 reads):
--        0    999      0 
--     1000   1999 1111134 **********************************************************************
--     2000   2999 999286 **************************************************************
--     3000   3999 902404 ********************************************************
--     4000   4999 829355 ****************************************************
--     5000   5999 776099 ************************************************
--     6000   6999 728449 *********************************************
--     7000   7999 685063 *******************************************
--     8000   8999 646037 ****************************************
--     9000   9999 611854 **************************************
--    10000  10999 579891 ************************************
--    11000  11999 545977 **********************************
--    12000  12999 515892 ********************************
--    13000  13999 484771 ******************************
--    14000  14999 456053 ****************************
--    15000  15999 429368 ***************************
--    16000  16999 404595 *************************
--    17000  17999 379590 ***********************
--    18000  18999 357064 **********************
--    19000  19999 336519 *********************
--    20000  20999 317118 *******************
--    21000  21999 299393 ******************
--    22000  22999 284225 *****************
--    23000  23999 268192 ****************
--    24000  24999 249867 ***************
--    25000  25999 227466 **************
--    26000  26999 205223 ************
--    27000  27999 184659 ***********
--    28000  28999 166400 **********
--    29000  29999 149757 *********
--    30000  30999 135546 ********
--    31000  31999 124086 *******
--    32000  32999 113894 *******
--    33000  33999 105367 ******
--    34000  34999  96042 ******
--    35000  35999  88310 *****
--    36000  36999  79800 *****
--    37000  37999  70553 ****
--    38000  38999  63035 ***
--    39000  39999  54492 ***
--    40000  40999  47861 ***
--    41000  41999  40603 **
--    42000  42999  34452 **
--    43000  43999  28659 *
--    44000  44999  23803 *
--    45000  45999  19094 *
--    46000  46999  15166 
--    47000  47999  11885 
--    48000  48999   9301 
--    49000  49999   7414 
--    50000  50999   5895 
--    51000  51999   4490 
--    52000  52999   3303 
--    53000  53999   2725 
--    54000  54999   2015 
--    55000  55999   1549 
--    56000  56999   1198 
--    57000  57999    891 
--    58000  58999    746 
--    59000  59999    519 
--    60000  60999    402 
--    61000  61999    305 
--    62000  62999    249 
--    63000  63999    193 
--    64000  64999    139 
--    65000  65999    118 
--    66000  66999     90 
--    67000  67999     47 
--    68000  68999     43 
--    69000  69999     23 
--    70000  70999     26 
--    71000  71999     10 
--    72000  72999     18 
--    73000  73999      9 
--    74000  74999      7 
--    75000  75999      2 
--    76000  76999      3 
--    77000  77999      2 
--    78000  78999      1 
--    79000  79999      1 
--    80000  80999      1 

and post correction:

--   Found 6005267 reads.
--   Found 115151525620 bases (37.14 times coverage).
--
--   Read length histogram (one '*' equals 5911.52 reads):
--        0    999      0 
--     1000   1999  25218 ****
--     2000   2999  20073 ***
--     3000   3999  18639 ***
--     4000   4999  18735 ***
--     5000   5999  19358 ***
--     6000   6999  21206 ***
--     7000   7999  24923 ****
--     8000   8999  34229 *****
--     9000   9999 137219 ***********************
--    10000  10999 413807 **********************************************************************
--    11000  11999 410031 *********************************************************************
--    12000  12999 388347 *****************************************************************
--    13000  13999 366481 *************************************************************
--    14000  14999 346075 **********************************************************
--    15000  15999 325536 *******************************************************
--    16000  16999 304502 ***************************************************
--    17000  17999 285213 ************************************************
--    18000  18999 267161 *********************************************
--    19000  19999 249353 ******************************************
--    20000  20999 234270 ***************************************
--    21000  21999 221620 *************************************
--    22000  22999 207612 ***********************************
--    23000  23999 193920 ********************************
--    24000  24999 174525 *****************************
--    25000  25999 154273 **************************
--    26000  26999 136457 ***********************
--    27000  27999 121088 ********************
--    28000  28999 107683 ******************
--    29000  29999  95358 ****************
--    30000  30999  86798 **************
--    31000  31999  78758 *************
--    32000  32999  71971 ************
--    33000  33999  66286 ***********
--    34000  34999  59843 **********
--    35000  35999  53668 *********
--    36000  36999  47041 *******
--    37000  37999  40741 ******
--    38000  38999  34975 *****
--    39000  39999  29913 *****
--    40000  40999  24989 ****
--    41000  41999  20527 ***
--    42000  42999  16441 **
--    43000  43999  12930 **
--    44000  44999   9948 *
--    45000  45999   7434 *
--    46000  46999   5549 
--    47000  47999   4183 
--    48000  48999   3041 
--    49000  49999   2221 
--    50000  50999   1547 
--    51000  51999   1124 
--    52000  52999    771 
--    53000  53999    535 
--    54000  54999    357 
--    55000  55999    242 
--    56000  56999    187 
--    57000  57999    120 
--    58000  58999     76 
--    59000  59999     45 
--    60000  60999     34 
--    61000  61999     12 
--    62000  62999     16 
--    63000  63999     10 
--    64000  64999      5 
--    65000  65999      6 
--    66000  66999      5 
--    67000  67999      2 
--    68000  68999      1 
--    69000  69999      2 
--    70000  70999      0 
--    71000  71999      1 

You can see the correction eliminates most short reads, in this case <9000. Setting corOutCoverage to 100 would probably be enough but you can also check what coverage you need to set to correct reads longer than about 50kb to make sure you get the plasmid. This is also mentioned on the Canu FAQ: http://canu.readthedocs.io/en/latest/faq.html#why-is-my-assembly-is-missing-my-favorite-short-plasmid.

bioprojects commented 7 years ago

I've confirmed your point from the following histgrams:

[total input ]
-- In gatekeeper store 'correction/170420_367_cat_MM_FAB49580.gkpStore':
--   Found 514126 reads.
--   Found 7817076068 bases (2057.12 times coverage).
--
--   Read length histogram (one '*' equals 4209.38 reads):
--        0   9999 294657 **********************************************************************
--    10000  19999  98238 ***********************
--    20000  29999  47945 ***********
--    30000  39999  27080 ******
--    40000  49999  16230 ***
--    50000  59999  10022 **
--    60000  69999   7295 *
--    70000  79999   4121
--    80000  89999   2590
--    90000  99999   1808
--   100000 109999   1185
--   110000 119999    790
--   120000 129999    564
--   130000 139999    432
--   140000 149999    311
--   150000 159999    222
--   160000 169999    152
--   170000 179999    107
--   180000 189999     81
--   190000 199999     69
--   200000 209999     52
--   210000 219999     45
--   220000 229999     33
--   230000 239999     23
--   240000 249999      8
--   250000 259999     15
--   260000 269999     12
--   270000 279999      7
--   280000 289999      3
--   290000 299999      7
--   300000 309999      3
--   310000 319999      4
--   320000 329999      3
--   330000 339999      2
--   340000 349999      3
--   350000 359999      0
--   360000 369999      2
--   370000 379999      1
--   380000 389999      0
--   390000 399999      1
--   400000 409999      0
--   410000 419999      0
--   420000 429999      0
--   430000 439999      0
--   440000 449999      0
--   450000 459999      0
--   460000 469999      0
--   470000 479999      1
--   480000 489999      0
--   490000 499999      0
--   500000 509999      0
--   510000 519999      0
--   520000 529999      0
--   530000 539999      0
--   540000 549999      0
--   550000 559999      0
--   560000 569999      0
--   570000 579999      0
--   580000 589999      0
--   590000 599999      0
--   600000 609999      0
--   610000 619999      0
--   620000 629999      1
--   630000 639999      0
--   640000 649999      0
--   650000 659999      0
--   660000 669999      0
--   670000 679999      0
--   680000 689999      0
--   690000 699999      0
--   700000 709999      0
--   710000 719999      0
--   720000 729999      0
--   730000 739999      0
--   740000 749999      0
--   750000 759999      0
--   760000 769999      0
--   770000 779999      0
--   780000 789999      0
--   790000 799999      0
--   800000 809999      0
--   810000 819999      0
--   820000 829999      0
--   830000 839999      0
--   840000 849999      0
--   850000 859999      0
--   860000 869999      0
--   870000 879999      0
--   880000 889999      0
--   890000 899999      0
--   900000 909999      0
--   910000 919999      0
--   920000 929999      0
--   930000 939999      0
--   940000 949999      0
--   950000 959999      0
--   960000 969999      0
--   970000 979999      0
--   980000 989999      0
--   990000 999999      0
--   1000000 1009999      0
--   1010000 1019999      0
--   1020000 1029999      0
--   1030000 1039999      0
--   1040000 1049999      0
--   1050000 1059999      0
--   1060000 1069999      0
--   1070000 1079999      0
--   1080000 1089999      0
--   1090000 1099999      0
--   1100000 1109999      1
[after correction]
-- In gatekeeper store 'unitigging/170420_367_cat_MM_FAB49580.gkpStore':
--   Found 1063 reads.
--   Found 151426165 bases (39.84 times coverage).
--
--   Read length histogram (one '*' equals 2.32 reads):
--        0   4999      0
--     5000   9999      2
--    10000  14999      2
--    15000  19999      3 *
--    20000  24999      0
--    25000  29999      0
--    30000  34999      1
--    35000  39999      0
--    40000  44999      0
--    45000  49999      1
--    50000  54999      0
--    55000  59999      0
--    60000  64999      1
--    65000  69999      1
--    70000  74999      1
--    75000  79999      0
--    80000  84999      0
--    85000  89999      1
--    90000  94999      2
--    95000  99999      1
--   100000 104999      2
--   105000 109999      5 **
--   110000 114999     79 *********************************
--   115000 119999    163 **********************************************************************
--   120000 124999    128 ******************************************************
--   125000 129999    100 ******************************************
--   130000 134999    102 *******************************************
--   135000 139999     68 *****************************
--   140000 144999     70 ******************************
--   145000 149999     39 ****************
--   150000 154999     33 **************
--   155000 159999     24 **********
--   160000 164999     27 ***********
--   165000 169999     32 *************
--   170000 174999     23 *********
--   175000 179999     25 **********
--   180000 184999     16 ******
--   185000 189999      9 ***
--   190000 194999     15 ******
--   195000 199999      8 ***
--   200000 204999     13 *****
--   205000 209999      9 ***
--   210000 214999      4 *
--   215000 219999      8 ***
--   220000 224999      4 *
--   225000 229999      5 **
--   230000 234999      6 **
--   235000 239999      4 *
--   240000 244999      1
--   245000 249999      1
--   250000 254999      0
--   255000 259999      5 **
--   260000 264999      2
--   265000 269999      1
--   270000 274999      4 *
--   275000 279999      0
--   280000 284999      3 *
--   285000 289999      1
--   290000 294999      2
--   295000 299999      0
--   300000 304999      0
--   305000 309999      1
--   310000 314999      1
--   315000 319999      1
--   320000 324999      0
--   325000 329999      0
--   330000 334999      0
--   335000 339999      1
--   340000 344999      0
--   345000 349999      1
--   350000 354999      0
--   355000 359999      0
--   360000 364999      1
skoren commented 7 years ago

Yes, so increase the corOutCoverage to 100 or 200 or take a random subset of 100-200x of your data and run with defaults, both should capture enough data for the plasmid to assemble.

bioprojects commented 7 years ago

Thank you, I'll try it!!

brianwalenz commented 7 years ago

Reopen if needed.

bioprojects commented 7 years ago

Thank you very much again for your kind responses previously. The corOutCoverage=100 and contigFilter="2 1000 1.0 1.0 2" options succeeded in assembling the plasmids by Canu 1.5.

Does the contigFilter="2 1000 1.0 1.0 2" work for Canu 1.6 as well? The release note in Canu 1.6 writes in bold that "Note that the contigFilter option values have changed and old values run the risk of filtering incorrectly."

I tried to find how the contigFilter option changed by comparing the latest and previous manuals, but I was not sure.

I really appreciate your comments. Thanks a lot in advance.

Best wishes,

Koji

=== Koji Yahara Senior Research Fellow Antimicrobial Resistance Research Center National Institute of Infectious Diseases Aoba-cho 4-2-1, Higashimurayama-shi, Tokyo 189-0002 Japan Tel: +81-42-561-0771 (Ex. 3539) Fax: +81-42-561-7173

brianwalenz commented 7 years ago

The values have the same meaning, but they're applied in the middle of contig construction instead of at the very end. The new parameters filter initial contigs that are below 5x coverage. In addition to plasmids, the old filter was catching repeats, which resulted in disconnected output graphs. The new filter seems to only catch reads that don't assemble at all, and reads that partially assembled (e.g., with one other read).

A bit more detail is at: http://canu.readthedocs.io/en/latest/faq.html#my-asm-contigs-fasta-is-empty-why

bioprojects commented 7 years ago

Thank you. In order not to miss an AMR gene in a plasmid, I've been using the following option: overlapper=mhap utgReAlign=true corOutCoverage=100 contigFilter="2 1000 1.0 1.0 2"

I've confirmed that the option works for both canu v1.5 and v1.6.

Thank you!