marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/
655 stars 179 forks source link

Do I need read correction with guppy5 and SUP model? #2058

Closed gushiro closed 2 years ago

gushiro commented 2 years ago

I was wondering if I can avoid the correction step of my assembly and jumping directly to the trimming step and so on. My reads have been called with the SUP models in Guppy5+, but not sure if the Kmer histogram of the reads shows low error rate. I basically want to avoid the error correction because it was taken over 5Tb in my previous runs and I don't have enough disk space.

The histogram does not show a peak as clear as I have seen it by another users who asked related question. Also, I am thinking to use a min read length of 5K to get the largest 25X at least: would it be OK?.

I should mention the species is a diploid with a genome size of 2.5gb and 3.5% of heterozygosity.

Thanks in advance. Here is the output of Meryl:

[CORRECTION/READS]
--
-- In sequence store './test_meryl.seqStore':
--   Found 17012572 reads.
--   Found 89476994241 bases (35.79 times coverage).
--    Histogram of raw reads:
--    
--    G=89476994241                      sum of  ||               length     num
--    NG         length     index       lengths  ||                range    seqs
--    ----- ------------ --------- ------------  ||  ------------------- -------
--    00010        26389    262267   8947716664  ||       1000-21395     16502224---------------------------------------------------------------
--    00020        19447    661924  17895408426  ||      21396-41791      476282|--
--    00030        15085   1186749  26843108413  ||      41792-62187       30838|-
--    00040        11826   1858202  35790808981  ||      62188-82583        2634|-
--    00050         9156   2718822  44738498848  ||      82584-102979        377|-
--    00060         6880   3846362  53686201313  ||     102980-123375         82|-
--    00070         4961   5379157  62633900088  ||     123376-143771         36|-
--    00080         3384   7563481  71581598453  ||     143772-164167         10|-
--    00090         2093  10919410  80529294875  ||     164168-184563         19|-
--    00100         1000  17012571  89476994241  ||     184564-204959          6|-
--    001.000x            17012572  89476994241  ||     204960-225355          6|-
--                                               ||     225356-245751         12|-
--                                               ||     245752-266147          7|-
--                                               ||     266148-286543          5|-
--                                               ||     286544-306939          6|-
--                                               ||     306940-327335          6|-
--                                               ||     327336-347731          2|-
--                                               ||     347732-368127          2|-
--                                               ||     368128-388523          0|
--                                               ||     388524-408919          1|-
--                                               ||     408920-429315          4|-
--                                               ||     429316-449711          2|-
--                                               ||     449712-470107          1|-
--                                               ||     470108-490503          2|-
--                                               ||     490504-510899          0|
--                                               ||     510900-531295          2|-
--                                               ||     531296-551691          1|-
--                                               ||     551692-572087          0|
--                                               ||     572088-592483          0|
--                                               ||     592484-612879          0|
--                                               ||     612880-633275          1|-
--                                               ||     633276-653671          0|
--                                               ||     653672-674067          0|
--                                               ||     674068-694463          1|-
--                                               ||     694464-714859          0|
--                                               ||     714860-735255          0|
--                                               ||     735256-755651          0|
--                                               ||     755652-776047          0|
--                                               ||     776048-796443          0|
--                                               ||     796444-816839          0|
--                                               ||     816840-837235          1|-
--                                               ||     837236-857631          0|
--                                               ||     857632-878027          0|
--                                               ||     878028-898423          0|
--                                               ||     898424-918819          0|
--                                               ||     918820-939215          0|
--                                               ||     939216-959611          1|-
--                                               ||     959612-980007          0|
--                                               ||     980008-1000403         0|
--                                               ||    1000404-1020799         1|-
--

[CORRECTION/MERS]
--
--  16-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1         0                                                                        0.0000 0.0000
--       2-     2 166632968 ***********************************************                        0.0919 0.0037
--       3-     4 243105626 ********************************************************************** 0.1663 0.0083
--       5-     7 228530010 *****************************************************************      0.2747 0.0181
--       8-    11 202321627 **********************************************************             0.3838 0.0335
--      12-    16 178904141 ***************************************************                    0.4860 0.0551
--      17-    22 150774668 *******************************************                            0.5781 0.0830
--      23-    29 121548013 **********************************                                     0.6563 0.1154
--      30-    37  96837436 ***************************                                            0.7199 0.1502
--      38-    46  76944925 **********************                                                 0.7710 0.1858
--      47-    56  61168682 *****************                                                      0.8119 0.2214
--      57-    67  48780641 **************                                                         0.8445 0.2561
--      68-    79  39090943 ***********                                                            0.8706 0.2895
--      80-    92  31532655 *********                                                              0.8916 0.3213
--      93-   106  25618096 *******                                                                0.9086 0.3513
--     107-   121  20956800 ******                                                                 0.9225 0.3796
--     122-   137  17249077 ****                                                                   0.9338 0.4061
--     138-   154  14268858 ****                                                                   0.9432 0.4309
--     155-   172  11871523 ***                                                                    0.9510 0.4541
--     173-   191   9957707 **                                                                     0.9574 0.4757
--     192-   211   8400963 **                                                                     0.9628 0.4959
--     212-   232   7121744 **                                                                     0.9674 0.5148
--     233-   254   6080718 *                                                                      0.9713 0.5324
--     255-   277   5209459 *                                                                      0.9746 0.5489
--     278-   301   4485090 *                                                                      0.9775 0.5644
--     302-   326   3879306 *                                                                      0.9799 0.5789
--     327-   352   3363758                                                                        0.9821 0.5925
--     353-   379   2926168                                                                        0.9839 0.6052
--     380-   407   2558923                                                                        0.9855 0.6172
--     408-   436   2242577                                                                        0.9869 0.6285
--     437-   466   1972145                                                                        0.9881 0.6391
--     467-   497   1742581                                                                        0.9892 0.6490
--     498-   529   1543381                                                                        0.9902 0.6584
--     530-   562   1373857                                                                        0.9910 0.6673
--     563-   596   1227834                                                                        0.9918 0.6757
--     597-   631   1100329                                                                        0.9924 0.6837
--     632-   667    987847                                                                        0.9930 0.6912
--     668-   704    888683                                                                        0.9936 0.6984
--     705-   742    803260                                                                        0.9941 0.7052
--     743-   781    726228                                                                        0.9945 0.7117
--     782-   821    659351                                                                        0.9949 0.7179
--
--           0 (max occurrences)
-- 89035034101 (total mers, non-unique)
--  1813969049 (distinct mers, non-unique)
--           0 (unique mers)
skoren commented 2 years ago

Those reads don't look high quality enough based on the 16-mer distribution but I'd still say try the uncorrected option. The k-mer histogram may look different w/the larger k-mer and the homopolymers compressed. If it still looks like the above, then your reads don't look high enough quality to skip correction. You can include the suggestions on the FAQ to decrease the space used.

gushiro commented 2 years ago

@skoren thanks. I did what you suggest, what do you think (see below)? I can see there is peak ( frequency of 2) but not sure if it means the reads are of enough quality to skip the correction step.

PD: I am copying only part of the entire file.

Number of 22-mers that are:
  unique                      0  (exactly one instance of the kmer is in the input)
  distinct           4030482973  (non-redundant kmer sequences in the input)
  present           32827508257  (...)
  missing        17588155561443  (non-redundant kmer sequences not in the input)

             number of   cumulative   cumulative     presence
              distinct     fraction     fraction   in dataset
frequency        kmers     distinct        total       (1e-6)
--------- ------------ ------------ ------------ ------------
        2   1254841656       0.3113       0.0765     0.000061
        3    662503087       0.4757       0.1370     0.000091
        4    433655786       0.5833       0.1898     0.000122
        5    313489010       0.6611       0.2376     0.000152
        6    237268518       0.7200       0.2809     0.000183
        7    184059664       0.7656       0.3202     0.000213
        8    145135980       0.8016       0.3556     0.000244
        9    115845145       0.8304       0.3873     0.000274
       10     93407136       0.8535       0.4158     0.000305
       11     75998201       0.8724       0.4412     0.000335
       12     62388033       0.8879       0.4641     0.000366
       13     51678258       0.9007       0.4845     0.000396
       14     43207931       0.9114       0.5029     0.000426
       15     36441369       0.9205       0.5196     0.000457
       16     31008624       0.9282       0.5347     0.000487
       17     26607229       0.9348       0.5485     0.000518
       18     22995514       0.9405       0.5611     0.000548
       19     20007515       0.9454       0.5727     0.000579
       20     17517691       0.9498       0.5833     0.000609
       21     15425425       0.9536       0.5932     0.000640
       22     13658537       0.9570       0.6024     0.000670
       23     12145486       0.9600       0.6109     0.000701
       24     10853354       0.9627       0.6188     0.000731
       25      9738236       0.9651       0.6262     0.000762
       26      8769288       0.9673       0.6332     0.000792
       27      7927700       0.9693       0.6397     0.000822
       28      7188911       0.9710       0.6458     0.000853
       29      6544789       0.9727       0.6516     0.000883
       30      5970442       0.9741       0.6571     0.000914
       31      5464074       0.9755       0.6622     0.000944
       32      5022570       0.9767       0.6671     0.000975
       33      4618440       0.9779       0.6718     0.001005
       34      4257592       0.9789       0.6762     0.001036
       35      3942236       0.9799       0.6804     0.001066
skoren commented 2 years ago

Are these counted in homopolymers compressed space?

gushiro commented 2 years ago

@skoren It is the file ../0-mercounts/spp_sup.ms22.histogram created after I run Canu as: canu -d ID -p spp_sup minReadLength=5000 genomeSize=2.5g -untrimmed correctedErrorRate=0.12 maxInputCoverage=100 'batOptions=-eg 0.10 -sb 0.01 -dg 2 -db 1 -dr 3' -pacbio-hifi my.ont.fastq.gz

skoren commented 2 years ago

I don't see any peak there (other than the error peak) so I don't think skipping correction will work.

gushiro commented 2 years ago

@skoren , what about this new distribution (corrected reads with masurca):

Looks better to me but I wonder if you think are OK to avoid the correction step.

Number of 22-mers that are:
  unique                      0  (exactly one instance of the kmer is in the input)
  distinct           1827589066  (non-redundant kmer sequences in the input)
  present           40744590960  (...)
  missing        17590358455350  (non-redundant kmer sequences not in the input)

             number of   cumulative   cumulative     presence
              distinct     fraction     fraction   in dataset
frequency        kmers     distinct        total       (1e-6)
--------- ------------ ------------ ------------ ------------
        2     39522448       0.0216       0.0019     0.000049
        3     35123122       0.0408       0.0045     0.000074
        4     51154993       0.0688       0.0095     0.000098
        5     72721524       0.1086       0.0185     0.000123
        6     94378455       0.1603       0.0324     0.000147
        7    111619843       0.2213       0.0515     0.000172
        8    121004121       0.2876       0.0753     0.000196
        9    121485640       0.3540       0.1021     0.000221
       10    114863871       0.4169       0.1303     0.000245
       11    102955636       0.4732       0.1581     0.000270
       12     90036682       0.5225       0.1846     0.000295
       13     77830786       0.5651       0.2095     0.000319
       14     67734055       0.6021       0.2328     0.000344
       15     60347079       0.6351       0.2550     0.000368
       16     55038408       0.6653       0.2766     0.000393
       17     51238253       0.6933       0.2980     0.000417
       18     48005332       0.7196       0.3192     0.000442
       19     44818322       0.7441       0.3401     0.000466
       20     41478396       0.7668       0.3604     0.000491
       21     37879475       0.7875       0.3799     0.000515
       22     34138346       0.8062       0.3984     0.000540
       23     30243904       0.8227       0.4155     0.000564
       24     26619910       0.8373       0.4311     0.000589
       25     23277577       0.8500       0.4454     0.000614
       26     20353915       0.8612       0.4584     0.000638
       27     17818787       0.8709       0.4702     0.000663
       28     15673425       0.8795       0.4810     0.000687
       29     13896377       0.8871       0.4909     0.000712
       30     12482389       0.8939       0.5001     0.000736
       31     11268303       0.9001       0.5086     0.000761
       32     10247712       0.9057       0.5167     0.000785
skoren commented 2 years ago

Yes, this looks better as you know have a peak at the 8-9x range though that's still pretty low given you should have 30x+ coverage.

skoren commented 2 years ago

Idle, answered