marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

massive .part file = failed consensus #2083

Closed shaferab closed 2 years ago

shaferab commented 2 years ago

Hi, I have ~45x HiFi data (2.7 Gbp genome). I use the grid option but often run individual sbatch scripts due to some incompatibilities with slurm (i.e. submitting 4000 jobs or OOM events). It's worked well and I'm on the consensus phase; however, one of my partition files is 38G, more than double the next closest. This variation has led me to run individual consensus jobs for 57 arrays. This particular file is getting OOMs but is now running at 32 CPUs @ 8G per CPU.

I guess my question is: does this seem out of the ordinary? Also, it's not clear to me whether I can pick up from XXX.cns.WORKING if this times out. If so, is that done from the general canu script or from the one specific to the module (below)? Much appreciated.

  sbatch \
  --cpus-per-task=32 --mem-per-cpu=8G --account=rrg-shaferab --time=1-23:59:00 -o consensus.%A_%a.out \
  -D `pwd` -J "cns_Sc" \
  -a 52 \
  `pwd`/consensus.sh 0 \
> ./consensus.jobSubmit-01.out 2>&1
skoren commented 2 years ago

It's strange to have 4000 jobs for such a small genome and low coverage. Can you post the report file to see what the data/assembly look like?

Consensus typically does have one larger partition, because one partition gets all the small contigs, which run faster individually. The steps are all atomic, so if it fails you'd have to restart the consensus from scratch for that job. You can modify the partitioning parameters to create more jobs, but then you'd have to re-run all the partitions. I'd let the job run and see what progress it is making.

shaferab commented 2 years ago

Thanks, yeah, I have been monitoring progress and tasks; I'm confident I'll get 56/57 finished.

I should clarify perhaps; canu submitted 4300 jobs for the overlapper phase - is that more common? I had previously trimmed the reads. Here is the report.

[shaferab@gra-login1 Sc-pacbio_2nodes]$ cat Sc.report 
[TRIMMING/READS]
--
-- In sequence store './Sc.seqStore':
--   Found 27866658 reads.
--   Found 526762853342 bases (195.09 times coverage).
--    Histogram of corrected reads:
--    
--    G=526762853342                     sum of  ||               length     num
--    NG         length     index       lengths  ||                range    seqs
--    ----- ------------ --------- ------------  ||  ------------------- -------
--    00010       146418    293178  52676307819  ||       1156-12583     9973556|--------------------------------------------
--    00020        87797    754679 105352599017  ||      12584-24011     14514741---------------------------------------------------------------
--    00030        43464   1620381 158028877955  ||      24012-35439     1327489|------
--    00040        24485   3282985 210705147645  ||      35440-46867      557061|---
--    00050        18677   5812040 263381441119  ||      46868-58295      302656|--
--    00060        16309   8845159 316057715387  ||      58296-69723      204114|-
--    00070        14665  12257266 368734001794  ||      69724-81151      155621|-
--    00080        13262  16034317 421410284641  ||      81152-92579      127033|-
--    00090        11513  20264116 474086572273  ||      92580-104007     108309|-
--    00100         1156  27866657 526762853342  ||     104008-115435      94085|-
--    001.000x            27866658 526762853342  ||     115436-126863      83503|-
--                                               ||     126864-138291      75657|-
--                                               ||     138292-149719      69024|-
--                                               ||     149720-161147      63564|-
--                                               ||     161148-172575      55134|-
--                                               ||     172576-184003      46713|-
--                                               ||     184004-195431      37359|-
--                                               ||     195432-206859      28051|-
--                                               ||     206860-218287      19543|-
--                                               ||     218288-229715      11511|-
--                                               ||     229716-241143       5946|-
--                                               ||     241144-252571       2667|-
--                                               ||     252572-263999       1206|-
--                                               ||     264000-275427        614|-
--                                               ||     275428-286855        388|-
--                                               ||     286856-298283        308|-
--                                               ||     298284-309711        242|-
--                                               ||     309712-321139        160|-
--                                               ||     321140-332567        119|-
--                                               ||     332568-343995         81|-
--                                               ||     343996-355423         58|-
--                                               ||     355424-366851         34|-
--                                               ||     366852-378279         35|-
--                                               ||     378280-389707         19|-
--                                               ||     389708-401135         12|-
--                                               ||     401136-412563         15|-
--                                               ||     412564-423991         10|-
--                                               ||     423992-435419          5|-
--                                               ||     435420-446847          4|-
--                                               ||     446848-458275          2|-
--                                               ||     458276-469703          2|-
--                                               ||     469704-481131          1|-
--                                               ||     481132-492559          2|-
--                                               ||     492560-503987          0|
--                                               ||     503988-515415          3|-
--                                               ||     515416-526843          0|
--                                               ||     526844-538271          0|
--                                               ||     538272-549699          0|
--                                               ||     549700-561127          0|
--                                               ||     561128-572555          1|-
--

[UNITIGGING/READS]
--
-- In sequence store './Sc.seqStore':
--   Found 27866658 reads.
--   Found 526762853342 bases (195.09 times coverage).
--    Histogram of corrected-trimmed reads:
--    
--    G=526762853342                     sum of  ||               length     num
--    NG         length     index       lengths  ||                range    seqs
--    ----- ------------ --------- ------------  ||  ------------------- -------
--    00010       146418    293178  52676307819  ||       1156-12583     9973556|--------------------------------------------
--    00020        87797    754679 105352599017  ||      12584-24011     14514741---------------------------------------------------------------
--    00030        43464   1620381 158028877955  ||      24012-35439     1327489|------
--    00040        24485   3282985 210705147645  ||      35440-46867      557061|---
--    00050        18677   5812040 263381441119  ||      46868-58295      302656|--
--    00060        16309   8845159 316057715387  ||      58296-69723      204114|-
--    00070        14665  12257266 368734001794  ||      69724-81151      155621|-
--    00080        13262  16034317 421410284641  ||      81152-92579      127033|-
--    00090        11513  20264116 474086572273  ||      92580-104007     108309|-
--    00100         1156  27866657 526762853342  ||     104008-115435      94085|-
--    001.000x            27866658 526762853342  ||     115436-126863      83503|-
--                                               ||     126864-138291      75657|-
--                                               ||     138292-149719      69024|-
--                                               ||     149720-161147      63564|-
--                                               ||     161148-172575      55134|-
--                                               ||     172576-184003      46713|-
--                                               ||     184004-195431      37359|-
--                                               ||     195432-206859      28051|-
--                                               ||     206860-218287      19543|-
--                                               ||     218288-229715      11511|-
--                                               ||     229716-241143       5946|-
--                                               ||     241144-252571       2667|-
--                                               ||     252572-263999       1206|-
--                                               ||     264000-275427        614|-
--                                               ||     275428-286855        388|-
--                                               ||     286856-298283        308|-
--                                               ||     298284-309711        242|-
--                                               ||     309712-321139        160|-
--                                               ||     321140-332567        119|-
--                                               ||     332568-343995         81|-
--                                               ||     343996-355423         58|-
--                                               ||     355424-366851         34|-
--                                               ||     366852-378279         35|-
--                                               ||     378280-389707         19|-
--                                               ||     389708-401135         12|-
--                                               ||     401136-412563         15|-
--                                               ||     412564-423991         10|-
--                                               ||     423992-435419          5|-
--                                               ||     435420-446847          4|-
--                                               ||     446848-458275          2|-
--                                               ||     458276-469703          2|-
--                                               ||     469704-481131          1|-
--                                               ||     481132-492559          2|-
--                                               ||     492560-503987          0|
--                                               ||     503988-515415          3|-
--                                               ||     515416-526843          0|
--                                               ||     526844-538271          0|
--                                               ||     538272-549699          0|
--                                               ||     549700-561127          0|
--                                               ||     561128-572555          1|-
--

[UNITIGGING/MERS]
--
--  22-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1         0                                                                        0.0000 0.0000
--       2-     2 2885432347 ********************************************************               0.2116 0.0191
--       3-     4 3543949771 ********************************************************************** 0.3629 0.0396
--       5-     7 2515499362 *************************************************                      0.5508 0.0771
--       8-    11 1466841065 ****************************                                           0.6916 0.1205
--      12-    16 858693360 ****************                                                       0.7800 0.1615
--      17-    22 504164170 *********                                                              0.8345 0.1977
--      23-    29 301191793 *****                                                                  0.8674 0.2276
--      30-    37 204394007 ****                                                                   0.8877 0.2520
--      38-    46 162019995 ***                                                                    0.9019 0.2740
--      47-    56 155273537 ***                                                                    0.9134 0.2963
--      57-    67 194683497 ***                                                                    0.9248 0.3233
--      68-    79 228981723 ****                                                                   0.9393 0.3647
--      80-    92 181699269 ***                                                                    0.9560 0.4205
--      93-   106 105088746 **                                                                     0.9688 0.4704
--     107-   121  64620907 *                                                                      0.9761 0.5035
--     122-   137  51994552 *                                                                      0.9807 0.5273
--     138-   154  45780784                                                                        0.9845 0.5495
--     155-   172  34479164                                                                        0.9878 0.5714
--     173-   191  23918763                                                                        0.9903 0.5896
--     192-   211  18112832                                                                        0.9920 0.6038
--     212-   232  14311213                                                                        0.9933 0.6157
--     233-   254  11004995                                                                        0.9943 0.6261
--     255-   277   8590246                                                                        0.9951 0.6349
--     278-   301   6887950                                                                        0.9958 0.6423
--     302-   326   5556928                                                                        0.9963 0.6489
--     327-   352   4531961                                                                        0.9967 0.6546
--     353-   379   3754143                                                                        0.9970 0.6597
--     380-   407   3139690                                                                        0.9973 0.6642
--     408-   436   2666075                                                                        0.9975 0.6682
--     437-   466   2287324                                                                        0.9977 0.6719
--     467-   497   1996071                                                                        0.9978 0.6753
--     498-   529   1760775                                                                        0.9980 0.6785
--     530-   562   1563917                                                                        0.9981 0.6815
--     563-   596   1398309                                                                        0.9982 0.6843
--     597-   631   1246876                                                                        0.9983 0.6870
--     632-   667   1117426                                                                        0.9984 0.6895
--     668-   704   1000212                                                                        0.9985 0.6919
--     705-   742    906108                                                                        0.9986 0.6942
--     743-   781    831236                                                                        0.9987 0.6963
--     782-   821    769439                                                                        0.9987 0.6984
--
--           0 (max occurrences)
-- 302249381021 (total mers, non-unique)
-- 13638973170 (distinct mers, non-unique)
--           0 (unique mers)

[UNITIGGING/OVERLAPS]
--   category            reads     %          read length        feature size or coverage  analysis
--   ----------------  -------  -------  ----------------------  ------------------------  --------------------
--   middle-missing     518623    1.86    16285.48 +- 19560.75      5878.19 +- 8552.03    (bad trimming)
--   middle-hump         95635    0.34    54461.70 +- 34489.89     45865.76 +- 33997.00   (bad trimming)
--   no-5-prime        1337085    4.80    11999.74 +- 14697.89      7278.59 +- 9648.14    (bad trimming)
--   no-3-prime        1203484    4.32    12992.64 +- 14319.36      8560.03 +- 10656.44   (bad trimming)
--   
--   low-coverage      6691132   24.01    11068.68 +- 7303.62         37.94 +- 15.70      (easy to assemble, potential for lower quality consensus)
--   unique             298216    1.07    10139.35 +- 3018.58        144.16 +- 63.18      (easy to assemble, perfect, yay)
--   repeat-cont        507529    1.82    10329.98 +- 1852.19       1144.77 +- 803.38     (potential for consensus errors, no impact on assembly)
--   repeat-dove          5287    0.02    16040.81 +- 1680.89        922.50 +- 590.18     (hard to assemble, likely won't assemble correctly or even at all)
--   
--   span-repeat       1652304    5.93    10899.87 +- 4466.47       3641.59 +- 3372.03    (read spans a large repeat, usually easy to assemble)
--   uniq-repeat-cont  1441999    5.17    10280.91 +- 3428.68                             (should be uniquely placed, low potential for consensus errors, no impact on assembly)
--   uniq-repeat-dove   199969    0.72    13515.45 +- 3845.77                             (will end contigs, potential to misassemble)
--   uniq-anchor         58792    0.21    11129.65 +- 2126.80       3941.75 +- 2525.74    (repeat read, with unique section, probable bad read)

[UNITIGGING/ADJUSTMENT]
-- No report available.

[UNITIGGING/ERROR RATES]
--  
--  ERROR RATES
--  -----------
--                                                   --------threshold------
--  12510315                     fraction error      fraction        percent
--  samples                              (1e-5)         error          error
--                   --------------------------      --------       --------
--  command line (-eg)                           ->   3500.00        3.5000%  (enabled)
--  command line (-ef)                           ->  -----.--      ---.----%
--  command line (-eM)                           ->   3500.00        3.5000%  (enabled)
--  mean + std.dev     358.55 +-  12 *   816.78  ->  10159.91       10.1599%
--  median + mad         0.00 +-  12 *     0.00  ->      0.00        0.0000%
--  90th percentile                              ->   1531.00        1.5310%
--  
--  BEST EDGE FILTERING
--  -------------------
--  At graph threshold 3.5000%, reads:
--    available to have edges:     19514378
--    with at least one edge:       2149552
--  
--  At max threshold 3.5000%, reads:
--    available to have edges:     19514378
--    with at least one edge:       2149552
--  
--  At tight threshold 1.5310%, reads with:
--    both edges below error threshold:   1319401  (80.00% minReadsBest threshold = 1719641)
--    one  edge  above error threshold:    631796
--    both edges above error threshold:    198355
--    at least one edge:                  2149552
--  
--  At loose threshold 3.5000%, reads with:
--    both edges below error threshold:   2149552  (80.00% minReadsBest threshold = 1719641)
--    one  edge  above error threshold:         0
--    both edges above error threshold:         0
--    at least one edge:                  2149552
--  
--  
--  INITIAL EDGES
--  -------- ----------------------------------------
--   8280442 reads are contained
--  17963337 reads have no best edges (singleton)
--     18305 reads have only one best edge (spur) 
--               8741 are mutual best
--   1604574 reads have two best edges 
--             141631 have one mutual best edge
--            1411337 have two mutual best edges
--  
--  
--  FINAL EDGES
--  -------- ----------------------------------------
--   8280442 reads are contained
--  17983620 reads have no best edges (singleton)
--     15224 reads have only one best edge (spur) 
--               8918 are mutual best
--   1587372 reads have two best edges 
--             134864 have one mutual best edge
--            1411110 have two mutual best edges
--  
--  
--  EDGE FILTERING
--  -------- ------------------------------------------
--         0 reads are ignored
--    922549 reads have a gap in overlap coverage
--     11092 reads have lopsided best edges

[UNITIGGING/CONTIGS]
-- Found, in version 1, after unitig construction:
--   contigs:      35593 sequences, total length 2784828001 bp (including 34946 repeats of total length 813887353 bp).
--   bubbles:      35145 sequences, total length 848801757 bp.
--   unassembled:  18178076 sequences, total length 220604741407 bp.
--
-- Contig sizes based on genome size 2.7gbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10     1153548           155   270770228
--     20      752251           450   540711693
--     30      533999           878   810266116
--     40      381898          1478  1080015113
--     50      261277          2337  1350257365
--     60      174190          3614  1620014777
--     70      117024          5519  1890102305
--     80       69357          8480  2160025284
--     90       28037         15121  2430008182
--    100       13722         28410  2700010540
skoren commented 2 years ago

The report says 195x not 45x that you had mentioned originally. That doesn't look like HiFi data though, it's way too long and there's almost no peak in the histogram. The assembly is also quite fragmented. It also looks like you're running with higher error rate and coverage than the defaults for HiFi data, what's your full command?

shaferab commented 2 years ago

Okay. Here is the data output from the sequencing facility:

Name                   HiFi_Reads  HIFi_BP         HiFi_mean_len
SC7_HTpool_42PM_CELL2  1,316,217   19,255,575,399  14,629
SC7_HTpool_50PM_CELL1  1,339,709   19,554,913,662  14,596
SC7_HTpool_33PM_CELL3  1,780,469   26,160,968,655  14,693
SC7_HTpool_33PM_CELL4  1,810,309   26,597,809,556  14,692
SC7_HTpool_33PM_CELL5  1,654,212   24,236,775,031  14,651

Command below. I think we might have read in the CLR or additional data somehow. Sounds like I should restart from the beginning? Before that, here is the command:

  module load gcc/9.3.0
  module load canu/2.2

  canu \
    -p Sc -d Sc-pacbio_2nodes \
    genomeSize=2.7g \
    -pacbio-hifi HiFi_reads/trim-cell[1-5]* \
    correctedErrorRate=0.035 \
    utgOvlErrorRate=0.065 \
    trimReadsCoverage=2 \
    trimReadsOverlap=500 \
    -maxMemory=128G \
    -maxThreads=64 \
    -executiveMemory=16 \
    gridOptions="--account=rrg-shaferab --cpus-per-task=32 --time=05:59:00"

skoren commented 2 years ago

Yeah, I'd restart from the beginning. I think what you have is the Sequel IIe output, which contains one read per ZMW even if it's a CLR read. Filter the files for Q20 reads first. You also shouldn't need the correctedErrorRate=0.035 utgOvlErrorRate=0.065 trimReadsCoverage=2 trimReadsOverlap=500 parameters then.
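A quick arithmetic check is consistent with this diagnosis. The sketch below, in Python, compares the coverage implied by the facility's per-cell HiFi base counts with the coverage implied by the totals in Sc.report, both relative to genomeSize=2.7g. All figures are copied from this thread; the variable and function names are only illustrative.

```python
# Per-cell HiFi base counts reported by the sequencing facility (from this thread).
hifi_bases_per_cell = {
    "SC7_HTpool_42PM_CELL2": 19_255_575_399,
    "SC7_HTpool_50PM_CELL1": 19_554_913_662,
    "SC7_HTpool_33PM_CELL3": 26_160_968_655,
    "SC7_HTpool_33PM_CELL4": 26_597_809_556,
    "SC7_HTpool_33PM_CELL5": 24_236_775_031,
}

GENOME_SIZE = 2_700_000_000      # genomeSize=2.7g from the canu command
REPORT_BASES = 526_762_853_342   # "Found ... bases" in Sc.report


def coverage(total_bases: int, genome_size: int) -> float:
    """Fold coverage = total sequenced bases / genome size."""
    return total_bases / genome_size


expected = coverage(sum(hifi_bases_per_cell.values()), GENOME_SIZE)
observed = coverage(REPORT_BASES, GENOME_SIZE)

print(f"expected from facility HiFi totals: {expected:.2f}x")
print(f"observed in Sc.report:              {observed:.2f}x")
print(f"ratio observed / expected:          {observed / expected:.1f}")
```

The facility totals give roughly 43x, in line with the ~45x quoted at the top of the thread, while the sequence store holds several times more bases than that. This suggests that non-HiFi reads were loaded along with the HiFi ones.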

shaferab commented 2 years ago

For Q20 reads: that extracts reads that have an average Q>20, so I can use any QC program?

I much appreciate your quick and detailed feedback; I imagine this will run much smoother once I get rid of the garbage.

skoren commented 2 years ago

I've used the PacBio dataset command before: https://www.pacb.com/wp-content/uploads/SMRT_Tools_Reference_Guide_v10.1.pdf, see page 15. I assume other tools would work. If you have the BAMs for the cells, there should be a tag indicating QV for each read as well that you can use to filter.
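As a rough illustration of the tag-based filtering mentioned above, here is a minimal Python sketch that keeps only records whose per-read quality tag meets Q20. It assumes the tag is `rq` (a float holding predicted read accuracy, as in the PacBio BAM format) and that the BAM has already been converted to SAM text; the function names and example records are hypothetical. In practice a dedicated BAM tool would do this directly on the BAM file.

```python
import math

MIN_RQ = 0.99  # predicted accuracy of 99%, i.e. Q20


def rq_to_qv(rq: float) -> float:
    """Convert predicted accuracy (0..1) to a Phred-scaled quality value."""
    if rq >= 1.0:
        return float("inf")
    return -10.0 * math.log10(1.0 - rq)


def read_quality(sam_line: str):
    """Return the value of the rq:f: tag in a SAM record, or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    for tag in fields[11:]:  # optional tags start at column 12
        if tag.startswith("rq:f:"):
            return float(tag[len("rq:f:"):])
    return None


def filter_q20(sam_lines, min_rq: float = MIN_RQ):
    """Yield header lines plus records with rq >= min_rq; drop everything else."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line
            continue
        rq = read_quality(line)
        if rq is not None and rq >= min_rq:
            yield line


# Hypothetical unaligned records: one HiFi-quality read, one low-accuracy
# read, and one with no rq tag at all.
example = [
    "@HD\tVN:1.5\tSO:unknown",
    "m1/1/ccs\t4\t*\t0\t255\t*\t*\t0\t0\tACGT\tIIII\tnp:i:12\trq:f:0.999",
    "m1/2/ccs\t4\t*\t0\t255\t*\t*\t0\t0\tACGT\t!!!!\tnp:i:1\trq:f:0.85",
    "m1/3/ccs\t4\t*\t0\t255\t*\t*\t0\t0\tACGT\tIIII",
]
kept = list(filter_q20(example))
```

Records without the tag are dropped here as a conservative choice; whether that is appropriate depends on how the files were produced.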

skoren commented 2 years ago

Idle, input was not HiFi data and likely needed filtering.