marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Yet another "Read correction jobs failed". Memory related? #1875

Closed · lmolokin closed this 3 years ago

lmolokin commented 3 years ago

Hi, I'm attempting to correct very high-coverage amplicon reads generated on a MinION, and I can't figure out the specific reason the run keeps failing even after scouring the logs. I need to make sure all reads are corrected because we're looking for low-abundance variants in our reads, hence the high coverage.

The command I'm using is:

canu \
-correct \
-p $i \
-d ./canu/${i}_blasto \
genomeSize=1.7k \
minOverlapLength=1000 \
corOutCoverage=1000000 \
maxInputCoverage=1000000 \
correctedErrorRate=0.12 \
ovsMemory=20 \
corMemory=20 \
gridOptions="--partition=priority --qos=emfsl --mem-per-cpu=10g" \
-nanopore ./${i}_blasto_q14.fastq.gz

Based on what I've read in other threads here, it might be related to running out of memory, or to how changing certain settings affects the partitioning of reads. Similar correction runs with other deep-coverage amplicon datasets have finished successfully after some combination of restarting failed runs with more corMemory or ovsMemory, deleting the 2-correction folder, or deleting the entire run folder. So I'm trying to pin down exactly which settings I need when working with deep-coverage data like this.

I would greatly appreciate some help figuring out which parameters to probe. Thanks!

Relevant logs and outputs below.

Latest canu.out

Found perl:
   /usr/bin/perl
   This is perl 5, version 16, subversion 3 (v5.16.3) built for x86_64-linux-thread-multi

Found java:
   /software/7/apps/java/1.8.0_121/bin/java
   java version "1.8.0_121"

Found canu:
   /lustre/project/software/7/apps/canu/2.1/bin/canu
   canu 2.1

-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_121' (from 'java') with -d64 support.
-- Detected gnuplot version '4.6 patchlevel 2   ' (from 'gnuplot') and image format 'png'.
-- Detected 72 CPUs and 377 gigabytes of memory.
-- Detected Slurm with 'sinfo' binary in /usr/bin/sinfo.
-- Detected Slurm with task IDs up to 10000 allowed.
-- 
-- Found   4 hosts with  80 cores and 1493 GB memory under Slurm control.
-- Found  43 hosts with  40 cores and  122 GB memory under Slurm control.
-- Found   4 hosts with 120 cores and 1493 GB memory under Slurm control.
-- Found  92 hosts with  72 cores and  371 GB memory under Slurm control.
-- Found   2 hosts with  80 cores and  744 GB memory under Slurm control.
-- Found   5 hosts with  80 cores and 1509 GB memory under Slurm control.
-- Found   1 host  with  72 cores and  369 GB memory under Slurm control.
-- Found   2 hosts with  80 cores and  753 GB memory under Slurm control.
--
--                         (tag)Threads
--                (tag)Memory         |
--        (tag)             |         |  algorithm
--        -------  ----------  --------  -----------------------------
-- Grid:  meryl     12.000 GB    4 CPUs  (k-mer counting)
-- Grid:  hap        8.000 GB    4 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap    6.000 GB    8 CPUs  (overlap detection with mhap)
-- Grid:  obtovl     4.000 GB    8 CPUs  (overlap detection)
-- Grid:  utgovl     4.000 GB    8 CPUs  (overlap detection)
-- Grid:  cor       20.000 GB    4 CPUs  (read correction)
-- Grid:  ovb        4.000 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs       20.000 GB    1 CPU   (overlap store sorting)
-- Grid:  red       12.000 GB    4 CPUs  (read error detection)
-- Grid:  oea        8.000 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat       16.000 GB    4 CPUs  (contig construction with bogart)
-- Grid:  cns        -.--- GB    4 CPUs  (consensus)
--
-- In 's168.seqStore', found Nanopore reads:
--   Nanopore:                 1
--
--   Raw:                      1
--
-- Generating assembly 's168' in '/90daydata/giardia_wgs/canu/s168_blasto':
--    - only correct raw reads.
--
-- Parameters:
--
--  genomeSize        1700
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.3200 ( 32.00%)
--    obtOvlErrorRate 0.1200 ( 12.00%)
--    utgOvlErrorRate 0.1200 ( 12.00%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.5000 ( 50.00%)
--    obtErrorRate    0.1200 ( 12.00%)
--    utgErrorRate    0.1200 ( 12.00%)
--    cnsErrorRate    0.1200 ( 12.00%)
--
--
-- BEGIN CORRECTION
--
-- Set corMinCoverage=4 based on read coverage of 109554.
--
-- Read correction jobs failed, tried 2 times, giving up.
--   job 2-correction/results/0001.cns FAILED.
--

ABORT:
ABORT: canu 2.1
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:

Abbreviated contents of 2-correction/results/0001.err showing the falconsense error:

cat results/0001.err
-- Opening seqStore '../../s168.seqStore'.
sqCache: found 103456 raw reads with 186243144 bases.
-- Opening corStore '../s168.corStore' version 1.
-- Loading list of reads to process from './s168.readsToCorrect'.
    read    read evidence     corrected
      ID  length    reads       regions
-------- ------- -------- ------------- ...
       3    1800      104      0-1799  (  1799) memory act   15306752 est  543781856 act/est 2.81
       8    1809        4     10-1779  (  1778) memory act          0 est  540654572 act/est 0.00
      10    1803       50      0-1786  (  1790) memory act    1843200 est  542080334 act/est 0.34
      14    1792       39      0-1789  (  1779) memory act          0 est  541621346 act/est 0.00
      16    1793       90      0-1793  (  1789) memory act    1081344 est  543255722 act/est 0.20
      20    1842        4      0-1827  (  1826) memory act          0 est  540953426 act/est 0.00
      24    1794       20      0-1787  (  1781) memory act          0 est  541031726 act/est 0.00
      29    1793       42      0-1799  (  1792) memory act          0 est  541726316 act/est 0.00
      45    1812      114      0-1790  (  1799) memory act    1114112 est  544240514 act/est 0.20
      49    1790        7      0-1774  (  1773) memory act          0 est  540577988 act/est 0.00
      53    1797        5      3-1782  (  1780) memory act          0 est  540578072 act/est 0.00
      58    1852       46      0-1845  (  1837) memory act          0 est  542410592 act/est 0.00
      61    1900        8      0-1822  (  1866) memory act          0 est  541607576 act/est 0.00
      65    1795        9      4-1779  (  1779) memory act          0 est  540686768 act/est 0.00
      68    1791        5      0-1776  (  1778) memory act          0 est  540523838 act/est 0.00
      73    1853       10      0-1834  (  1837) memory act          0 est  541245122 act/est 0.00
      75    1809       11      0-1780  (  1796) memory act          0 est  540881156 act/est 0.00
      76    1802        6      8-1775  (  1772) memory act          0 est  540654326 act/est 0.00
      83    1838       28      0-1834  (  1825) memory act          0 est  541702448 act/est 0.00
      88    1791       70      0-1788  (  1778) memory act          0 est  542602550 act/est 0.00
      91    1789       16      0-1781  (  1776) memory act          0 est  540856922 act/est 0.00
      95    1847       23      0-1833  (  1831) memory act          0 est  541618874 act/est 0.00
     102    1853     4619      0-1845  (  1852) memory act  133693440 est  691679582 act/est 19.33
     109    1829        4     16-1813  (  1788) memory act          0 est  540835010 act/est 0.00
     116    1834       53      0-1828  (  1821) memory act          0 est  542483582 act/est 0.00
     123    1778       54      0-1787  (  1765) memory act          0 est  541960226 act/est 0.00
     124    1786        4      8-1777  (  1764) memory act          0 est  540446630 act/est 0.00
     125    1786        5      0-1782  (  1770) memory act          0 est  540478508 act/est 0.00
     130    1788      456      0-1787  (  1775) memory act          0 est  554893994 act/est 0.00
     131    1799      183      0-1800  (  1798) memory act          0 est  546309710 act/est 0.00
     140    1789        4     11-1779  (  1765) memory act          0 est  540473504 act/est 0.00
     142    1788        4      0-1734  (  1759) memory act          0 est  540464582 act/est 0.00
     143    1806       13      0-1780  (  1788) memory act          0 est  540915842 act/est 0.00
     147    1781       53      0-1782  (  1768) memory act          0 est  541954700 act/est 0.00
     159    1836        4     10-1771  (  1766) memory act          0 est  540897194 act/est 0.00
     162    1804      155      0-1787  (  1791) memory act          0 est  545470916 act/est 0.00
     163    1805        5      3-1784  (  1788) memory act          0 est  540650816 act/est 0.00
     164    1798        5      0-1779  (  1785) memory act          0 est  540587408 act/est 0.00
     165    1801       47      0-1787  (  1790) memory act          0 est  541955642 act/est 0.00
     166    1848       21      0-1830  (  1832) memory act          0 est  541556822 act/est 0.00
     177    1792        7      0-1784  (  1779) memory act          0 est  540596894 act/est 0.00
     185    1846       15      0-1829  (  1833) memory act          0 est  541351400 act/est 0.00
     190    1784       78      0-1785  (  1771) memory act          0 est  542783312 act/est 0.00
     199    1783       90      0-1782  (  1770) memory act          0 est  543155270 act/est 0.00

# Data cut to reduce size of this post

   14346    1802       85      0-1786  (  1789) memory act          0 est  543194522 act/est 0.00
   14359    1800       97      0-1786  (  1787) memory act          0 est  543561608 act/est 0.00
   14371    1830        4     11-1761  (  1752) memory act          0 est  540842780 act/est 0.00
   14375    1724        9      0-1687  (  1708) memory act          0 est  540038546 act/est 0.00
   14379    1807       69      0-1785  (  1794) memory act          0 est  542733026 act/est 0.00
   14385    1784      807      0-1789  (  1771) memory act          0 est  565983512 act/est 0.00
   14387    1846       10      0-1835  (  1821) memory act          0 est  541185566 act/est 0.00
   14388    1803        7      0-1782  (  1790) memory act          0 est  540696584 act/est 0.00
 falconsense: correction/falconConsensus.C:284: falconData* falconConsensus::getConsensus(uint32, alignTagList**, uint32): Assertion `0 <= fd->eqv[fd->len]' failed.
  14389    1815      503      0-1800  (  1814) memory act          0 est  556876496 act/est 0.00
   14394    1803      252      0-1798  (  1789) memory act          0 est  548570702 act/est 0.00
   14397    1789       24      0-1780  (  1776) memory act          0 est  541110362 act/est 0.00
   14403    1794     7146      0-1800  (  1793) memory act          0 est  769164374 act/est 0.00
   14407    1824        8      0-1811  (  1803) memory act          0 est  540919028 act/est 0.00
   14419    1785       11      0-1782  (  1772) memory act          0 est  540660278 act/est 0.00
   14427    1784        7      4-1782  (  1765) memory act          0 est  540523322 act/est 0.00
   14432    1788       53      0-1781  (  1775) memory act          0 est  542027342 act/est 0.00
   14436    1842       49      0-1821  (  1816) memory act          0 est  542408042 act/est 0.00
   14439    1801       33      0-1784  (  1788) memory act          0 est  541514660 act/est 0.00
   14445    1783       16      0-1777  (  1770) memory act          0 est  540801284 act/est 0.00
   14447    1818       35      0-1797  (  1786) memory act          0 est  541725032 act/est 0.00
   14448    1793        6      0-1774  (  1780) memory act          0 est  540574190 act/est 0.00
   14449    1797       30      0-1787  (  1784) memory act          0 est  541379522 act/est 0.00
   14454    1781        5      9-1764  (  1747) memory act          0 est  540432080 act/est 0.00
   14457    1790       14      0-1776  (  1777) memory act          0 est  540802250 act/est 0.00
   14460    1795      189      0-1784  (  1782) memory act          0 est  546454724 act/est 0.00
   14463    1791        8      0-1782  (  1778) memory act          0 est  540620048 act/est 0.00
   14466    1847       97      0-1832  (  1834) memory act          0 est  544057460 act/est 0.00
   14469    1839       24      0-1831  (  1826) memory act          0 est  541568684 act/est 0.00
   14473    1811      300      0-1794  (  1790) memory act          0 est  550151450 act/est 0.00
   14474    1788     3423      0-1799  (  1787) memory act          0 est  649548884 act/est 0.00
   14488    1788       48      0-1783  (  1775) memory act          0 est  541861238 act/est 0.00
   14489    1795       41      0-1783  (  1782) memory act          0 est  541713182 act/est 0.00
   14490    1801       28      0-1785  (  1788) memory act          0 est  541350770 act/est 0.00
   14493    1781       26      0-1790  (  1780) memory act          0 est  541100618 act/est 0.00
   14505    1789        5     14-1769  (  1752) memory act          0 est  540503978 act/est 0.00
   14511    1841        6      0-1782  (  1795) memory act          0 est  541008242 act/est 0.00
   14521    1800       12      0-1785  (  1787) memory act          0 est  540830720 act/est 0.00
   14530    1787      529      0-1789  (  1774) memory act          0 est  557197262 act/est 0.00
   14535    1802       31      0-1797  (  1789) memory act          0 est  541459376 act/est 0.00
   14543    1782       18      0-1783  (  1769) memory act          0 est  540855938 act/est 0.00
   14555    1799        6      5-1778  (  1779) memory act          0 est  540628028 act/est 0.00
   14559    1788        4     25-1779  (  1751) memory act          0 est  540464456 act/est 0.00
   14562    1795      552      0-1800  (  1794) memory act          0 est  558079754 act/est 0.00
   14565    1789       41      0-1801  (  1788) memory act          0 est  541655780 act/est 0.00
   14567    1808       28      0-1785  (  1788) memory act          0 est  541412648 act/est 0.00
   14573    1853        8      7-1838  (  1833) memory act          0 est  541183886 act/est 0.00

Latest job file 2-correction/correctReads.5412718_1.out:

Found perl:
   /usr/bin/perl
   This is perl 5, version 16, subversion 3 (v5.16.3) built for x86_64-linux-thread-multi

Found java:
   /software/7/apps/java/1.8.0_121/bin/java
   java version "1.8.0_121"

Found canu:
   /lustre/project/software/7/apps/canu/2.1/bin/canu
   canu 2.1

Running job 1 based on SLURM_ARRAY_TASK_ID=1 and offset=0.
/var/spool/slurmd/job5412718/slurm_script: line 115: 146034 Aborted                 (core dumped) $bin/falconsense -S $seqStore -C ../s168.corStore -R ./s168.readsToCorrect -r $bgnid-$endid -t 4 -cc 4 -cl 1000 -oi 0.5 -ol 1000 -p ./results/$jobid.WORKING -cns > ./results/$jobid.err 2>&1

Aleksey

skoren commented 3 years ago

The only other time we've seen this same error was in issue #1732, and that was due to a bug with POSIX file naming (which was removed in Canu 2.1). So the first question is: how was canu built/installed on your system? And are you able to share the input data so we can reproduce the error locally?

lmolokin commented 3 years ago

Not sure how they installed canu on the cluster that I'm using. Is there a quick way to find out? I could reach out to the admins if needed.

The input file is ~150 MB. What's the best way to share it?

brianwalenz commented 3 years ago

Hold off on contacting admins for now.

gzip -9 it and upload via instructions at https://canu.readthedocs.io/en/latest/faq.html#how-can-i-send-data-to-you.

lmolokin commented 3 years ago

Just uploaded (s168.fastq.gz)

brianwalenz commented 3 years ago

Thanks for the reads. Unfortunately, I haven't been able to reproduce the crash. I'm a little hesitant to 'request' that you upload your entire assembly directory s168_blasto as mine is almost 300 GB. If you want to upload it over the weekend, go ahead (gzip -1 will work well).

To increase the parallelization of the mhap step, I decreased the mhapBlockSize parameter from the default of 3000 to 50. This gave me about 800 (!) mhap jobs, which ran in about 30 minutes on our grid. With the default value, mhap looks like it will take about a day and a half; it's still running.
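
In case it's useful, here is roughly what that looks like on the command line; this is a trimmed sketch of the canu call from the original post with mhapBlockSize=50 added (the memory and grid options are omitted here and can stay as they were):

canu \
-correct \
-p $i \
-d ./canu/${i}_blasto \
genomeSize=1.7k \
minOverlapLength=1000 \
corOutCoverage=1000000 \
maxInputCoverage=1000000 \
correctedErrorRate=0.12 \
mhapBlockSize=50 \
-nanopore ./${i}_blasto_q14.fastq.gz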

I have a guess at what is going wrong. Limiting input coverage to 50,000 or so should prevent the problem. Splitting this input into two pieces and running each separately should work, hopefully without hurting sensitivity too much.
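
For the splitting approach, something like the following should work (a minimal sketch, assuming a standard 4-line-per-record FASTQ and the file name from this issue; each half is then run through the same canu -correct command as before):

# count lines, then split roughly in half on whole-read (4-line) boundaries
total=$(zcat s168.fastq.gz | wc -l)
half=$(( (total / 4 / 2 + 1) * 4 ))
zcat s168.fastq.gz | split -l "$half" - s168.part_    # writes s168.part_aa and s168.part_ab
gzip s168.part_aa s168.part_ab

# then run canu -correct on s168.part_aa.gz and s168.part_ab.gz separately;
# alternatively, keep one input and cap the coverage with maxInputCoverage=50000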

Drop me an email. The corrected reads I got are only 1.4 MB when gzip -9'd -- about 39 Mbp -- and I can email them back.

brianwalenz commented 3 years ago

Awesome, I can reproduce it. I don't need your assembly directory.

lmolokin commented 3 years ago

I will keep the mhapBlockSize parameter in mind for speeding things up and will give the read-splitting strategy a try. Glad to hear you reproduced it! Any more insight into the cause of the crash?

Thanks!

brianwalenz commented 3 years ago

A little bit. Read 14575 is using around 68,000 other reads as evidence, which is causing a per-base quality score to overflow. As to why it has so many evidence reads, I don't know. Most other reads have tens to hundreds of evidence reads; it's the third column in the log you shared.

Ah! There is a workaround!

File readsToCorrect is a list of the reads that we want to correct. It's just an ASCII list of read IDs. File .readsToCorrect.log has some details on the expected outcome of each read. In particular, it tells how many evidence reads will be used ("numOlaps").

readID          numOlaps    origLength    corrLength        memory  used
---------- ------------- ------------- ------------- ------------- -----
1                      0          1804             0     540480704 e - -
2                      2          1796             0     540473084 e - -
3                    104          1800          1799     543781856 e c -
4                      0          1784             0     540301184 e - -
...
14573                  8          1853          1833     541183886 e c -
14574                  0          1793             0     540381968 e - -   <- 'e' means it is used as evidence for correcting some other read
14575              68203          1795          1794    2723950022 e c -   <- 'c' means it is corrected
14576                  0          1789             0     540346064 e - -
14577                  0          1791             0     540364016 e - -
...

The workaround is to sort this file on the second column (sort -k2nr *.readsToCorrect.log | less) and remove, from the list in readsToCorrect, any read that has more than 60,000 overlaps.
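
One way to do that filtering from the shell (a minimal sketch, run inside the 2-correction directory; the file names are the ones from this run and the 60,000 cutoff is the value suggested above):

# collect IDs of reads with more than 60,000 evidence overlaps (column 2),
# skipping the header and separator lines
awk '$1 ~ /^[0-9]+$/ && $2 > 60000 { print $1 }' s168.readsToCorrect.log > too-deep.ids

# remove those IDs from the plain list of reads to correct, keeping a backup
grep -v -x -F -f too-deep.ids s168.readsToCorrect > s168.readsToCorrect.filtered
mv s168.readsToCorrect s168.readsToCorrect.orig
mv s168.readsToCorrect.filtered s168.readsToCorrect

# then restart canu with the same command to resume the correction jobs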

There are quite a few 'deep' reads in this set:

> sort -k2nr test.readsToCorrect.log | head -n 20
14575              68203          1795          1794    2723950022 e c -
22039              61696          1865          1831    2525740766 e c -
77550              53235          1798          1797    2247845918 e c -
47565              48698          1801          1800    2105025098 e c -
8965               45274          1815          1814    2006596496 e c -
13378              44039          1798          1797    1952919230 e c -
71966              43641          1805          1804    1946143178 e c -
63099              40035          1848          1847    1834859900 e c -
23709              39484          1798          1797    1806636200 e c -
53823              38715          1805          1804    1787273234 e c -
85274              36669          1797          1796    1715976956 e c -
84097              35358          1796          1795    1673154062 e c -
79025              35133          1845          1832    1670254526 e c -
67161              30966          1794          1793    1531552286 e c -
1079               30396          1800          1799    1516679714 e c -
7213               29968          1797          1796    1501051322 e c -
89926              29780          1844          1831    1502931266 e c -
8270               29357          1800          1799    1483548266 e c -
39577              28860          1789          1788    1461747542 e c -
25417              26031          1862          1851    1379117726 e c -

lmolokin commented 3 years ago

Can confirm that splitting the reads in two results in successful runs.

skoren commented 3 years ago

Fixed in tip; the workaround is to split very deep data into multiple partitions.