marbl / verkko

Telomere-to-telomere assembly of accurate long reads (PacBio HiFi, Oxford Nanopore Duplex, HERRO corrected Oxford Nanopore Simplex) and Oxford Nanopore ultra-long reads.

`1-buildGraph`: terminate called after throwing an instance of 'std::bad_alloc' #38

Closed AndreaGuarracino closed 2 years ago

AndreaGuarracino commented 2 years ago

Hi, I am trying your promising pipeline with Rattus rattus HiFi and ONT data. I've got the following error during graph building:

tail 1-buildGraph/buildGraph.err 

try resolve k=14873, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=14874, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=14926, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=14927, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
25504 unitigs after resolving
Building unitig sequences
Reading sequences from ../0-correction/hifi-corrected.fasta
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
./buildGraph.sh: line 39: 93138 Aborted                 /gnu/store/gsxfh1hm7fs03b7j49kqgja66xdq69zr-mbg-1.0.8/bin/MBG $iopt -t 4 -k 1001 -r 15000 -w 100 --kmer-abundance 1 --unitig-abundance 2 --error-masking=collapse-msat --output-sequence-paths ../1-buildGraph/paths.gaf --out ../1-buildGraph/hifi-resolved.gfa

Is it a memory problem? The machine has ~160GB of free RAM available.

This is the directory content:

-rwxr-xr-x 1 andreag andreag  439 Jan 24 03:03 snakemake.sh
-rw-r--r-- 1 andreag andreag 2.7K Jan 24 03:03 verkko.yml

0-correction:
total 23G
-rw-r--r-- 1 andreag andreag  555 Jan 24 03:12 buildStore.err
-rwxr-xr-x 1 andreag andreag  391 Jan 24 03:03 buildStore.sh
-rw-r--r-- 1 andreag andreag  117 Jan 24 22:39 combineErrors.err
-rwxr-xr-x 1 andreag andreag  446 Jan 24 22:31 combineErrors.sh
-rw-r--r-- 1 andreag andreag  16K Jan 24 19:47 combineOverlaps.err
-rwxr-xr-x 1 andreag andreag  820 Jan 24 19:47 combineOverlaps.sh
-rw-r--r-- 1 andreag andreag    0 Jan 24 19:47 configureFindErrors.err
-rw-r--r-- 1 andreag andreag    0 Jan 24 19:47 configureFindErrors.finished
-rwxr-xr-x 1 andreag andreag  581 Jan 24 19:47 configureFindErrors.sh
-rw-r--r-- 1 andreag andreag 9.8K Jan 24 03:12 configureOverlaps.err
-rw-r--r-- 1 andreag andreag    0 Jan 24 03:12 configureOverlaps.finished
-rwxr-xr-x 1 andreag andreag  749 Jan 24 03:12 configureOverlaps.sh
-rw-r--r-- 1 andreag andreag  15K Jan 24 03:14 countKmers.err
-rwxr-xr-x 1 andreag andreag 1.7K Jan 24 03:12 countKmers.sh
drwxr-xr-x 2 andreag andreag  16K Jan 24 22:31 find-errors-jobs
-rw-r--r-- 1 andreag andreag  23G Jan 24 22:39 hifi-corrected.fasta
-rw-r--r-- 1 andreag andreag  27M Jan 24 03:14 hifi.ignoremers
drwxr-xr-x 2 andreag andreag 4.0K Jan 24 19:47 hifi.ovlStore
-rw-r--r-- 1 andreag andreag 4.7M Jan 24 19:47 hifi.ovlStore.config
drwxr-xr-x 2 andreag andreag 4.0K Jan 24 03:12 hifi.seqStore
-rw-r--r-- 1 andreag andreag 1.6K Jan 24 19:47 ovb-files
drwxr-xr-x 2 andreag andreag  20K Jan 24 19:47 overlap-jobs
-rw-r--r-- 1 andreag andreag  26M Jan 24 22:39 red.red

1-buildGraph:
total 516K
-rw-r--r-- 1 andreag andreag 508K Jan 24 23:21 buildGraph.err
-rwxr-xr-x 1 andreag andreag 1.5K Jan 24 22:39 buildGraph.sh

3-align:
total 12M
drwxr-xr-x 2 andreag andreag 4.0K Jan 24 03:21 split
-rw-r--r-- 1 andreag andreag  12M Jan 24 03:21 splitONT.err
-rwxr-xr-x 1 andreag andreag  332 Jan 24 03:03 splitONT.sh

I am also attaching the logs in case they help: 2022-01-24T030309.584137.snakemake.log, buildGraph.err.txt

brianwalenz commented 2 years ago

Those types of errors are typically caused by the application requesting memory and the operating system refusing the request. This could be either because the machine is out of memory (both physical RAM and swap) or because the process hit some policy limit. The ulimit -a command will show policy limits. Other than watching top as it runs, I don't know of any out-of-the-box way of monitoring memory usage. On the bright side, it looks like it ran in 40 minutes.
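As a rough alternative to watching top by hand, on Linux the kernel records per-process memory figures in /proc/<pid>/status. The sketch below reads one such field. The function name `vm_field_kb` is made up for illustration, and it assumes a Linux /proc filesystem.

```shell
#!/usr/bin/env bash
# Sketch: read one memory field (in kB) from a Linux /proc/<pid>/status file.
# VmHWM is the peak resident set size so far; VmRSS is the current one.
# vm_field_kb is an illustrative helper name, not part of verkko or MBG.
vm_field_kb() {
    local field="$1" status_file="$2"
    # Lines look like "VmHWM:     512000 kB"; print the numeric column.
    awk -v f="${field}:" '$1 == f { print $2 }' "$status_file"
}
```

While the job is running, something like `vm_field_kb VmHWM /proc/<pid>/status` (with the PID of the MBG process filled in) would report the peak so far. It has to be sampled before the process exits, since the file disappears with it.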

I sadly don't know this step intimately enough to offer any parameter tweaks (and everyone else is still warm and cozy in bed).

Is this public data? Can you point to it, if only to give us something else to run on?

AndreaGuarracino commented 2 years ago

Hi @brianwalenz, thank you for your quick reply.

Unfortunately, the data is not yet public. The ulimit -a output seems fine:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1031389
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1031389
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

I am not sure whether the overcommit handling mode could be the culprit here:

cat /proc/sys/vm/overcommit_memory
2

Indeed, from the Linux kernel documentation on overcommit accounting:

2   -   Don't overcommit. The total address space commit
        for the system is not permitted to exceed swap + a
        configurable amount (default is 50%) of physical RAM.
        Depending on the amount you use, in most situations
        this means a process will not be killed while accessing
        pages but will receive errors on memory allocation as
        appropriate.
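To see how close a machine is to that limit, /proc/meminfo exposes the computed limit as CommitLimit and the currently committed address space as Committed_AS, both in kB. The sketch below subtracts one from the other. The function name `commit_headroom_kb` and its optional file argument are illustrative assumptions, not anything verkko provides.

```shell
#!/usr/bin/env bash
# Sketch: report how much address space (in kB) can still be committed
# before allocations start failing under overcommit mode 2.
# Reads CommitLimit and Committed_AS from /proc/meminfo by default;
# the optional argument lets you point it at a saved copy instead.
commit_headroom_kb() {
    local meminfo="${1:-/proc/meminfo}"
    awk '
        $1 == "CommitLimit:"  { limit = $2 }
        $1 == "Committed_AS:" { used  = $2 }
        END { printf "%.0f\n", limit - used }
    ' "$meminfo"
}
```

Running `commit_headroom_kb` just before starting the assembly gives a rough idea of how much the process can allocate. A small or negative value would be consistent with std::bad_alloc even when free RAM looks plentiful, because address space reserved by other processes counts against the limit whether or not they actually use it.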

However, as soon as possible, I will retry the same command line (all defaults) with the same data on a machine with roughly 2-4x more RAM, and I will let you know the result.

As for monitoring memory usage, besides watching [h]top, I would suggest updating verkko so that it prefixes the commands/scripts it runs with \time -v (or /usr/bin/time -v or similar). That way, you would get the Maximum resident set size (kbytes) figure even when a run fails. Perhaps snakemake has something like this too.
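A minimal sketch of that idea, assuming GNU time is installed at /usr/bin/time: its -o option writes the report to a file, and the peak memory can then be extracted from it. The helper name `max_rss_kb` and the report file name `buildGraph.time` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: pull the peak resident set size (in kB) out of a GNU time -v report.
# The report would be produced by a wrapped command such as:
#   /usr/bin/time -v -o buildGraph.time ./buildGraph.sh
# (buildGraph.time is a hypothetical output file name.)
max_rss_kb() {
    # The relevant line looks like:
    #   "\tMaximum resident set size (kbytes): 123456"
    awk -F': ' '/Maximum resident set size \(kbytes\)/ { print $2 }' "$1"
}
```

GNU time still writes the report when the wrapped command aborts, so `max_rss_kb buildGraph.time` would show how much memory the step had reached before the crash.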

skoren commented 2 years ago

Were you able to test this on a larger node?

AndreaGuarracino commented 2 years ago

Hi @skoren, I haven't been able to run the same test on a larger node yet (the other cluster doesn't like snakemake).

Since the node where I did the first test now has more RAM available (~ 210GB), what I did was to update verkko to the 1633e6e5a07202e64ff48c3866ab0fc05c308d7d commit, delete the 1-buildGraph/ folder and re-run the same command line, that is verkko -d shr_verkko --hifi m64247_210428_035639.ccs.fq.gz --nano fastq.tar.gz --threads 48.

The graph was built successfully, and it is now graph-aligning the ONT reads against the 2-processGraph/unitig-unrolled-hifi-resolved.gfa graph. Watching htop during graph building, I didn't see large memory consumption, so I am not sure why the process died the first time. The problem seems to have solved itself.

skoren commented 2 years ago

OK, I'll close this then but feel free to open a new issue if you encounter other errors or errors on other samples.