tlemane / kmtricks

modular k-mer count matrix and Bloom filter construction for large read collections
GNU Affero General Public License v3.0

Terminate called after throwing an instance of 'std::runtime_error'; Unable to open superkmers/xjin_AB_P0R2c/skp.90 #33

Open bsmith89 opened 5 months ago

bsmith89 commented 5 months ago

I'm getting the following error when I run kmtricks with 276 samples:

$ kmtricks pipeline --kmer-size 111 --hard-min 0 --share-min 1 --soft-min 2 --recurrence-min 3 --file data/group/xjin/r.proc.kmtricks_input.txt --run-dir <DIR> --mode kmer:count:text --threads 24
[2024-03-22 09:26:07.016] [info] Run with Kmer<128> - uint64_t[4] implementation
[2024-03-22 09:26:07.069] [info] Compute configuration...
[2024-03-22 09:26:07.069] [info] 276 samples found (552 read files).
[2024-03-22 09:26:47.828] [info] Use 156 partitions.
[2024-03-22 09:26:48.117] [info] Compute minimizer repartition...
Compute SuperK   [>                                                 ] [00m:00s]                                    
Count partitions [>                                                 ] [00:00s]                                     
terminate called after throwing an instance of 'std::runtime_error'
  what():  Unable to open <DIR>/superkmers/xjin_AB_P0R2c/skp.90
terminate called recursively
[2024-03-22 09:30:47.682] [error] Killed after receive Aborted:SIGABRT(6) signal. Demangled backtrace dumped at ./kmtricks_backtrace.log. If the problem persists, please open an issue with the return of 'kmtricks infos' and the content of ./kmtricks_backtrace.log

./kmtricks_backtrace.log:

Backtrace:
1 0x00007f8d88e41090 (null) + 140245863764112
2 0x00007f8d88e4100b gsignal + 203
3 0x00007f8d88e20859 abort + 299
4 0x00007f8d89213f00 __gnu_cxx::__verbose_terminate_handler() + 192
5 0x00007f8d8921243c (null) + 140245867766844
6 0x00007f8d8921248e (null) + 140245867766926
7 0x000055a817e6c705 (null) + 94180443866885
8 0x00007f8d88e448a7 (null) + 140245863778471
9 0x00007f8d88e44a60 on_exit + 0
10 0x000055a817f2e98a (null) + 94180444662154
11 0x00007f8d88e41090 (null) + 140245863764112
12 0x00007f8d88edb23f clock_nanosleep + 223
13 0x00007f8d88ee0ec7 nanosleep + 23
14 0x000055a817f3f95b (null) + 94180444731739
15 0x000055a8180b14d0 (null) + 94180446246096
16 0x000055a8180b2165 (null) + 94180446249317
17 0x000055a8180b9443 (null) + 94180446278723
18 0x000055a817e52afb main + 1419
19 0x00007f8d88e22083 __libc_start_main + 243
20 0x000055a817e555e5 (null) + 94180443772389

infos:

$ kmtricks infos
kmtricks v1.4.0

- HOST -
build host: Linux-6.1.3
run host: Linux 4.18.0-513.11.1.el8_9.x86_64

- BUILD -
c compiler: GNU 11.2.0
cxx compiler: GNU 11.2.0
conda: ON
static: OFF
native: OFF
modules: ON
socks: ON
howde: ON
dev: OFF
kmer: 32,64,96,128,160,192,224,256
max_c: 4294967295

- GIT SHA1 / VERSION -
kmtricks: 7dc4d18
sdsl: c32874c
bcli: 3e4f493
fmt: 0544a227
kff: 97d135e
lz4: 4de56b3
spdlog: v1.2.1-1811-g5b4c4f3f
xxhash: 6853ddc
gtest: release-1.8.0-2774-g96f4ce02
croaring: v0.3.3-17-g2d5c927
robin-hood-hasing: 24b3f50
turbop: 4ab9f5b
cfrcat: 2f9da97
indicators: v1.9-36-gcdcff01

Contact: teo.lemane@inria.fr

I don't get the same error when I use a subset of 8 samples (instead of 276), nor when I use just 12 threads (instead of 24). This seems pretty clearly related to Issue #15, where kmtricks opens too many files. Indeed, lsof confirms this: the crash occurs just as the number of open files ramps up. When I use 12 threads, fewer than 1000 files are opened and it doesn't crash.
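For anyone who wants to reproduce the open-file observation without lsof, here is a minimal sketch that counts a process's open file descriptors and prints them next to the soft limit. It assumes a Linux host with `/proc` mounted; the helper name `count_open_fds` is made up for illustration and is not part of kmtricks.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the file descriptors currently open by a process.
# Linux only: reads /proc/<pid>/fd, so it needs permission to inspect that PID.
count_open_fds() {
    local pid="$1"
    ls "/proc/${pid}/fd" 2>/dev/null | wc -l
}

# Soft limit on open files for the current shell (what `ulimit -n` reports).
soft_limit="$(ulimit -Sn)"

# Example: report usage for the current shell itself.
echo "open fds: $(count_open_fds "$$") / soft limit: ${soft_limit}"
```

To watch a running kmtricks process instead, pass its PID (for example from `pgrep -n kmtricks`) in place of `$$` and call the function in a loop.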

While using fewer threads works, I'd love a solution that maintains the high parallelization during other steps. Can you suggest a way to run the kmtricks pipeline so that the superkmers computation step doesn't open too many files, but I get as much parallelization as possible?

Thanks for your help, and for building a valuable tool!

bsmith89 commented 5 months ago

Just to further confirm that the number of open files is the problem:

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 4075957
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4075957
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

With 12 threads, I still haven't exceeded 1000 open files (currently running, but it's gotten past the point where the 24-thread run consistently fails).
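One possible workaround to try, sketched under assumptions: the soft limit of 1024 shown above can usually be raised up to the hard limit without root. Whether the hard limit on this host is high enough, and whether raising it is sufficient for kmtricks with 24 threads, is not established in this thread.

```shell
#!/usr/bin/env bash
# Raise the soft open-files limit for this shell (and its child processes)
# up to the hard limit. This needs no extra privileges as long as the new
# value stays at or below the hard limit.
hard_limit="$(ulimit -Hn)"
ulimit -Sn "${hard_limit}"
echo "open files: soft=$(ulimit -Sn) hard=${hard_limit}"

# Then launch kmtricks from this same shell with the original arguments,
# since the new limit only applies to this shell and its children.
```

If the hard limit itself is too low, it has to be raised by an administrator (for example through the system's limits configuration), which is outside what this sketch covers.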

bsmith89 commented 5 months ago

The run with 12 threads did eventually fail (4 hours later):

$ kmtricks pipeline --kmer-size 111 --hard-min 0 --share-min 1 --soft-min 2 --recurrence-min 3 --file data/group/xjin/r.proc.kmtricks_input.txt --run-dir $tmpdir --mode kmer:count:text --threads 12
[2024-03-22 09:32:04.718] [info] Run with Kmer<128> - uint64_t[4] implementation
[2024-03-22 09:32:04.781] [info] Compute configuration...
[2024-03-22 09:32:04.781] [info] 276 samples found (552 read files).
[2024-03-22 09:32:44.950] [info] Use 156 partitions.
[2024-03-22 09:32:45.632] [info] Compute minimizer repartition...
Compute SuperK   [==================================================] [02h:22m:42s]
Count partitions [==================================================] [02h:22m:44s]
Merge partitions [>                                                 ] [00:00s]
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called after throwing an instance of 'std::runtime_error'
  what():  <DIR>/counts/partition_2/xjin_AS_P2R2c.kmer
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
[2024-03-22 12:00:21.549] [error] Killed after receive Aborted:SIGABRT(6) signal. Demangled backtrace dumped at ./kmtricks_backtrace.log. If the problem persists, please open an issue with the return of 'kmtricks infos' and the content of ./kmtricks_backtrace.log
[2024-03-22 12:00:21.549] [error] Killed after receive Aborted:SIGABRT(6) signal. Demangled backtrace dumped at ./kmtricks_backtrace.log. If the problem persists, please open an issue with the return of 'kmtricks infos' and the content of ./kmtricks_backtrace.log
[2024-03-22 12:00:21.550] [error] Killed after receive Aborted:SIGABRT(6) signal. Demangled backtrace dumped at ./kmtricks_backtrace.log. If the problem persists, please open an issue with the return of 'kmtricks infos' and the content of ./kmtricks_backtrace.log

backtrace:

Backtrace:
1 0x00007f8d88e41090 (null) + 140245863764112
2 0x00007f8d88e4100b gsignal + 203
3 0x00007f8d88e20859 abort + 299
4 0x00007f8d89213f00 __gnu_cxx::__verbose_terminate_handler() + 192
5 0x00007f8d8921243c (null) + 140245867766844
6 0x00007f8d8921248e (null) + 140245867766926
7 0x000055a817e6c705 (null) + 94180443866885
8 0x00007f8d88e448a7 (null) + 140245863778471
9 0x00007f8d88e44a60 on_exit + 0
10 0x000055a817f2e98a (null) + 94180444662154
11 0x00007f8d88e41090 (null) + 140245863764112
12 0x00007f8d88edb23f clock_nanosleep + 223
13 0x00007f8d88ee0ec7 nanosleep + 23
14 0x000055a817f3f95b (null) + 94180444731739
15 0x000055a8180b14d0 (null) + 94180446246096
16 0x000055a8180b2165 (null) + 94180446249317
17 0x000055a8180b9443 (null) + 94180446278723
18 0x000055a817e52afb main + 1419
19 0x00007f8d88e22083 __libc_start_main + 243
20 0x000055a817e555e5 (null) + 94180443772389