rgcgithub / regenie

regenie is a C++ program for whole genome regression modelling of large genome-wide association studies.
https://rgcgithub.github.io/regenie

segmentation fault using LD computation #533

Open josemq opened 1 month ago

josemq commented 1 month ago

Hello,

We're randomly receiving the following error in our HPC environment when Regenie is executed by our researchers:

Script output: /var/spool/slurmd/job484610/slurm_script: line 25: 21475 Segmentation fault /home/rd2972/software/regenie/regenie --bgen ${bgen_path}/01_118839067_144977494_unrelated_EUR_47336_individuals.bgen --sample /home/rd2972/private_data/20240323_UKBB_proteomics/20240611_RAP_download/unrelated_white_47336_individuals_iid_fid_missing_sex.sample --bsize 1000 --compute-corr --out ${output_path}/01_118839067_144977494_unrelated_EUR_47336_individuals

Syslog:
Jun 27 03:42:48 node66 kernel: [36934547.575843] regenie[21475]: segfault at 7fa1b1115386 ip 00005645dc8f3eec sp 00007ffe8f4a5bb0 error 6 in regenie[5645dc86b000+4c9000]
Jun 27 04:45:22 node66 kernel: [36938302.027969] regenie[41917]: segfault at 7ef6cd3d4022 ip 0000565440a4ceec sp 00007fffbf955720 error 6 in regenie[5654409c4000+4c9000]
Jun 27 10:31:18 node66 kernel: [36959057.185572] regenie[38418]: segfault at 7f0a68d99eba ip 000055ab13da4eec sp 00007ffc6ece3ca0 error 6 in regenie[55ab13d1c000+4c9000]
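
The three kernel entries can be compared by subtracting the mapping base (the hex value before `+4c9000`) from the instruction pointer (`ip`). A minimal sketch, assuming bash with 64-bit arithmetic; `crash_offset` is a hypothetical helper name, not part of regenie:

```shell
#!/bin/bash
# crash_offset IP BASE: print the offset of the faulting instruction inside
# the mapped binary, given the hex "ip" value and the hex mapping base taken
# from a kernel segfault line. (Hypothetical helper for illustration.)
crash_offset() {
  local ip=$1 base=$2
  printf '0x%x\n' $(( 16#$ip - 16#$base ))
}

# ip / base pairs copied from the three syslog entries above
crash_offset 00005645dc8f3eec 5645dc86b000
crash_offset 0000565440a4ceec 5654409c4000
crash_offset 000055ab13da4eec 55ab13d1c000
# each call prints 0x88eec
```

All three crashes resolve to the same offset, so they most likely hit the same instruction on every run. With a build that has debug symbols, that offset could be a starting point for `addr2line` or a debugger, although it may need adjusting if the mapping does not begin at file offset 0. The `error 6` field is the x86 page-fault code for a user-mode write to a page that is not mapped.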

Thank you, Jose

RuiDongDR commented 1 month ago

To add to what @josemq mentioned above, here is the command we use:

~/regenie \
--bgen ~/test.bgen \
--sample ~/test.sample \
--bsize 1000 \
--compute-corr \
--out  ~/test

We are using the most recent version of REGENIE, v3.4.1.

We found that the regions that produce the error @josemq posted above all contain more than 65800 variants, i.e., they are large regions. Looking at the other output files, REGENIE computes the LD for all SV blocks successfully but runs into a problem when writing the result to the output file:

** Computing LD matrix **
  -> splitting across 113 SV blocks
     - row 1
       -> LD diagonal block computation......done (44955ms)
       -> computing LD with other variants (112 blocks)... done (8383863ms)
     - row 2
       -> LD diagonal block computation......done (81388ms)
       -> computing LD with other variants (111 blocks)... done (6268485ms)
     - row 3
       -> LD diagonal block computation......done (45037ms)
       -> computing LD with other variants (110 blocks)... done (6162939ms)
 ...
      - row 113
       -> LD diagonal block computation......done (90899ms)
     - writing to file...

The error message reports:

 /var/spool/slurmd/job484610/slurm_script: line 25: 21475 Segmentation fault      ~/test.bgen --sample ~/test.sample --bsize 1000 --compute-corr --out ~/test

The job ran for about 83 hours on this region before it reached this last step.

So the issue appears to be in the last step, when REGENIE writes to the file. Is there any way to remove this limitation so that it can handle large regions?
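
One possible explanation for a size-dependent threshold is an element count that no longer fits in a signed 32-bit integer. This is an assumption and is not confirmed anywhere in this thread. The sketch below only does the arithmetic: the number of distinct variant pairs, n(n-1)/2, first exceeds 2^31 - 1 at n = 65537, which is close to the ~65800 variants reported above. `pair_count` and `exceeds_int32` are hypothetical helper names.

```shell
#!/bin/bash
INT32_MAX=2147483647

# Number of distinct variant pairs (strict upper triangle of an n x n matrix).
pair_count() {
  local n=$1
  echo $(( n * (n - 1) / 2 ))
}

# Succeeds (exit status 0) when the pair count does not fit in a signed
# 32-bit integer. Whether regenie indexes this way is NOT known; assumption.
exceeds_int32() {
  (( $(pair_count "$1") > INT32_MAX ))
}

for n in 65536 65537 65800 77185; do
  if exceeds_int32 "$n"; then
    echo "n=$n pairs=$(pair_count "$n") exceeds INT32_MAX"
  else
    echo "n=$n pairs=$(pair_count "$n") fits"
  fi
done
```

If this guess were right, regions just below 65537 variants would succeed and regions just above would fail, which would be a cheap way to test it before digging into the source.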

Thanks!

joellembatchou commented 1 month ago

Hello,

Have you monitored memory usage or the available disk space?

RuiDongDR commented 1 month ago

Yes. For this job we assigned 150G, which is about 1.5x what REGENIE estimates, and we don't see a memory problem:

JobID             User                                  JobName    Account    Cluster  ReqCPUS     ReqMem    Elapsed                State ExitCode
------------ --------- ---------------------------------------- ---------- ---------- -------- ---------- ---------- -------------------- --------
484610          rd2972 test                             csg_lab   neurohpc        1       150G 3-11:50:18            COMPLETED      0:0

rd2972@csglogin:~/$ seff 484610
Job ID: 484610
Cluster: neurohpc
User/Group: rd2972/rd2972
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 6-14:38:46
CPU Efficiency: 94.61% of 6-23:40:36 core-walltime
Job Wall-clock time: 3-11:50:18
Memory Utilized: 98.35 GB
Memory Efficiency: 65.57% of 150.00 GB
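
Note that the accounting output above reports COMPLETED with exit code 0:0 even though the same job (484610) logged a segmentation fault. A bash script's exit status is that of its last command, and the script keeps running after regenie dies. A minimal sketch of propagating the failure, where `run_checked` is a hypothetical wrapper name:

```shell
#!/bin/bash
# run_checked CMD [ARGS...]: run a command and return its exit status,
# printing a message to stderr on failure. A process killed by SIGSEGV is
# reported by bash as status 139 (128 + signal number 11).
run_checked() {
  "$@"
  local status=$?
  if [ "$status" -ne 0 ]; then
    echo "command failed with status $status: $1" >&2
  fi
  return "$status"
}

# In the job script, stop with the failing status instead of continuing:
#   run_checked /path/to/regenie --bgen ... --compute-corr --out ... || exit $?
```

With a wrapper like this, the scheduler would record the job as failed, which makes crashed regions easier to find across many submitted jobs.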

joellembatchou commented 1 month ago

How about disk space (since the error seems to occur during the file writing)?

RuiDongDR commented 1 month ago

Here:

rd2972@csglogin:~$ df -h
Filesystem                          Size  Used Avail Use% Mounted on
udev                                252G     0  252G   0% /dev
tmpfs                                51G  4.2G   47G   9% /run
/dev/sda1                           449G   65G  361G  16% /
tmpfs                               252G  688K  252G   1% /dev/shm
tmpfs                               5.0M     0  5.0M   0% /run/lock
tmpfs                               252G     0  252G   0% /sys/fs/cgroup
mfsmaster:9421/vols/hgrcgrid        3.2P  747T  2.5P  24% /mnt/mfs/hgrcgrid
mfsmaster:9421/vols/server_cluster  7.4P  4.6P  2.8P  63% /mnt/mfs/cluster
mfsmaster:9421/vols/ctcn            910T  824T   87T  91% /mnt/mfs/ctcn
hpc.vast.neuro.columbia.edu:/hpc    6.3P  5.3P  938T  86% /mnt/vast/hpc
tmpfs                                51G  4.0K   51G   1% /run/user/111
rd2972@csglogin:~$ df -h ~rd2972
Filesystem                        Size  Used Avail Use% Mounted on
hpc.vast.neuro.columbia.edu:/hpc  6.3P  5.3P  938T  86% /mnt/vast/hpc

Is this enough? Thanks!

joellembatchou commented 1 month ago

Is the command using --out ~/test or was this just a template command? Also can you include the full log for one of the failed runs? Thank you

RuiDongDR commented 1 month ago

Hi @joellembatchou, I replaced my full path with test in the command above. Here are the full script and log outputs. We ran this for over 1000 regions, and only those with >= 65800 variants have this issue. (In the .log and .out files below, I removed most of the rows because they were too long; they have the same format as the rows I posted, e.g. row 1.)

Script that I submitted through SLURM

#!/bin/bash
#SBATCH --job-name=recalc_LD_03_168580960_170964909_unrelated_EUR_47336_individuals
#SBATCH --mem=61G
#SBATCH --time=360:00:00
#SBATCH --output=/home/rd2972/recalc_LD_03_168580960_170964909_unrelated_EUR_47336_individuals_%j.out
#SBATCH --error=/home/rd2972/recalc_LD_03_168580960_170964909_unrelated_EUR_47336_individuals_%j.err
#SBATCH -p CSG

echo "Step 1 and 2 are skipped."

echo "=============================================================================================="
# 3. Calculate LD matrix
echo "3. Calculate the LD matrix of 03_168580960_170964909"
echo "Current timestamp: $(date +"%Y-%m-%d %H:%M:%S")"
time2=$(date +%s)

/home/rd2972/software/regenie/regenie \
--bgen /home/rd2972/03_168580960_170964909_unrelated_EUR_47336_individuals.bgen \
--sample /home/rd2972/iid_fid_missing_sex.sample \
--bsize 1000 \
--compute-corr \
--out /home/rd2972/recalc_LD_03_168580960_170964909_unrelated_EUR_47336_individuals

time3=$(date +%s)
echo "Current timestamp: $(date +"%Y-%m-%d %H:%M:%S")"
time_diff3=0
time_diff_hours3=$(echo "scale=2; $time_diff3 / 3600" | bc)
echo "Step 3 takes: $time_diff3 seconds, i.e., $time_diff_hours3 hours."

Output files

.err file

/var/spool/slurmd/job494637/slurm_script: line 22: 28565 Segmentation fault      /home/rd2972/software/regenie/regenie --bgen /home/rd2972/03_168580960_170964909_unrelated_EUR_47336_individuals.bgen --sample /home/rd2972/iid_fid_missing_sex.sample --bsize 1000 --compute-corr --out /home/rd2972/recalc_LD_03_168580960_170964909_unrelated_EUR_47336_individuals

.out file

3. Calculate the LD matrix of 03_168580960_170964909
Current timestamp: 2024-06-27 05:14:30
Start time: Thu Jun 27 05:14:30 2024

              |============================|
              |        REGENIE v3.4.1      |
              |============================|

Copyright (c) 2020-2024 Joelle Mbatchou, Andrey Ziyatdinov and Jonathan Marchini.
Distributed under the MIT License.

Log of output saved in file : /home/rd2972/recalc_LD_03_168580960_170964909_unrelated_EUR_47336_individuals.log

Options in effect:
  --bgen /home/rd2972/03_168580960_170964909_unrelated_EUR_47336_individuals.bgen \
  --sample /home/rd2972/iid_fid_missing_sex.sample \
  --bsize 1000 \
  --compute-corr \
  --out /home/rd2972/recalc_LD_03_168580960_170964909_unrelated_EUR_47336_individuals

LD computation with multithreading using OpenMP
 * bgen             : [/home/rd2972/03_168580960_170964909_unrelated_EUR_47336_individuals.bgen]
   -summary : bgen file (v1.2 layout, zlib compressed) with 47336 named samples and 77395 variants with 16-bit encoding.
   -index bgi file [/home/rd2972/03_168580960_170964909_unrelated_EUR_47336_individuals.bgen.bgi]
   -sample file: /home/rd2972/iid_fid_missing_sex.sample
 * number of individuals used in analysis = 47336
 * # threads        : [63]
 * block size       : [1000]
 * approximate memory usage : 46GB
 * computing correlation matrix in dosage mode (storing R^2 values)
  + output to binary file [/home/rd2972/recalc_LD_03_168580960_170964909_unrelated_EUR_47336_individuals.corr]
  + list of snps written to [/home/rd2972/recalc_LD_03_168580960_170964909_unrelated_EUR_47336_individuals.corr.snplist]
  + n_snps = 77185

** Computing LD matrix **
  -> splitting across 78 SV blocks
     - row 1
       -> LD diagonal block computation......done (143581ms)
       -> computing LD with other variants (77 blocks)... done (12167525ms)
     - row 2
       -> LD diagonal block computation......done (140216ms)
       -> computing LD with other variants (76 blocks)... done (11482459ms)
....
     - row 77
       -> LD diagonal block computation......done (132622ms)
       -> computing LD with other variants (1 blocks)... done (27826ms)
     - row 78
       -> LD diagonal block computation......done (27771ms)
     - writing to file...Current timestamp: 2024-07-02 06:25:35
Step 3 takes: 0 seconds, i.e., 0 hours.

.log file

Start time: Thu Jun 27 05:14:30 2024

              |============================|
              |        REGENIE v3.4.1      |
              |============================|

Copyright (c) 2020-2024 Joelle Mbatchou, Andrey Ziyatdinov and Jonathan Marchini.
Distributed under the MIT License.

Log of output saved in file : /home/rd2972/recalc_LD_03_168580960_170964909_unrelated_EUR_47336_individuals.log

Options in effect:
  --bgen /home/rd2972/03_168580960_170964909_unrelated_EUR_47336_individuals.bgen \
  --sample /home/rd2972/iid_fid_missing_sex.sample \
  --bsize 1000 \
  --compute-corr \
  --out /home/rd2972/recalc_LD_03_168580960_170964909_unrelated_EUR_47336_individuals

LD computation with multithreading using OpenMP
 * bgen             : [/home/rd2972/03_168580960_170964909_unrelated_EUR_47336_individuals.bgen]
   -summary : bgen file (v1.2 layout, zlib compressed) with 47336 named samples and 77395 variants with 16-bit encoding.
   -index bgi file [/home/rd2972/03_168580960_170964909_unrelated_EUR_47336_individuals.bgen.bgi]
   -sample file: /home/rd2972/iid_fid_missing_sex.sample
 * number of individuals used in analysis = 47336
 * # threads        : [63]
 * block size       : [1000]
 * approximate memory usage : 46GB
 * computing correlation matrix in dosage mode (storing R^2 values)
  + output to binary file [/home/rd2972/recalc_LD_03_168580960_170964909_unrelated_EUR_47336_individuals.corr]
  + list of snps written to [/home/rd2972/recalc_LD_03_168580960_170964909_unrelated_EUR_47336_individuals.corr.snplist]
  + n_snps = 77185

** Computing LD matrix **
  -> splitting across 78 SV blocks
     - row 1
       -> LD diagonal block computation......done (143581ms)
       -> computing LD with other variants (77 blocks)... done (12167525ms)
     - row 2
       -> LD diagonal block computation......done (140216ms)
       -> computing LD with other variants (76 blocks)... done (11482459ms)
...
     - row 77
       -> LD diagonal block computation......done (132622ms)
       -> computing LD with other variants (1 blocks)... done (27826ms)
     - row 78
       -> LD diagonal block computation......done (27771ms)
     - writing to file...

Thanks for the help!