Open josemq opened 1 month ago
To add to what @josemq mentioned above, here is the command we use
~/regenie \
--bgen ~/test.bgen \
--sample ~/test.sample \
--bsize 1000 \
--compute-corr \
--out ~/test
and we are using the most up-to-date version of REGENIE, v3.4.1.
We found that the regions reporting the error @josemq sent above all contain over 65800 variants, i.e., they are large regions. Looking at the other output files, REGENIE actually computes the LD for all SV blocks successfully but hits an issue when writing it to the output:
** Computing LD matrix **
-> splitting across 113 SV blocks
- row 1
-> LD diagonal block computation......done (44955ms)
-> computing LD with other variants (112 blocks)... done (8383863ms)
- row 2
-> LD diagonal block computation......done (81388ms)
-> computing LD with other variants (111 blocks)... done (6268485ms)
- row 3
-> LD diagonal block computation......done (45037ms)
-> computing LD with other variants (110 blocks)... done (6162939ms)
...
- row 113
-> LD diagonal block computation......done (90899ms)
- writing to file...
and in the error message it reports
/var/spool/slurmd/job484610/slurm_script: line 25: 21475 Segmentation fault ~/test.bgen --sample ~/test.sample --bsize 1000 --compute-corr --out ~/test
The job ran for about 83 hours on this region before reaching the last step.
So the issue appears to be in the last step, when REGENIE writes to the file. Is there any way to remove this limitation so it can handle large regions?
Thanks!
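One hypothesis that would be consistent with a threshold near 65800 variants (this is an assumption, not something confirmed in this thread or checked against the REGENIE source): if the writer indexes the pairwise LD entries with a signed 32-bit integer, the count of variant pairs first exceeds INT32_MAX just above 65536 variants. The sketch below only does that arithmetic; the function name tri_entries is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical check: does the number of variant pairs fit in a signed 32-bit int?
# ASSUMPTION: the output stage indexes n*(n-1)/2 entries with an int32.
# This has not been verified against the REGENIE code.
INT32_MAX=2147483647

# Number of distinct variant pairs (strict upper triangle of an n x n matrix).
tri_entries() {
  local n=$1
  echo $(( n * (n - 1) / 2 ))
}

# 65800 is the smallest failing size reported above; 77185 is n_snps in the posted log.
for n in 65536 65537 65800 77185; do
  entries=$(tri_entries "$n")
  if (( entries > INT32_MAX )); then
    echo "$n variants: $entries entries (exceeds INT32_MAX)"
  else
    echo "$n variants: $entries entries (fits in int32)"
  fi
done
```

If this is the cause, splitting a region so that each run stays below roughly 65536 variants might work around it, but that is untested here.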
Hello,
Have you monitored memory usage or the available disk space?
Yes. For this one we assigned 150G, which is about 1.5x what REGENIE estimates, so we don't really see an issue here.
JobID User JobName Account Cluster ReqCPUS ReqMem Elapsed State ExitCode
------------ --------- ---------------------------------------- ---------- ---------- -------- ---------- ---------- -------------------- --------
484610 rd2972 test csg_lab neurohpc 1 150G 3-11:50:18 COMPLETED 0:0
rd2972@csglogin:~/$ seff 484610
Job ID: 484610
Cluster: neurohpc
User/Group: rd2972/rd2972
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 6-14:38:46
CPU Efficiency: 94.61% of 6-23:40:36 core-walltime
Job Wall-clock time: 3-11:50:18
Memory Utilized: 98.35 GB
Memory Efficiency: 65.57% of 150.00 GB
How about disk space (since the error seems to occur during the file writing)?
Here:
rd2972@csglogin:~$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 252G 0 252G 0% /dev
tmpfs 51G 4.2G 47G 9% /run
/dev/sda1 449G 65G 361G 16% /
tmpfs 252G 688K 252G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 252G 0 252G 0% /sys/fs/cgroup
mfsmaster:9421/vols/hgrcgrid 3.2P 747T 2.5P 24% /mnt/mfs/hgrcgrid
mfsmaster:9421/vols/server_cluster 7.4P 4.6P 2.8P 63% /mnt/mfs/cluster
mfsmaster:9421/vols/ctcn 910T 824T 87T 91% /mnt/mfs/ctcn
hpc.vast.neuro.columbia.edu:/hpc 6.3P 5.3P 938T 86% /mnt/vast/hpc
tmpfs 51G 4.0K 51G 1% /run/user/111
rd2972@csglogin:~$ df -h ~rd2972
Filesystem Size Used Avail Use% Mounted on
hpc.vast.neuro.columbia.edu:/hpc 6.3P 5.3P 938T 86% /mnt/vast/hpc
Is this enough? Thanks!
Is the command using --out ~/test, or was this just a template command? Also, can you include the full log for one of the failed runs? Thank you
Hi @joellembatchou, I replaced my full path with test in the command above. Here is the full script and log outputs. We ran this for over 1000 regions and only those with >=65800 variants have this issue.
(In the .log and .out files below, I removed most of the rows as the output was too long; they have the same format as row 1 etc. that I posted.)
#!/bin/bash
#SBATCH --job-name=recalc_LD_03_168580960_170964909_unrelated_EUR_47336_individuals
#SBATCH --mem=61G
#SBATCH --time=360:00:00
#SBATCH --output=/home/rd2972/recalc_LD_03_168580960_170964909_unrelated_EUR_47336_individuals_%j.out
#SBATCH --error=/home/rd2972/recalc_LD_03_168580960_170964909_unrelated_EUR_47336_individuals_%j.err
#SBATCH -p CSG
echo "Step 1 and 2 are skipped."
echo "=============================================================================================="
# 3. Calculate LD matrix
echo "3. Calculate the LD matrix of 03_168580960_170964909"
echo "Current timestamp: $(date +"%Y-%m-%d %H:%M:%S")"
time2=$(date +%s)
/home/rd2972/software/regenie/regenie \
--bgen /home/rd2972/03_168580960_170964909_unrelated_EUR_47336_individuals.bgen \
--sample /home/rd2972/iid_fid_missing_sex.sample \
--bsize 1000 \
--compute-corr \
--out /home/rd2972/recalc_LD_03_168580960_170964909_unrelated_EUR_47336_individuals
time3=$(date +%s)
echo "Current timestamp: $(date +"%Y-%m-%d %H:%M:%S")"
time_diff3=0
time_diff_hours3=$(echo "scale=2; $time_diff3 / 3600" | bc)
echo "Step 3 takes: $time_diff3 seconds, i.e., $time_diff_hours3 hours."
.err file:
/var/spool/slurmd/job494637/slurm_script: line 22: 28565 Segmentation fault /home/rd2972/software/regenie/regenie --bgen /home/rd2972/03_168580960_170964909_unrelated_EUR_47336_individuals.bgen --sample /home/rd2972/iid_fid_missing_sex.sample --bsize 1000 --compute-corr --out /home/rd2972/recalc_LD_03_168580960_170964909_unrelated_EUR_47336_individuals
.out file:
3. Calculate the LD matrix of 03_168580960_170964909
Current timestamp: 2024-06-27 05:14:30
Start time: Thu Jun 27 05:14:30 2024
|============================|
| REGENIE v3.4.1 |
|============================|
Copyright (c) 2020-2024 Joelle Mbatchou, Andrey Ziyatdinov and Jonathan Marchini.
Distributed under the MIT License.
Log of output saved in file : /home/rd2972/recalc_LD_03_168580960_170964909_unrelated_EUR_47336_individuals.log
Options in effect:
--bgen /home/rd2972/03_168580960_170964909_unrelated_EUR_47336_individuals.bgen \
--sample /home/rd2972/iid_fid_missing_sex.sample \
--bsize 1000 \
--compute-corr \
--out /home/rd2972/recalc_LD_03_168580960_170964909_unrelated_EUR_47336_individuals
LD computation with multithreading using OpenMP
* bgen : [/home/rd2972/03_168580960_170964909_unrelated_EUR_47336_individuals.bgen]
-summary : bgen file (v1.2 layout, zlib compressed) with 47336 named samples and 77395 variants with 16-bit encoding.
-index bgi file [/home/rd2972/03_168580960_170964909_unrelated_EUR_47336_individuals.bgen.bgi]
-sample file: /home/rd2972/iid_fid_missing_sex.sample
* number of individuals used in analysis = 47336
* # threads : [63]
* block size : [1000]
* approximate memory usage : 46GB
* computing correlation matrix in dosage mode (storing R^2 values)
+ output to binary file [/home/rd2972/recalc_LD_03_168580960_170964909_unrelated_EUR_47336_individuals.corr]
+ list of snps written to [/home/rd2972/recalc_LD_03_168580960_170964909_unrelated_EUR_47336_individuals.corr.snplist]
+ n_snps = 77185
** Computing LD matrix **
-> splitting across 78 SV blocks
- row 1
-> LD diagonal block computation......done (143581ms)
-> computing LD with other variants (77 blocks)... done (12167525ms)
- row 2
-> LD diagonal block computation......done (140216ms)
-> computing LD with other variants (76 blocks)... done (11482459ms)
....
- row 77
-> LD diagonal block computation......done (132622ms)
-> computing LD with other variants (1 blocks)... done (27826ms)
- row 78
-> LD diagonal block computation......done (27771ms)
- writing to file...Current timestamp: 2024-07-02 06:25:35
Step 3 takes: 0 seconds, i.e., 0 hours.
.log file:
Start time: Thu Jun 27 05:14:30 2024
|============================|
| REGENIE v3.4.1 |
|============================|
Copyright (c) 2020-2024 Joelle Mbatchou, Andrey Ziyatdinov and Jonathan Marchini.
Distributed under the MIT License.
Log of output saved in file : /home/rd2972/recalc_LD_03_168580960_170964909_unrelated_EUR_47336_individuals.log
Options in effect:
--bgen /home/rd2972/03_168580960_170964909_unrelated_EUR_47336_individuals.bgen \
--sample /home/rd2972/iid_fid_missing_sex.sample \
--bsize 1000 \
--compute-corr \
--out /home/rd2972/recalc_LD_03_168580960_170964909_unrelated_EUR_47336_individuals
LD computation with multithreading using OpenMP
* bgen : [/home/rd2972/03_168580960_170964909_unrelated_EUR_47336_individuals.bgen]
-summary : bgen file (v1.2 layout, zlib compressed) with 47336 named samples and 77395 variants with 16-bit encoding.
-index bgi file [/home/rd2972/03_168580960_170964909_unrelated_EUR_47336_individuals.bgen.bgi]
-sample file: /home/rd2972/iid_fid_missing_sex.sample
* number of individuals used in analysis = 47336
* # threads : [63]
* block size : [1000]
* approximate memory usage : 46GB
* computing correlation matrix in dosage mode (storing R^2 values)
+ output to binary file [/home/rd2972/recalc_LD_03_168580960_170964909_unrelated_EUR_47336_individuals.corr]
+ list of snps written to [/home/rd2972/recalc_LD_03_168580960_170964909_unrelated_EUR_47336_individuals.corr.snplist]
+ n_snps = 77185
** Computing LD matrix **
-> splitting across 78 SV blocks
- row 1
-> LD diagonal block computation......done (143581ms)
-> computing LD with other variants (77 blocks)... done (12167525ms)
- row 2
-> LD diagonal block computation......done (140216ms)
-> computing LD with other variants (76 blocks)... done (11482459ms)
...
- row 77
-> LD diagonal block computation......done (132622ms)
-> computing LD with other variants (1 blocks)... done (27826ms)
- row 78
-> LD diagonal block computation......done (27771ms)
- writing to file...
Thanks for the help!
Hello,
We're randomly receiving the following error in our HPC environment when Regenie is executed by our researchers:
Script output: /var/spool/slurmd/job484610/slurm_script: line 25: 21475 Segmentation fault /home/rd2972/software/regenie/regenie --bgen ${bgen_path}/01_118839067_144977494_unrelated_EUR_47336_individuals.bgen --sample /home/rd2972/private_data/20240323_UKBB_proteomics/20240611_RAP_download/unrelated_white_47336_individuals_iid_fid_missing_sex.sample --bsize 1000 --compute-corr --out ${output_path}/01_118839067_144977494_unrelated_EUR_47336_individuals
Syslog:
Jun 27 03:42:48 node66 kernel: [36934547.575843] regenie[21475]: segfault at 7fa1b1115386 ip 00005645dc8f3eec sp 00007ffe8f4a5bb0 error 6 in regenie[5645dc86b000+4c9000]
Jun 27 04:45:22 node66 kernel: [36938302.027969] regenie[41917]: segfault at 7ef6cd3d4022 ip 0000565440a4ceec sp 00007fffbf955720 error 6 in regenie[5654409c4000+4c9000]
Jun 27 10:31:18 node66 kernel: [36959057.185572] regenie[38418]: segfault at 7f0a68d99eba ip 000055ab13da4eec sp 00007ffc6ece3ca0 error 6 in regenie[55ab13d1c000+4c9000]
Thank you, Jose
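The syslog entries can be read a bit further. As a minimal sketch, assuming the usual Linux x86 kernel format for these messages: subtracting the mapping base (the first number in regenie[base+size]) from the instruction pointer gives the offset of the faulting instruction inside the binary, and the error value is the page-fault error code, whose low three bits encode not-present vs. protection fault, read vs. write, and kernel vs. user mode. The helper names fault_offset and decode_error are made up for illustration.

```shell
#!/usr/bin/env bash
# Offset of the faulting instruction within the binary:
# ip minus the mapping base, both given as hex strings without "0x".
fault_offset() {
  local ip=$1 base=$2
  printf '0x%x\n' $(( 0x$ip - 0x$base ))
}

# Decode the low three bits of an x86 page-fault error code.
decode_error() {
  local code=$1
  (( code & 1 )) && echo "protection violation" || echo "page not present"
  (( code & 2 )) && echo "write access" || echo "read access"
  (( code & 4 )) && echo "user mode" || echo "kernel mode"
}

# ip and base values copied from the three syslog lines above.
fault_offset 00005645dc8f3eec 5645dc86b000
fault_offset 0000565440a4ceec 5654409c4000
fault_offset 000055ab13da4eec 55ab13d1c000

decode_error 6
```

If all three crashes resolve to the same offset, they most likely fault at the same instruction, and error 6 would mean a user-mode write to an unmapped page. With a build that has debug symbols, something like addr2line -f -C -e regenie <offset> may map the offset to a source line; whether the installed binary carries symbols is not known from this thread.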