vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

Stuck in the autoindex command for around three days #4183

Closed linfanxiao closed 6 months ago

linfanxiao commented 7 months ago

1. What were you trying to do?

cat > index.sh
#!/bin/bash
#SBATCH -n 96                              # Request 96 tasks (cores)
#SBATCH -N 3                               # Request three nodes (if you request more than one core with -n,
                                           # adding -N 1 would keep all cores on the same node)
#SBATCH -t 14-00:00:0                         # Runtime in D-HH:MM format
#SBATCH -p fhs-highmem                          # Partition to run in
#SBATCH --mem=999G                        # Memory total per node (999 GB)
#SBATCH -o index_%j.out                 # File to which STDOUT will be written, including job ID
#SBATCH -e index_%j.err                 # File to which STDERR will be written, including job ID
#SBATCH --mail-type=FAIL,END                    # Type of email notification- BEGIN,END,FAIL,ALL
#SBATCH --mail-user=yc07671@umac.mo   # Email to which notifications will be sent
#Adding modules
ulimit -s unlimited
#Your program starts here
source ~/miniconda3/etc/profile.d/conda.sh
conda activate vg
export TMPDIR=/scratch2/$USER/$SLURM_JOB_ID
mkdir -p $TMPDIR
vg autoindex -t 96 -V 2 --workflow mpmap --workflow rpvg --prefix grch38 --ref-fasta GRCh38.primary_assembly.genome.fa --vcf hprc-v1.1-mc-grch38.raw.vcf.gz --tx-gff gencode.v44.primary_assembly.annotation.gtf -M 999G -T $TMPDIR
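A stall like this can be caused by the temp or output filesystem filling up. As a minimal sketch (not part of the original job script; the 200 GiB threshold and the `check_space` name are illustrative assumptions, not vg requirements), one could add a pre-flight free-space check before the `vg autoindex` call:

```shell
#!/bin/bash
# Hedged sketch: pre-flight free-space check before running vg autoindex.
# The 200 GiB threshold is an assumption, not a documented vg requirement.
need_gib=200

check_space() {
    # Print available space in GiB on the filesystem holding $1 (GNU df).
    local avail_kib
    avail_kib=$(df --output=avail -k "$1" | tail -n 1)
    echo $(( avail_kib / 1024 / 1024 ))
}

avail=$(check_space "${TMPDIR:-.}")
if [ "$avail" -lt "$need_gib" ]; then
    echo "warning: only ${avail} GiB free in ${TMPDIR:-.}; vg autoindex may stall" >&2
fi
```

This catches the out-of-space case up front instead of letting the indexer thrash for days.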

2. What did you want to happen? Build the index files, including the GBWT, XG, and so on.

3. What actually happened? It has been running for three days with no new updates. (Screenshots attached.)

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

[index_473734.txt](https://github.com/vgteam/vg/files/13575843/index_473734.txt)

5. What data and command can the vg dev team use to make the problem happen?

vcf: https://ftp.ensembl.org/pub/release-110/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz
reference fasta: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/GRCh38.primary_assembly.genome.fa.gz
gtf: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz

6. What does running vg version say?

vg version v1.52.0 "Bozen"
Compiled with g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 on Linux
Linked against libstd++ 20230528
Built by jeizenga@emerald
jeizenga commented 7 months ago

Have you checked whether you have sufficient disk space to save the indexes? Either way, this thrashing behavior isn't good, but I expect that something is preventing the files from being saved.

It might help me diagnose the issue if you included the full error output instead of just the end of it.

@jltsiren I think we might be hitting a failure case in the robustness code I implemented for the GBWT buffer size (e.g. https://github.com/vgteam/vg/blob/master/src/index_registry.cpp#L2686-L2733). I might need to add a distinct error code like I did in GCSA2.

linfanxiao commented 6 months ago

Hi @jeizenga, this is the complete error output of the command. We have allocated 2 TB of storage space for saving temporary files, but the process has not yet ended, and there are no significant error messages in the output. It's quite unusual. index_473734.txt

jeizenga commented 6 months ago

With this behavior, I don't think it will finish. It looks to me like it's stuck in some kind of thrashing behavior, so you will probably have to kill the job. Before you do that though, can you check the size of the intermediate files in the working directory (supplied with -T) using du -h? And also the size of the outputs (starting with --prefix)? That should clarify whether it's a disk use issue. When you kill the process, the temporary files will be deleted, so be sure to check the disk usage first.
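The checks suggested above can be sketched as a small helper (the function name is hypothetical; `tmp/` and the `grch38` prefix come from the job script earlier in the thread):

```shell
#!/bin/bash
# Hedged sketch of the disk-usage checks suggested above.
# workdir = the -T temp directory, prefix = the --prefix value.
report_usage() {
    local workdir=$1 prefix=$2
    du -sh "$workdir"                     # total size of intermediate files
    du -sh "${prefix}"* 2>/dev/null \
        || echo "no outputs with prefix ${prefix} yet"
    df -h "$workdir"                      # free space on that filesystem
}

# Example: report_usage tmp/ grch38
```

Running this before killing the job preserves the disk-usage evidence, since the temporary files are deleted on exit.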

linfanxiao commented 6 months ago

[yc07671@login-0-0 VG]$ du -h tmp/
1.2G    tmp/vg-jGUbiA/dir-zf5FfW
469M    tmp/vg-jGUbiA/dir-1PYTcu
63M tmp/vg-jGUbiA/dir-Z932da
46M tmp/vg-jGUbiA/dir-EUHmY9
552K    tmp/vg-jGUbiA/dir-oDPhh9
312K    tmp/vg-jGUbiA/dir-sr8J43
99M tmp/vg-jGUbiA/dir-pmlo2R
440K    tmp/vg-jGUbiA/dir-raMbpu
332K    tmp/vg-jGUbiA/dir-51SQfM
130M    tmp/vg-jGUbiA/dir-bw3612
312K    tmp/vg-jGUbiA/dir-UYyUuv
432K    tmp/vg-jGUbiA/dir-87im8g
260K    tmp/vg-jGUbiA/dir-7OxT9r
308K    tmp/vg-jGUbiA/dir-8kFZy1
532K    tmp/vg-jGUbiA/dir-FMUkdY
166M    tmp/vg-jGUbiA/dir-2LAuku
352K    tmp/vg-jGUbiA/dir-ne4BLa
112M    tmp/vg-jGUbiA/dir-ED1QaA
144K    tmp/vg-jGUbiA/dir-ZXBGkv
155M    tmp/vg-jGUbiA/dir-AkPSXY
116K    tmp/vg-jGUbiA/dir-N3tepj
131M    tmp/vg-jGUbiA/dir-0h903d
132K    tmp/vg-jGUbiA/dir-MlSwE8
168K    tmp/vg-jGUbiA/dir-dX4yN3
216K    tmp/vg-jGUbiA/dir-P1DVVo
168K    tmp/vg-jGUbiA/dir-ULhEQm
160K    tmp/vg-jGUbiA/dir-8x419R
144K    tmp/vg-jGUbiA/dir-MLKyis
164K    tmp/vg-jGUbiA/dir-bUEyyC
152K    tmp/vg-jGUbiA/dir-ruq8kl
50M tmp/vg-jGUbiA/dir-LerJ35
172K    tmp/vg-jGUbiA/dir-n2kSzE
188K    tmp/vg-jGUbiA/dir-VTVKnk
135M    tmp/vg-jGUbiA/dir-qPbIp3
208K    tmp/vg-jGUbiA/dir-SYScTa
1.9M    tmp/vg-jGUbiA/dir-45HqzM
185M    tmp/vg-jGUbiA/dir-LpFHIq
57M tmp/vg-jGUbiA/dir-xsqhxd
76G tmp/vg-jGUbiA/dir-5hpCIS
86G tmp/vg-jGUbiA
88G tmp/

Hi @jeizenga, all the files in the tmp directory are temporary files, and there are no files with the output prefix. I have deleted all of the tmp files. I will consider building the indexes one by one to investigate the cause of the issue. If you could provide a pre-built reference indexed pangenome or pantranscriptome, such as hg38, that would be even better for me. Thanks for your consideration and time!

jeizenga commented 6 months ago

I just merged a PR that should help us determine what's happening. If you rebuild with the current master branch, you can try again to see if the issue is fixed. At a minimum, I expect that the thrashing behavior where it repeatedly tries to re-make the GBWT should be gone. The indexing may still fail, but we should at least get a more informative error message.