Open · conJUSTover opened 1 year ago
I have a similar problem too, but I am using version v1.48.0 "Gallipoli".
I am also encountering this problem, and have spent much of the afternoon debugging it. Specifically, I am getting numerous warnings along the lines of "warning[vg::Watchdog]: Thread 56 has been checked in for 10 seconds processing: A00744:56:HWWK3DSXX:2:1154:5873:32800, A00744:56:HWWK3DSXX:2:1154:5873:32800".
I am using the .d9 indexes of the minigraph-cactus v1.1 human pangenome. These are the recommended allele-filtered graphs per the HPP-resources repository.
I understand that warnings and runtime issues similar to these were observed in non-allele-filtered GRCh38-based graphs. Maybe there is an issue in the v1.1 graph?
Edited to add more info:
Would you expect surjecting with an xg index to produce different results than the gbz index? And is there a method to convert gbz -> xg in a more memory efficient manner? (I remember 1.0's xg index was around 4GB...)
Update @michael-olufemi and others: You can create a small xg index of the v1.1 pangenome that only contains reference haplotypes using the command:
ref_graph=hprc-v1.1-mc-chm13.d9.gbz # or hprc-v1.1-mc-grch38.d9.gbz
threads=4
vg convert -t $threads -x --drop-haplotypes $ref_graph > ${ref_graph%.gbz}.ref_only.xg
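As a side note on the naming in that command, the `${ref_graph%.gbz}` parameter expansion strips the `.gbz` suffix before appending `.ref_only.xg`. A pure-shell illustration (no vg needed), just to make the output filename explicit:

```shell
# The %.gbz expansion removes the trailing .gbz before .ref_only.xg is appended
ref_graph=hprc-v1.1-mc-chm13.d9.gbz
out=${ref_graph%.gbz}.ref_only.xg
echo "$out"   # hprc-v1.1-mc-chm13.d9.ref_only.xg
```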
This morning, I ran vg surject three times. First, I ran vg surject with either the hprc-v1.1-mc-chm13.d9.gbz index or the hprc-v1.1-mc-chm13.d9.ref_only.xg index produced by the above command. I ran these two commands with 96 CPUs and unlimited memory in a locally compiled vg v1.49 Docker container. I surjected HG006.gam, a previously created GAM file made by aligning DeepVariant's HG006 reference fastqs to the hprc-v1.1-mc-chm13.d9.gbz draft pangenome index with vg giraffe. Afterwards, I ran the same command using the xg index I described last night, which was created without the --drop-haplotypes option.
Unfortunately for benchmarking purposes, these numbers cover both surjection time and the following samtools post-processing pipeline.1
vg surject \
-A -C 0 --bam-output --max-frag-len 3000 --prune-low-cplx --interleaved \
--into-paths ref_chroms.txt \
--xg-name $xg_or_gbz_index \
-t 96 \
--sample HG006 \
HG006.gam > HG006.bam
samtools fixmate -u -m --threads 4 HG006.bam - \
| samtools sort -@ 4 -m 1073741824 -u - \
| samtools markdup -u --threads 4 -f HG006.markdup.log -d 2500 --no-multi-dup - - \
| samtools reheader -P -c 'sed "s/CHM13#0#chr/chr/"' - \
| samtools view --write-index -C --reference CHM13v1.1_Y.fasta -o HG006.cram -
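For anyone puzzling over the reheader step above: the sed expression simply strips the CHM13 path prefix from contig names. A quick sanity check on a made-up contig name (the name here is illustrative, not from my actual header):

```shell
# Made-up contig name showing what the reheader sed expression does:
name='CHM13#0#chr1'
new=$(printf '%s\n' "$name" | sed 's/CHM13#0#chr/chr/')
echo "$new"   # chr1
```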
Results
Index | # of warning messages | Peak memory | CPU-hours |
---|---|---|---|
GBZ | 18 | 24.1 GB | 89.4 |
XG (full) | 0 | 100.7 GB | 32.0 |
XG (w/ dropped haplotypes) | 0 | 7.4 GB | 36.0 |
I think there were fewer warning messages in this run because I was not running anything other than the two surjections on our genomics server at the time.
The cram files produced when using any of these indexes were largely identical2.
Personally, I will be using the xg index during surjection in my future runs.
1 The vast majority of the compute time was surjection. I normally run this command as one pipe and do not write an intermediate BAM file; that is why I ran this morning's benchmarks with the samtools post-surjection processing commands included. I am temporarily writing an intermediate BAM file because the lengthy per-read processing time the vg Watchdog warns about was causing intermittent pauses in vg surject's output. These pauses were breaking the vg -> samtools pipe: by the time output resumed, the downstream process had closed the pipe, and vg surject crashed with a SIGPIPE. I haven't tested piping the xg index output to samtools, but I anticipate that vg surject's output should remain steady and piping should work well.
2 Samtools flagstat produced identical reports on the resulting crams, while samtools stats showed a difference of one (out of several hundred thousand) at a handful of depth statistics.
Thanks for your update @JosephLalli, but this unfortunately doesn't solve the issue I described. Even producing the GAM files stalls, so I think your issue and mine may be separate bugs/problems. Hopefully the developers can look into this soon 😬
Giraffe does not work well with complex graphs. In the ideal case, the graph should avoid complex structures at every level of detail, while simultaneously minimizing sequence duplication. You can run the following command with the distance index to get a quick measure of complexity:
vg stats -b graph.dist | sort -nr -k 3 | head -n 5
The command extracts some statistics about snarls and prints them for the five largest snarls. The third column in the output is the size of the snarl in nodes.
If you see substantially higher numbers, graph complexity is likely the issue, and you need to find a way of building a better graph if you want to use Giraffe.
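As an illustration of reading that output, here is the same sort applied to made-up snarl statistics (both the numbers and the 1000-node threshold are invented for the example; a real run would pipe `vg stats -b graph.dist` instead of the printf):

```shell
# Made-up snarl stats standing in for `vg stats -b graph.dist` output;
# column 3 is the size of the snarl in nodes, as described above.
big=$(printf '12 40 5\n100 220 18\n300 9000 5000\n' \
      | sort -nr -k 3 \
      | awk -v t=1000 '$3 > t' \
      | wc -l)
echo "$big snarl(s) larger than 1000 nodes"
```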
I have the exact same issue when dealing with v1 of the human pangenome, with --named-coordinates and outputting to GAF, paired-end, with autoindex-built indexes. Watchdog keeps telling me one of the threads took more than xxx seconds before it finally checked out. I made the gbz, min, and dist files read-only and ran it several times.
Hi, maybe this helps: from what I can tell, vg giraffe hits these errors when multiple runs access the same index files.
My situation was that I have a single graph (with three files graph.dist, graph.giraffe.gbz, graph.min) and multiple short-read samples that I want to align.
When I run a single sample I get steady progress. When I run my samples in parallel (using a Slurm array job) I get almost no progress and lots of watchdog errors. When I copy the three graph files to unique names for each sample in each array job, I get steady progress again.
So it feels like there is some conflict in accessing the three graph files when running vg in parallel? I've got no idea how or why, but having separate graph files for each sample lets me run normally. Not something I've seen mentioned?
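The per-sample copy workaround described above can be sketched as follows. The index files created here are empty placeholders; in a real Slurm array job they would be the actual graph.gbz / graph.min / graph.dist, and the sample name would come from something like $SLURM_ARRAY_TASK_ID:

```shell
# Sketch of the per-sample copy workaround (placeholder files stand in
# for the real indexes).
set -eu
workdir=$(mktemp -d)
cd "$workdir"
touch graph.gbz graph.min graph.dist    # placeholders for the real indexes
sample=sampleA                          # e.g. derived from the array task ID
for ext in gbz min dist; do
  cp "graph.$ext" "graph.$sample.$ext"  # private copy for this job
done
# vg giraffe would then point at the private copies, e.g.:
# vg giraffe -Z graph.$sample.gbz -m graph.$sample.min -d graph.$sample.dist ...
```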
Hey Andy, thank you so much for your information. I was debugging today to try to figure out what went wrong, and I had exactly the same issue: it went well running one sample at a time a week ago, but hangs while running multiple. That was my guess as well. I am using the IBM LSF scheduler with a Docker environment. One more thing I could add: if I run one sample at a time, memory usage is stable, around 30-40 GB. But if I run multiple samples at the same time, I get an out-of-memory error even when I assign 100 GB.
Fascinating, glad to hear I'm not the only one! Feels like something is slightly off in how multiple vg instances interact with the index files? Copying works for me as I have a small graph, but I imagine that becomes a bit of an overhead for larger graphs and lots of samples.
I haven't checked the memory usage from single vs multiples but will try that too! Thanks for the note.
If it helps for troubleshooting I'm running the vg binary executable with the following details:
> vg version
vg version v1.49.0 "Peschici"
Compiled with g++ (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0 on Linux
Linked against libstd++ 20220421
Built by anovak@octagon
Did you change the distance index to read-only permissions? This helped me, but I am still having trouble with very slow read alignment.
I am using the entire pangenome graph; therefore, making several copies is not feasible for me. What does work for me is creating several hard links to the .gbz, .min, and .dist index files, and when I run jobs in parallel, each vg instance points to a different hard link. Hope it helps a little.
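A minimal sketch of that hard-link approach, with empty placeholder files standing in for the real .gbz/.min/.dist indexes. Each parallel job opens the same on-disk data through its own link name, so unlike copying, no extra disk space is used:

```shell
# Sketch of the hard-link workaround (placeholder files stand in
# for the real graph.gbz / graph.min / graph.dist).
set -eu
workdir=$(mktemp -d)
cd "$workdir"
touch graph.gbz graph.min graph.dist
for sample in s1 s2; do                   # one link set per parallel job
  for ext in gbz min dist; do
    ln "graph.$ext" "graph.$sample.$ext"  # hard link: same data, new name
  done
done
# The original plus two links means three directory entries for the same file
# (stat -c is GNU coreutils; the fallback covers BSD/macOS stat):
links=$(stat -c %h graph.gbz 2>/dev/null || stat -f %l graph.gbz)
echo "graph.gbz has $links names"
```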
Any new update for this issue?
1. What were you trying to do? Use vg giraffe to map WGS reads to a graph.
2. What did you want to happen? The program to run to completion without issue, or at least provide some debugging output.
Here is the command I used: vg giraffe -p -t 16 -Z $ref_genome.giraffe.gbz -m $ref_genome.min -d $ref_genome.dist -f $fastq_dir/${sample}_1P.fq.gz -f $fastq_dir/${sample}_2P.fq.gz -o BAM > $sample.bam
I built the .gbz file using autoindex with my reference fasta file and a VCF file I created from whole-genome alignments I generated using AnchorWave and GATK.
3. What actually happened? On some of my samples, the program ran just fine. On others, the program runs for ~15 minutes, then hangs in perpetuity (i.e., over a day). While it hangs, it throws several errors of the general format
I'm happy to share the fastq reads, .min, .dist (which I've converted to read-only as suggested by #3954), and the .gbz file. Just let me know the best way to get them to you.
I've also run some of the basic diagnostics suggested by #3954, but nothing seems out of the ordinary.
vg gbwt -M -Z graph.gbz
1163 paths with names, 7 samples with names, 7 haplotypes, 1103 contigs with names

vg paths -L -G -x graph.gbz | wc -l
1103

vg gbwt --tags -Z graph.gbz
source jltsiren/gbwt

vg stats -l graph.gbz
length 700757234

(The graph is composed of two species that have genome sizes of ~600MB and ~500MB respectively, so this is pretty close to expectations.)

4. If you got a line like "Stack trace path: /somewhere/on/your/computer/stacktrace.txt", please copy-paste the contents of that file here:

5. What data and command can the vg dev team use to make the problem happen?
I also tried to run this without projecting to a BAM output, but the same problem occurred.
6. What does running vg version say?