vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

vg giraffe hangs on some samples, works fine on others #4020

Open conJUSTover opened 1 year ago

conJUSTover commented 1 year ago

1. What were you trying to do? Use vg giraffe to map wgs reads to a graph.

2. What did you want to happen? The program to run to completion without issue, or at least provide some debugging output.

Here is the command I used: vg giraffe -p -t 16 -Z $ref_genome.giraffe.gbz -m $ref_genome.min -d $ref_genome.dist -f $fastq_dir/$sample_1P.fq.gz -f $fastq_dir/$sample_2P.fq.gz -o BAM > $sample.bam

I built the .gbz file using autoindex with my reference fasta file and a vcf file I created from whole-genome alignments generated with AnchorWave and GATK.
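For reference, the autoindex step was along these lines (a sketch; the fasta/vcf names are placeholders, and the output prefix matches the index names used above):

    # Roughly how the indexes were built with vg autoindex (file names are placeholders)
    vg autoindex --workflow giraffe \
        -r reference.fa \
        -v variants.vcf.gz \
        -p $ref_genome \
        -t 16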

3. What actually happened? On some of my samples, the program ran just fine. On others, it runs for ~15 minutes and then hangs in perpetuity (i.e. for over a day). While it hangs, it emits several warnings of the general format:

Preparing Indexes
Loading Minimizer Index
Loading GBZ
Loading Distance Index v2
Paging in Distance Index v2
Initializing MinimizerMapper
Loading and initialization: 15.056 seconds
Of which Distance Index v2 paging: 0.796897 seconds
Mapping reads to "-" (GAM)
--watchdog-timeout 10
--max-multimaps 1
--hit-cap 10
--hard-hit-cap 500
--score-fraction 0.9
--max-min 500
--num-bp-per-min 1000
--distance-limit 200
--max-extensions 800
--max-alignments 8
--cluster-score 50
--pad-cluster-score 20
--cluster-coverage 0.3
--extension-score 1
--extension-set 20
--rescue-attempts 15
--max-fragment-length 2000
--paired-distance-limit 2
--rescue-subgraph-size 4
--rescue-seed-limit 100
--chaining-cluster-distance 100
--precluster-connection-coverage-threshold 0.3
--min-precluster-connections 10
--max-precluster-connections 50
--max-lookback-bases 100
--min-lookback-items 1
--lookback-item-hard-cap 15
--chain-score-threshold 100
--min-chains 1
--chain-min-score 100
--max-chain-connection 100
--max-tail-length 100
--max-dp-cells 16777216
--rescue-algorithm dozeu
Using fragment length estimate: 440.232 +/- 18.0647
warning[vg::Watchdog]: Thread 2 has been checked in for 10 seconds processing: SRR13661969.262417, SRR13661969.262417
warning[vg::Watchdog]: Thread 7 has been checked in for 10 seconds processing: SRR13661969.512935, SRR13661969.512935
warning[vg::Watchdog]: Thread 13 has been checked in for 10 seconds processing: SRR13661969.981065, SRR13661969.981065
warning[vg::Watchdog]: Thread 11 has been checked in for 10 seconds processing: SRR13661969.997731, SRR13661969.997731
warning[vg::Watchdog]: Thread 12 has been checked in for 10 seconds processing: SRR13661969.1719281, SRR13661969.1719281
warning[vg::Watchdog]: Thread 3 has been checked in for 10 seconds processing: SRR13661969.1434530, SRR13661969.1434530
warning[vg::Watchdog]: Thread 8 has been checked in for 10 seconds processing: SRR13661969.1521525, SRR13661969.1521525
warning[vg::Watchdog]: Thread 7 finally checked out after 88 seconds and 54736 kb memory growth processing: SRR13661969.512935, SRR13661969.512935
warning[vg::Watchdog]: Thread 12 finally checked out after 69 seconds and 0 kb memory growth processing: SRR13661969.1719281, SRR13661969.1719281
warning[vg::Watchdog]: Thread 4 has been checked in for 10 seconds processing: SRR13661969.1897972, SRR13661969.1897972
warning[vg::Watchdog]: Thread 0 has been checked in for 10 seconds processing: SRR13661969.1938190, SRR13661969.1938190
warning[vg::Watchdog]: Thread 14 has been checked in for 10 seconds processing: SRR13661969.1939422, SRR13661969.1939422
warning[vg::Watchdog]: Thread 10 has been checked in for 10 seconds processing: SRR13661969.2055872, SRR13661969.2055872
warning[vg::Watchdog]: Thread 1 has been checked in for 10 seconds processing: SRR13661969.2128252, SRR13661969.2128252
warning[vg::Watchdog]: Thread 15 has been checked in for 10 seconds processing: SRR13661969.2200416, SRR13661969.2200416
warning[vg::Watchdog]: Thread 12 has been checked in for 10 seconds processing: SRR13661969.3135285, SRR13661969.3135285
warning[vg::Watchdog]: Thread 5 has been checked in for 10 seconds processing: SRR13661969.2372740, SRR13661969.2372740
warning[vg::Watchdog]: Thread 0 finally checked out after 94 seconds and 0 kb memory growth processing: SRR13661969.1938190, SRR13661969.1938190
warning[vg::Watchdog]: Thread 15 finally checked out after 277 seconds and 0 kb memory growth processing: SRR13661969.2200416, SRR13661969.2200416

I'm happy to share the fastq reads, .min, .dist (which I've converted to read-only as suggested by #3954), and the .gbz file. Just let me know the best way to get them to you.
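For reference, the read-only conversion from #3954 was just a permissions change, roughly:

    # Drop write permission on the distance index, as suggested in #3954 (a minimal sketch)
    chmod a-w $ref_genome.dist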

I've also run some of the basic diagnostics suggested by #3954, but nothing seems out of the ordinary.

vg gbwt -M -Z graph.gbz
1163 paths with names, 7 samples with names, 7 haplotypes, 1103 contigs with names

vg paths -L -G -x graph.gbz | wc -l
1103

vg gbwt --tags -Z graph.gbz
source jltsiren/gbwt

vg stats -l graph.gbz
length 700757234

(The graph is composed of two species that have genome sizes of ~600MB and ~500MB respectively, so this is pretty close to expectations.)

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

5. What data and command can the vg dev team use to make the problem happen?

vg giraffe -p -t 16 -Z $ref_genome.giraffe.gbz -m $ref_genome.min -d $ref_genome.dist -f $fastq_dir/$sample_1P.fq.gz -f $fastq_dir/$sample_2P.fq.gz -o BAM > $sample.bam

I also tried to run this without projecting it to a bam output, but the same problem occurred.
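That run was roughly the same command, just without -o BAM so giraffe writes its default GAM output (same variables as above):

    # Same command without the BAM projection; giraffe's default output is GAM
    vg giraffe -p -t 16 -Z $ref_genome.giraffe.gbz -m $ref_genome.min -d $ref_genome.dist \
        -f $fastq_dir/$sample_1P.fq.gz -f $fastq_dir/$sample_2P.fq.gz > $sample.gam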

6. What does running vg version say?

vg version v1.47.0 "Ostuni"
Compiled with g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 on Linux
Linked against libstd++ 20210601
Built by anovak@octagon
michael-olufemi commented 1 year ago

I have a similar problem too, but I am using version v1.48.0 "Gallipoli".

JosephLalli commented 1 year ago

I am also encountering this problem, and have spent much of the afternoon debugging it. Specifically, I am getting numerous warnings along the lines of "warning[vg::Watchdog]: Thread 56 has been checked in for 10 seconds processing: A00744:56:HWWK3DSXX:2:1154:5873:32800, A00744:56:HWWK3DSXX:2:1154:5873:32800".

I am using the .d9 indexes of the minigraph-cactus v1.1 human pangenome. These are the recommended allele-filtered graphs per the HPP-resources repository.

I understand that warnings and runtime issues similar to these were observed in non-allele-filtered, GRCh38-based graphs. Maybe there is an issue in the v1.1 graph?

Edited to add more info:

Would you expect surjecting with an xg index to produce different results than with the gbz index? And is there a way to convert gbz -> xg in a more memory-efficient manner? (I remember the v1.0 xg index was around 4 GB...)

JosephLalli commented 1 year ago

Update @michael-olufemi and others: You can create a small xg index of the v1.1 pangenome that only contains reference haplotypes using the command:

ref_graph=hprc-v1.1-mc-chm13.d9.gbz # or hprc-v1.1-mc-grch38.d9.gbz
threads=4
vg convert -t $threads -x --drop-haplotypes $ref_graph > ${ref_graph%.gbz}.ref_only.xg

This morning, I ran vg surject three times. First, I ran it with either the hprc-v1.1-mc-chm13.d9.gbz index or the hprc-v1.1-mc-chm13.d9.ref_only.xg index produced by the command above. I ran these two commands with 96 CPUs and unlimited memory in a locally compiled vg v1.49 docker container. I surjected HG006.gam, a previously created GAM file made by aligning DeepVariant's HG006 reference fastqs to the hprc-v1.1-mc-chm13.d9.gbz draft pangenome index with vg giraffe. Afterwards, I ran the same command using the xg index I described last night, which was created without the --drop-haplotypes option.

Unfortunately for benchmarking purposes, these numbers cover both surjection time and the samtools post-processing pipeline below.[1]

vg surject \
      -A -C 0 --bam-output --max-frag-len 3000 --prune-low-cplx --interleaved \
      --into-paths ref_chroms.txt \
      --xg-name $xg_or_gbz_index \
      -t 96 \
      --sample HG006 \
      HG006.gam > HG006.bam 

samtools fixmate -u -m --threads 4 HG006.bam - \
| samtools sort -@ 4 -m 1073741824 -u - \
| samtools markdup -u --threads 4 -f HG006.markdup.log -@ 4 -d 2500 --no-multi-dup - -  \
| samtools reheader -P -c 'sed "s/CHM13#0#chr/chr/"' - \
| samtools view --write-index -C --reference CHM13v1.1_Y.fasta -o HG006.cram -

Results


| Index | # of warning messages | Peak memory | CPU-hours |
| --- | --- | --- | --- |
| GBZ | 18 | 24.1 GB | 89.4 |
| XG (full) | 0 | 100.7 GB | 32.0 |
| XG (w/ dropped haplotypes) | 0 | 7.4 GB | 36.0 |

I think there were fewer warning messages in this run because I was not running anything other than the two surjections on our genomics server at the time.

The cram files produced when using any of these indexes were largely identical.[2]

Personally, I will be using the xg index during surjection in my future runs.


[1] The vast majority of the compute time was surjection. I normally run this command as one pipe and do not write an intermediate bamfile; that is why I ran this morning's benchmarks with the samtools post-surjection processing commands included. I am temporarily writing an intermediate bamfile because the lengthy per-read processing that the vg Watchdog warns about was causing intermittent pauses in vg surject's output. These pauses were breaking the vg -> samtools pipe: by the time output resumed, the pipe had been closed, and vg surject crashed with a SIGPIPE. I haven't tested piping the xg-index output to samtools, but I anticipate that vg surject's output should remain steady and that piping should work well.

[2] Samtools flagstat produced identical reports on the resulting crams, while samtools stats showed a difference of one (out of several hundred thousand) at a handful of depth statistics.

conJUSTover commented 1 year ago

Thanks for your update @JosephLalli, but this unfortunately doesn't solve the issue I described. Even producing the GAM files stalls, so I think your issue and mine may be separate bugs/problems. Hopefully the developers can look into this soon 😬

jltsiren commented 1 year ago

Giraffe does not work well with complex graphs. In the ideal case, the graph should avoid complex structures at every level of detail, while simultaneously minimizing sequence duplication. You can run the following command with the distance index to get a quick measure of complexity:

vg stats -b graph.dist | sort -nr -k 3 | head -n 5

The command extracts some statistics about snarls and prints them for the five largest snarls. The third column in the output is the size of the snarl in nodes.

If you see substantially higher numbers, graph complexity is likely the issue, and you need to find a way of building a better graph if you want to use Giraffe.

Hendricks27 commented 1 year ago

I have the exact same issue when dealing with v1 of the human pangenome, using --named-coordinates and outputting to GAF, paired-end, with indexes built by autoindex. The Watchdog keeps telling me that one of the threads took more than xxx seconds before it finally checked out. I made the gbz, min, and dist files read-only and ran it several times.

Andy-B-123 commented 1 year ago

Hi, maybe this helps: from what I can tell, vg giraffe hits these errors when multiple runs are accessing the same index files.

My situation was that I have a single graph (with three files graph.dist, graph.giraffe.gbz, graph.min) and multiple short-read samples that I want to align. 

When I run a single sample I get steady progress. When I run my samples in parallel (using a slurm array job) I get almost no progress and lots of watchdog warnings. When I copy the three graph files to unique names for each array job, I get steady progress again.

So it feels like there is some conflict in accessing the three graph files when running vg in parallel? I've got no idea how or why, but having separate graph files for each sample lets me run normally. Not something I've seen mentioned elsewhere?
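To illustrate, the per-array-task setup looks roughly like this (a sketch; graph and fastq names are placeholders, and SLURM_ARRAY_TASK_ID is the standard Slurm array variable):

    # Give each Slurm array task its own private copy of the three index files
    id=${SLURM_ARRAY_TASK_ID}
    cp graph.dist        graph.${id}.dist
    cp graph.giraffe.gbz graph.${id}.giraffe.gbz
    cp graph.min         graph.${id}.min

    # Then map as usual, but against the per-task copies (same giraffe options as above)
    vg giraffe -p -t 16 \
        -Z graph.${id}.giraffe.gbz -m graph.${id}.min -d graph.${id}.dist \
        -f sample_${id}_1P.fq.gz -f sample_${id}_2P.fq.gz \
        -o BAM > sample_${id}.bam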

Hendricks27 commented 1 year ago

> Hi, maybe this helps: from what I can tell, vg giraffe hits these errors when multiple runs are accessing the same index files.
>
> My situation was that I have a single graph (with three files graph.dist, graph.giraffe.gbz, graph.min) and multiple short-read samples that I want to align.
>
> When I run a single sample I get steady progress. When I run my samples in parallel (using a slurm array job) I get almost no progress and lots of watchdog warnings. When I copy the three graph files to unique names for each array job, I get steady progress again.
>
> So it feels like there is some conflict in accessing the three graph files when running vg in parallel? I've got no idea how or why, but having separate graph files for each sample lets me run normally. Not something I've seen mentioned elsewhere?

Hey Andy, thank you so much for your information. I was debugging today to try to figure out what went wrong, and I had exactly the same issue: it went well running one sample at a time a week ago, but hangs while running multiple samples. That was my guess as well. I am using the IBM LSF scheduler with a Docker environment. One more thing I can add: if I run one sample at a time, the memory usage is stable, around 30-40 GB, but if I run multiple samples at the same time, I get an out-of-memory error even if I assign 100 GB.

Andy-B-123 commented 1 year ago

Fascinating, glad to hear I'm not the only one! Feels like something is slightly off in how multiple vg instances interact with the index files? Copying works for me as I have a small graph, but I imagine that becomes a bit of an overhead for larger graphs and lots of samples.

I haven't checked the memory usage for single vs. multiple runs but will try that too! Thanks for the note.
If it helps for troubleshooting I'm running the vg binary executable with the following details:
> vg version
vg version v1.49.0 "Peschici"
Compiled with g++ (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0 on Linux
Linked against libstd++ 20220421
Built by anovak@octagon

IsaacDiaz026 commented 1 year ago

Did you change the distance index to read-only permissions? This helped me, but I am still having trouble with very slow read alignment.

Hendricks27 commented 1 year ago

> Fascinating, glad to hear I'm not the only one! Feels like something is slightly off in how multiple vg instances interact with the index files? Copying works for me as I have a small graph, but I imagine that becomes a bit of an overhead for larger graphs and lots of samples.
>
> I haven't checked the memory usage for single vs. multiple runs but will try that too! Thanks for the note. If it helps for troubleshooting I'm running the vg binary executable with the following details:
>
> vg version v1.49.0 "Peschici"
> Compiled with g++ (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0 on Linux
> Linked against libstd++ 20220421
> Built by anovak@octagon

I am using the entire pan-genome graph; therefore, making several copies is not feasible for me. What does work for me is creating several hard links to the gbz, min, and dist files (the indexes); when I run samples in parallel, each vg instance points to a different hard link. Hope it helps a little.
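Roughly like this (a sketch; graph and sample names are placeholders):

    # Create per-sample hard links to the shared indexes; hard links use no extra disk space
    for sample in sampleA sampleB sampleC; do
        ln graph.gbz  graph.${sample}.gbz
        ln graph.min  graph.${sample}.min
        ln graph.dist graph.${sample}.dist
    done
    # Each parallel vg giraffe job is then pointed at its own set of links, e.g.
    # vg giraffe -Z graph.sampleA.gbz -m graph.sampleA.min -d graph.sampleA.dist ...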

w-korani commented 1 month ago

Any new updates on this issue?