vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.09k stars 193 forks source link

vg deconstruct "Assertion `i < filled' failed." #3367

Closed ekg closed 3 years ago

ekg commented 3 years ago

1. What were you trying to do?

I was attempting to make phased VCFs from pggb graphs made for the HPRC.

Let's take chr22 as a test case: https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/scratch/2021_07_30_pggb/chroms/chr22.pan.fa.a2fb268.e820cd3.9ea71d8.smooth.gfa.gz

I used this script:

#!/bin/bash                                                                                                                                                                                                            

set -eo pipefail

input=$1
ref=$2
threads=$3

#echo "rendering GFA"                                                                                                                                                                                                  
gfa=$(basename $input .gz)
# fixme: hack; add "haplotype identifiers" to the reference paths                                                                                                                                                      
# so that all paths can be loaded into the GBWT with sample annotations                                                                                                                                                
zcat $input | sed 's/chm13#/chm13#1#/g' | sed 's/grch38#/grch38#1#/g' >$gfa

#echo "building the XG index for $gfa"                                                                                                                                                                                 
#xg=$(basename $input .og).xg                                                                                                                                                                                          
#TEMPDIR=$(pwd) vg convert -t $threads -g -x $gfa >$xg                                                                                                                                                                 
timer=/usr/bin/time
fmt="%C\n%Us user %Ss system %P cpu %es total %MKb max memory"

echo "building the GBWT index for $gfa"
gbwt=$(basename $input .gfa.gz).gbwt
TEMPDIR=$(pwd) $timer vg gbwt --num-threads $threads --num-jobs $threads -G -o $gbwt --path-regex '(.+?)#(.+?)#(.+?)' --path-fields 'CSH' $gfa

echo "building VCF from $gfa and $gbwt"
vcf=$(basename $input .gfa.gz).vcf
TEMPDIR=$(pwd) $timer vg deconstruct -a -P $ref -g $gbwt -t $threads $gfa >$vcf

rm $gfa
pigz $vcf
pigz $gbwt

Called this way:

vcf_extract_phased.sh chr22.pan.fa.a2fb268.e820cd3.9ea71d8.smooth.gfa.gz chm13 48

2. What did you want to happen?

I wanted to obtain a fully-phased VCF representing the chromosome graph.

3. What actually happened?

building the GBWT index for chr22.pan.fa.a2fb268.e820cd3.9ea71d8.smooth.gfa
286.93user 0.56system 4:11.03elapsed 114%CPU (0avgtext+0avgdata 6971404maxresident)k
79288inputs+145552outputs (3major+76154minor)pagefaults 0swaps
building VCF from chr22.pan.fa.a2fb268.e820cd3.9ea71d8.smooth.gfa and chr22.pan.fa.a2fb268.e820cd3.9ea71d8.smooth.gbwt
vg: /home/erik/vg/include/bdsg/internal/packed_structs.hpp:742: size_t bdsg::PackedDeque<Backend>::internal_index(const size_t&) const [with Backend = bdsg::STLBackend; size_t = long unsigned int]: Assertion `i < filled' failed.
ERROR: Signal 6 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug.
Stack trace path: /tmp/vg_crash_0NgRAa/stacktrace.txt
Please include the stack trace file in your bug report!
Command exited with non-zero status 134
1635.88user 4.73system 22:58.55elapsed 119%CPU (0avgtext+0avgdata 7987824maxresident)k

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

I'll re-run to generate this.

5. What data and command can the vg dev team use to make the problem happen?

See above.

6. What does running vg version say?

vg version v1.33.0-120-gabb490426 "Moscona"
Compiled with g++ (Ubuntu 10.3.0-1ubuntu1) 10.3.0 on Linux
Linked against libstd++ 20210408
Built by erik@copito
glennhickey commented 3 years ago

vg gbwt chops your nodes down to 1024bp, so your gbwt will not be id-compatible with your original graph.

I recommend:

Note that in the linked example, a translation is used so that the node-ids in the VCF refer to the original gfa (and not the chopped graph).

ekg commented 3 years ago

Thanks Glenn, I'll try out this workflow. I'm surprised that vg gbwt is doing the chopping!