vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.07k stars 191 forks source link

VCF file empty when calling SV on ONT data #4279

Open SwenDiepstraten opened 2 months ago

SwenDiepstraten commented 2 months ago

1. What were you trying to do? calling structural variation after mapping long-read data on whole genome pangenome graph

2. What did you want to happen? create a VCF file which contained the variants based on the reads

3. What actually happened? The VCF file stays empty

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

Place stacktrace here.

5. What data and command can the vg dev team use to make the problem happen? reads were aligned with GraphAligner: GraphAligner -t 32 -g whole_genome.gfa -f reads/LW1.fastq.gz -a mapped_LW.gam -x vg

And I wanted to call variants following the tutorial on https://github.com/vgteam/vg/wiki/SV-Genotyping-and-variant-calling

vg convert -g whole_genome.gfa -p -t 32 > output_WG.pg
vg augment output_WG.pg mapped_LW.gam -t 32 -m 4 -q 5 -Q 5 -A mapped_LW_aug.gam > mapped_LW_aug.pg
vg snarls -t 32 mapped_LW_aug.pg > mapped_LW_aug.snarls
vg pack -t 32 -x mapped_LW_aug.pg -g mapped_LW_aug.gam -o mapped_LW_aug.pack
vg call mapped_LW_aug.pg -t 32 -r mapped_LW_aug.snarls -k mapped_LW_aug.pack > mapped_LW_snarls.vcf

6. What does running vg version say?

vg version v1.55.0 "Bernolda"
Compiled with g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 on Linux
Linked against libstd++ 20230528
Built by jeizenga@mustard

Hopefully you can assist me!

glennhickey commented 2 months ago

I'd suggest not using vg augment -- just call the SVs directly on your graph.

Others have had issues where GraphAligner does not write mapping qualities (you can check your GAM with vg view -a). In this case, any MAPQ filter (vg pack -Q 5 or in your case vg augment -Q 5) will filter all the reads leading to no calls. Try

vg convert -g whole_genome.gfa -p -t 32 > output_WG.pg
vg snarls -t 32 output_WG.pg > mapped_LW_aug.snarls
vg pack -t 32 -x output_WG.pg -g mapped_LW.gam -o mapped_LW.pack
vg call output_WG.pg -t 32 -r mapped_LW.snarls -k mapped_LW.pack > mapped_LW_snarls.vcf
SwenDiepstraten commented 2 months ago

after running this code for over 4 hours it is still an empty file. the first couple steps took only a fraction, but the vcf file stays empty

SwenDiepstraten commented 2 months ago

I was also wondering how the vcf file is generated. Does the entire vcf have to be loaded into memory and is then pasted into the output file, or is it procedurally generated in the output?

glennhickey commented 1 month ago

It outputs the VCF all at once at the end. vg call can be very slow on some complex graphs. You can often manage this by using -C to limit the size of alt alleles to search for in the graph.

SwenDiepstraten commented 1 month ago

Do you know what would be a good cutoff value be when looking in a mammalian genome?

SwenDiepstraten commented 1 month ago

And what timeframe should I keep in mind for producing the VCF file, my gam file is 27.7 GB, the reads are 38.3 GB, my VG graph is 6.06 GB and contains 52.4 million nodes, 71.9 million edges and a total length of 2.7 billion.

Thank!

glennhickey commented 1 month ago

There's a --progress option that may help you judge where it is. Otherwise, the running time is extremely dependent on the graph. If you have many reference paths, using -p/-S to select a reference can help. If you have many haplotypes in general in your graph, you can convert it to gbz with vg gbwt and run call with -z to only explore these haplotypes (speeding up the search)

SwenDiepstraten commented 1 month ago

Unfortunately the --progress option is not available for me in vg call, but I will try to run with gbz and see how that goes

SwenDiepstraten commented 1 month ago

When trying to run with a gbz file, I encounter the following error, what could cause this? I am running the following code now:

vg gbwt -G /minigraph_cactus/output_WG/output_WG.gfa -o output_WG.gbwt -d temp -p vg gbwt -G /minigraph_cactus/output_WG/output_WG.gfa --graph-name output_WG.gbz --gbz-format -p

vg snarls -t 32 output_WG.gbz > mapped_LW.snarls vg pack -t 32 -x output_WG.gbz -g graphaligner/mapped_LW.gam -o mapped_LW.pack

vg call output_WG.gbz -z -t 32 -r mapped_LW.snarls -k mapped_LW.pack -C 100000 > mapped_C100000_LW_snarls_git.vcf

vg: /private/groups/patenlab/jeizenga/GitHub/vg/include/sdsl/int_vector.hpp:1391: sdsl::int_vector< >::reference sdsl::int_vector< >::operator[](const size_type&) [with unsigned char t_width = 0; sdsl::int_vector< >::reference = sdsl::int_vector_reference<sdsl::int_vector<0> >; sdsl::int_vector< >::size_type = long unsigned int]: Assertion `idx < this->size()' failed. ━━━━━━━━━━━━━━━━━━━━ Crash report for vg v1.55.0 "Bernolda" Stack trace (most recent call last) in thread 872463:

14 Object "", at 0xffffffffffffffff, in

13 Object "/lustre/nobackup/TOPIGS/shared/swen/cactus-bin-v2.8.0/bin/vg", at 0x2160633, in __clone

12 Object "/lustre/nobackup/TOPIGS/shared/swen/cactus-bin-v2.8.0/bin/vg", at 0x20b9d4a, in start_thread

11 Object "/lustre/nobackup/TOPIGS/shared/swen/cactus-bin-v2.8.0/bin/vg", at 0x205c5dd, in gomp_thread_start

10 Object "/lustre/nobackup/TOPIGS/shared/swen/cactus-bin-v2.8.0/bin/vg", at 0x205ef27, in gomp_team_barrier_wait_end

9 Object "/lustre/nobackup/TOPIGS/shared/swen/cactus-bin-v2.8.0/bin/vg", at 0x205682a, in gomp_barrier_handle_tasks

8 Object "/lustre/nobackup/TOPIGS/shared/swen/cactus-bin-v2.8.0/bin/vg", at 0xdbc505, in void vg::io::for_each_parallel_impl(std::istream&, std::function<void (vg::Alignment&, vg::Alignment&)> const&, std::function<void (vg::Alignment&)> const&, std::function<bool ()> const&, unsigned long) [clone ._omp_fn.1]

7 Object "/lustre/nobackup/TOPIGS/shared/swen/cactus-bin-v2.8.0/bin/vg", at 0x1272b8a, in vg::Packer::add(vg::Alignment const&, int, int, int)

6 Object "/lustre/nobackup/TOPIGS/shared/swen/cactus-bin-v2.8.0/bin/vg", at 0x126a338, in vg::Packer::increment_coverage(unsigned long)

5 Object "/lustre/nobackup/TOPIGS/shared/swen/cactus-bin-v2.8.0/bin/vg", at 0x1264a71, in sdsl::int_vector<(unsigned char)0>::operator[](unsigned long const&) [clone .isra.0]

4 Object "/lustre/nobackup/TOPIGS/shared/swen/cactus-bin-v2.8.0/bin/vg", at 0x2088545, in __assert_fail

3 Object "/lustre/nobackup/TOPIGS/shared/swen/cactus-bin-v2.8.0/bin/vg", at 0x5e6053, in __assert_fail_base.cold

2 Object "/lustre/nobackup/TOPIGS/shared/swen/cactus-bin-v2.8.0/bin/vg", at 0x5e612b, in abort

1 Object "/lustre/nobackup/TOPIGS/shared/swen/cactus-bin-v2.8.0/bin/vg", at 0x208eb55, in raise

0 Object "/lustre/nobackup/TOPIGS/shared/swen/cactus-bin-v2.8.0/bin/vg", at 0x20bb56c, in __pthread_kill

ERROR: Signal 6 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug. Please include this entire error log in your bug report! ━━━━━━━━━━━━━━━━━━━━