vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.1k stars 194 forks source link

`vg construct`: Assertion of `!variant->isSymbolicSV()` fails in the wild, and error message does not indicate anything useful about the offending variant for debugging #3838

Closed Zer0day-0 closed 1 year ago

Zer0day-0 commented 1 year ago

1. What were you trying to do? I am trying to generate vg graphs using GRCH37 Reference sequence(FASTA) and 1000Genome Project all chromosome vcf data. I am following the steps of the tutorial provided at this vg-wiki link.

2. What did you want to happen? I expected to see a

constructing graph 

real    0m11.552s
user    0m10.487s
sys 0m0.296s
Restricting to 21 from 1 to end

for all of the chromosomes.

3. What actually happened? Most of the chromosomes ran without error. However, some of them showed the following output.

Restricting to Y from 1 to end

real    0m11.552s
user    0m10.487s
sys 0m0.296s
Restricting to 21 from 1 to end

real    0m53.968s
user    0m52.247s
sys 0m0.867s
Restricting to 3 from 1 to end
warning:[vg::Constructor] Unsupported IUPAC ambiguity codes found in 3; coercing to N.
vg: src/constructor.cpp:528: vg::ConstructedChunk vg::Constructor::construct_chunk(std::string, std::string, std::vector<vcflib::Variant>, size_t) 
const: Assertion `!variant->isSymbolicSV()' failed.
ERROR: Signal 6 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug.
Stack trace path: /tmp/vg_crash_cyRE1m/stacktrace.txt
Please include the stack trace file in your bug report!

real    1m10.802s
user    1m8.920s
sys 0m1.150s
Restricting to 22 from 1 to end

real    1m12.112s
user    1m10.591s
sys 0m0.929s
Restricting to 19 from 1 to end

real    1m12.443s
user    1m10.736s
sys 0m1.080s
Restricting to 15 from 1 to end

real    1m52.271s
user    1m49.468s
sys 0m2.130s
Restricting to 17 from 1 to end
warning:[vg::Constructor] Skipping duplicate variant with hash 7c41b008d29662e4684e6a50d48166d129389c3f at 17:1144632
warning:[vg::Constructor] Skipping duplicate variant with hash 9628041c1009eb3fdcf82cc912cb65ae7c3d6420 at 17:1144632

real    2m1.370s
user    1m58.713s
sys 0m1.791s
Restricting to 9 from 1 to end
Warning: multiple ALT alleles not yet allowed for SVs
Warning: multiple ALT alleles not yet allowed for SVs
Warning: multiple ALT alleles not yet allowed for SVs
Warning: multiple ALT alleles not yet allowed for SVs

real    2m47.857s
user    2m44.308s
sys 0m2.809s
Restricting to X from 1 to end
warning:[vg::Constructor] Skipping duplicate variant with hash 0db860eda6b6b6b5e1147774ad0c8d709945fce6 at X:5375295
warning:[vg::Constructor] Skipping duplicate variant with hash ee5d8472f5ef5c1ecf62bf4885610e9fc8c95787 at X:5375295
warning:[vg::Constructor] Skipping duplicate variant with hash 27a60ae61c038f8b4c653a12459e9586934f415c at X:32362662
warning:[vg::Constructor] Skipping duplicate variant with hash eec2b4fee5a3c8ff61e7ade8b49c199acdd36f13 at X:32362662
warning:[vg::Constructor] Skipping duplicate variant with hash a920a20970e2f3b150740376ad741f8f8a7b80bd at X:68204280
warning:[vg::Constructor] Skipping duplicate variant with hash 92f6ebe4c59e0a70a413360eb08b7fd40a89fcb6 at X:68204280

real    2m54.069s
user    2m50.425s
sys 0m2.847s
Restricting to 8 from 1 to end
warning:[vg::Constructor] Skipping duplicate variant with hash 58aa532e17c7148545cdd8e2311b835457c08f78 at 8:89329148
warning:[vg::Constructor] Skipping duplicate variant with hash 2875e0ad474338f86f0877a40007f8f98d70ebec at 8:89329148
Warning: multiple ALT alleles not yet allowed for SVs

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

Crash report for vg v1.45.0 "Alpicella"
Stack trace (most recent call last):
#12   Object "/data/sata_data/home/dl8/anaconda3/envs/pangenome/bin/vg", at 0x5e609d, in _start
#11   Object "/data/sata_data/home/dl8/anaconda3/envs/pangenome/bin/vg", at 0x1e83eef, in __libc_start_main
#10   Object "/data/sata_data/home/dl8/anaconda3/envs/pangenome/bin/vg", at 0x5b6a6e, in main
#9    Object "/data/sata_data/home/dl8/anaconda3/envs/pangenome/bin/vg", at 0xcef6cb, in vg::subcommand::Subcommand::operator()(int, char**) const
#8    Object "/data/sata_data/home/dl8/anaconda3/envs/pangenome/bin/vg", at 0xcabf56, in main_construct(int, char**)
#7    Object "/data/sata_data/home/dl8/anaconda3/envs/pangenome/bin/vg", at 0xe159c1, in vg::Constructor::construct_graph(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::function<void (vg::Graph&)> const&)
#6    Object "/data/sata_data/home/dl8/anaconda3/envs/pangenome/bin/vg", at 0xe145c0, in vg::Constructor::construct_graph(std::vector<FastaReference*, std::allocator<FastaReference*> > const&, std::vector<vcflib::VariantCallFile*, std::allocator<vcflib::VariantCallFile*> > const&, std::vector<FastaReference*, std::allocator<FastaReference*> > const&, std::function<void (vg::Graph&)> const&)
#5    Object "/data/sata_data/home/dl8/anaconda3/envs/pangenome/bin/vg", at 0xe13842, in vg::Constructor::construct_graph(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, FastaReference&, vg::VcfBuffer&, std::vector<FastaReference*, std::allocator<FastaReference*> > const&, std::function<void (vg::Graph&)> const&)
#4    Object "/data/sata_data/home/dl8/anaconda3/envs/pangenome/bin/vg", at 0xe0ed5e, in vg::Constructor::construct_chunk(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<vcflib::Variant, std::allocator<vcflib::Variant> >, unsigned long) const
#3    Object "/data/sata_data/home/dl8/anaconda3/envs/pangenome/bin/vg", at 0x1e94295, in __assert_fail
#2    Object "/data/sata_data/home/dl8/anaconda3/envs/pangenome/bin/vg", at 0x5b5e67, in __assert_fail_base.cold
#1    Object "/data/sata_data/home/dl8/anaconda3/envs/pangenome/bin/vg", at 0x5b5f97, in abort
#0    Object "/data/sata_data/home/dl8/anaconda3/envs/pangenome/bin/vg", at 0x144d17b, in raise

5. What data and command can the vg dev team use to make the problem happen? Command:

(seq 1 22; echo X; echo Y) | parallel -j 24 "time vg construct -C -S -f -R {} -r $ref -v $vars -t 1 -m 32 >{}.vg"

Data: Reference(FASTA): ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz VCF:ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf.gz

One thing to mention here, Initially I was getting more error, rejection of the alt path, when using the official command given in the guide,

(seq 1 22; echo X; echo Y) | parallel -j 24 "time vg construct -C -R {} -r $ref -v $vars -t 1 -m 32 >{}.vg"

6. What does running vg version say?

vg version v1.45.0 "Alpicella"
Compiled with g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 on Linux
Linked against libstd++ 20210601
Built by anovak@octagon
adamnovak commented 1 year ago

It looks like there's a variant on chromosome 3 in that VCF that we can't handle. Instead of the assert, we should probably change the code to log the variant itself, so we can work out what exactly is wrong with it.

We would like the wiki example to actually work on real data, even though it is under test with fake data.

bruderjakob12 commented 1 year ago

I want to support the case @Zer0day-0 is making. Using a VCF file that has been generated by sniffles2, even after fixing the VCF to contain REF and filtering to INS&DEL only we occasionally observe

vg: src/constructor.cpp:528: vg::ConstructedChunk vg::Constructor::construct_chunk(std::string, std::string, std::vector<vcflib::Variant>, size_t) const: Assertion `!variant->isSymbolicSV()' failed.
ERROR: Signal 6 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug.

without further explaination.

Command run is:

time ./vg construct -f -S -a -C -R chr{} -r $ref -v pre/chr{}.full.vcf.gz -t 1 -m 32 >chr{}.vg

vg version:

vg version v1.46.0 "Altamura"
Compiled with g++ (Ubuntu 10.3.0-1ubuntu1~20.04) 10.3.0 on Linux
Linked against libstd++ 20210408
Built by xian@octo
adamnovak commented 1 year ago

OK, with the command and files @Zer0day-0 posted and the error reporting in #3866 I managed to get a message complaining about a particular variant:

error:[vg::Constructor] On 3 @ 60824462, variant appears to be a symbolic SV, but all variants should have already been converted to explicit sequence edits.
error:[vg::Constructor] Offending variant: 3    60824463    .   ATGTGTGAT... A  100 PASS    AC=1;AF=0.000199681;AFR_AF=0;AMR_AF=0;AN=5008;CIEND=0,500;CIPOS=-500,0;CS=DEL_union;DP=19646;EAS_AF=0;END=60905441;EUR_AF=0.001;NS=2504;SAS_AF=0;SPAN=80978;SVLEN=-80978;SVTYPE=DEL;VT=SV;EX_TARGET

The ... I put in there is really several screens of apparently normal sequence in the real message.

I'm not sure why this variant should appear to be a symbolic SV to vcflib; it isn't actually by the time we are looking at it here.

adamnovak commented 1 year ago

OK, it looks like the problem is that this variant overlaps an M and two R characters in the reference FASTA. vg is supposed to read these as N, but for some reason it is not managing to do that here. It puts them into the variant's ref allele, and then vcflib sees an allele that isn't all A, C, T, G, and N, and decides it must be a symbolic variant.

adamnovak commented 1 year ago

OK, I've updated #3866 so that I can run:

time vg construct -C -S -f -R 3 -r hs37d5.fa -v ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf.gz -t 1 -m 32 >3.vg

And I can get:

Restricting to 3 from 1 to end
warning:[vg::Constructor] Multiallelic SVs cannot be canonicalized by vcflib; skipping variants like: 3 81877   .   C   <CN0>,<CN2>,<CN3>   100 PASS    AC=2,1,1;AF=0.000399361,0.000199681,0.000199681;AFR_AF=0,0,0;AMR_AF=0,0,0;AN=5008;CS=DUP_gs;DP=18430;EAS_AF=0,0,0;END=119932;EUR_AF=0.001,0.001,0.001;NS=2504;SAS_AF=0.001,0,0;SVTYPE=CNV;VT=SV
warning:[vg::Constructor] vcflib could not canonicalize some SVs to base-level sequence; skipping variants like: 3  204866  .   A   <INS:ME:LINE1>  100 PASS    AC=1;AF=0.000199681;AFR_AF=0;AMR_AF=0;AN=5008;CS=L1_umary;DP=20790;EAS_AF=0;EUR_AF=0.001;MEINFO=LINE1,5392,5978,+;NS=2504;SAS_AF=0;SVLEN=586;SVTYPE=LINE1;TSD=GACCAGGGAAATAATGTAAATG;VT=SV
warning:[vg::Constructor] Unsupported IUPAC ambiguity codes found in 3; coercing to N.

real    4m29.653s
user    4m27.040s
sys 0m2.453s

So that PR should now fix this issue, and similar issues should get better error messages.