Closed Zer0day-0 closed 1 year ago
It looks like there's a variant on chromosome 3 in that VCF that we can't handle. Instead of the assert, we should probably change the code to log the variant itself, so we can work out what exactly is wrong with it.
We would like the wiki example to actually work on real data, even though it is under test with fake data.
I want to support the case @Zer0day-0 is making. Using a VCF file generated by sniffles2, even after fixing the VCF to contain REF and filtering to INS&DEL only, we occasionally observe:
vg: src/constructor.cpp:528: vg::ConstructedChunk vg::Constructor::construct_chunk(std::string, std::string, std::vector<vcflib::Variant>, size_t) const: Assertion `!variant->isSymbolicSV()' failed.
ERROR: Signal 6 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug.
without further explanation.
Command run is:
time ./vg construct -f -S -a -C -R chr{} -r $ref -v pre/chr{}.full.vcf.gz -t 1 -m 32 >chr{}.vg
vg version:
vg version v1.46.0 "Altamura"
Compiled with g++ (Ubuntu 10.3.0-1ubuntu1~20.04) 10.3.0 on Linux
Linked against libstd++ 20210408
Built by xian@octo
OK, with the command and files @Zer0day-0 posted and the error reporting in #3866 I managed to get a message complaining about a particular variant:
error:[vg::Constructor] On 3 @ 60824462, variant appears to be a symbolic SV, but all variants should have already been converted to explicit sequence edits.
error:[vg::Constructor] Offending variant: 3 60824463 . ATGTGTGAT... A 100 PASS AC=1;AF=0.000199681;AFR_AF=0;AMR_AF=0;AN=5008;CIEND=0,500;CIPOS=-500,0;CS=DEL_union;DP=19646;EAS_AF=0;END=60905441;EUR_AF=0.001;NS=2504;SAS_AF=0;SPAN=80978;SVLEN=-80978;SVTYPE=DEL;VT=SV;EX_TARGET
The ... I put in there is really several screens of apparently normal sequence in the real message.
I'm not sure why this variant should appear to be a symbolic SV to vcflib; it isn't actually by the time we are looking at it here.
OK, it looks like the problem is that this variant overlaps an M and two R characters in the reference FASTA. vg is supposed to read these as N, but for some reason it is not managing to do that here. It puts them into the variant's ref allele, and then vcflib sees an allele that isn't all A, C, T, G, and N, and decides it must be a symbolic variant.
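To make that failure mode concrete, here is a rough Python sketch of the two steps described above: coercing IUPAC ambiguity codes to N, and testing whether an allele consists only of A, C, G, T, and N. This is not the actual vg or vcflib code. The function names and the example sequence are made up for illustration, and the "plain allele" test is only an approximation of the check described in this comment.

```python
import re

# Assumption: only A, C, G, T and N (either case) count as plain bases.
PLAIN_ALLELE = re.compile(r"[ACGTNacgtn]+")


def coerce_ambiguity_codes(seq: str) -> str:
    """Replace every character that is not A, C, G, T or N with N."""
    return re.sub(r"[^ACGTNacgtn]", "N", seq)


def is_plain_allele(allele: str) -> bool:
    """Return True if the allele is non-empty and made only of A, C, G, T, N."""
    return PLAIN_ALLELE.fullmatch(allele) is not None


# Hypothetical reference slice containing one M and two R characters.
ref_slice = "ATGMTGRRAT"
print(is_plain_allele(ref_slice))      # the M and R codes make this fail
fixed = coerce_ambiguity_codes(ref_slice)
print(fixed, is_plain_allele(fixed))   # after coercion the allele is plain
```

If the coercion step is skipped, the M and R characters end up in the ref allele and the plain-allele test fails, which matches the behaviour reported for the variant at 3:60824463.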
OK, I've updated #3866 so that I can run:
time vg construct -C -S -f -R 3 -r hs37d5.fa -v ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf.gz -t 1 -m 32 >3.vg
And I can get:
Restricting to 3 from 1 to end
warning:[vg::Constructor] Multiallelic SVs cannot be canonicalized by vcflib; skipping variants like: 3 81877 . C <CN0>,<CN2>,<CN3> 100 PASS AC=2,1,1;AF=0.000399361,0.000199681,0.000199681;AFR_AF=0,0,0;AMR_AF=0,0,0;AN=5008;CS=DUP_gs;DP=18430;EAS_AF=0,0,0;END=119932;EUR_AF=0.001,0.001,0.001;NS=2504;SAS_AF=0.001,0,0;SVTYPE=CNV;VT=SV
warning:[vg::Constructor] vcflib could not canonicalize some SVs to base-level sequence; skipping variants like: 3 204866 . A <INS:ME:LINE1> 100 PASS AC=1;AF=0.000199681;AFR_AF=0;AMR_AF=0;AN=5008;CS=L1_umary;DP=20790;EAS_AF=0;EUR_AF=0.001;MEINFO=LINE1,5392,5978,+;NS=2504;SAS_AF=0;SVLEN=586;SVTYPE=LINE1;TSD=GACCAGGGAAATAATGTAAATG;VT=SV
warning:[vg::Constructor] Unsupported IUPAC ambiguity codes found in 3; coercing to N.
real 4m29.653s
user 4m27.040s
sys 0m2.453s
So that PR should now fix this issue, and similar issues should get better error messages.
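For anyone who wants to see ahead of time which records are likely to be skipped or to trip this check, here is a minimal Python pre-check sketch. It only looks at the REF and ALT columns of a tab-separated VCF data line and sorts the record into rough categories modelled on the warnings above. The category names and the `classify_record` helper are hypothetical, and the logic is a guess at a useful heuristic, not what vg or vcflib actually do.

```python
import re

# Assumption: only A, C, G, T and N (either case) count as plain bases.
PLAIN_ALLELE = re.compile(r"[ACGTNacgtn]+")


def classify_record(line: str) -> str:
    """Roughly classify one tab-separated VCF data line by its REF/ALT columns."""
    fields = line.rstrip("\n").split("\t")
    ref, alts = fields[3], fields[4].split(",")
    symbolic = [a for a in alts if a.startswith("<") and a.endswith(">")]
    if symbolic and len(alts) > 1:
        return "multiallelic-symbolic"   # e.g. <CN0>,<CN2>,<CN3>
    if symbolic:
        return "symbolic"                # e.g. <INS:ME:LINE1>
    if not all(PLAIN_ALLELE.fullmatch(a) for a in [ref] + alts):
        # Also catches "*", "." and breakend notation, not just IUPAC codes.
        return "non-ACGTN"
    return "explicit"


# Shortened records based on the warnings above; the third REF is hypothetical.
records = [
    "3\t81877\t.\tC\t<CN0>,<CN2>,<CN3>\t100\tPASS\tSVTYPE=CNV",
    "3\t204866\t.\tA\t<INS:ME:LINE1>\t100\tPASS\tSVTYPE=LINE1",
    "3\t60824463\t.\tATGMTGRRAT\tA\t100\tPASS\tSVTYPE=DEL",
    "3\t100\t.\tAT\tA\t100\tPASS\t.",
]
for rec in records:
    print(rec.split("\t")[1], classify_record(rec))
```

Note that a REF copied from the VCF will not show the problem if the ambiguity codes only exist in the FASTA, so this check is most useful on VCFs whose REF was filled in from the reference.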
1. What were you trying to do? I am trying to generate vg graphs using the GRCh37 reference sequence (FASTA) and the 1000 Genomes Project VCF data for all chromosomes. I am following the steps of the tutorial provided at this vg-wiki link.
2. What did you want to happen? I expected to see a
for all of the chromosomes.
3. What actually happened? Most of the chromosomes ran without error. However, some of them showed the following output.
4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:
5. What data and command can the vg dev team use to make the problem happen? Command:
Data:
Reference (FASTA): ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
VCF: ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf.gz
One thing to mention here: initially I was getting more errors (rejection of the alt path) when using the official command given in the guide,
6. What does running vg version say?