Open ivargr opened 4 years ago
I don't think we profess to support anything new in VCF 4.3 yet. But we also shouldn't be passing through "*" characters, which are in no way sensible DNA bases. We should be throwing errors on these files right now.
I thought that we handled SV alleles? Is anything special needed to do this, or is this just a compatibility issue with 4.3?
On Tue, Feb 18, 2020, 20:23 Adam Novak wrote:
> I don't think we profess to support anything new in VCF 4.3 yet. But we also shouldn't be passing through "*" characters, which are in no way sensible DNA bases. We should be throwing errors on these files right now.
This is a bit different from the SV alleles. In terms of support, just ignoring these * alts in vg construct should produce the correct graph. But I suspect that the GBWT logic may need updating to support these nested events. @jltsiren?
If it's too much trouble to handle, we could also look into adding a normalization routine in vcflib that just collapses these into single records.
GBWT construction cannot handle nested variants. The entire logic is based on the assumption that the VCF represents a linear structure. Large-scale construction would otherwise be too slow. By this assumption, nested variants must either be normalized into a single record or split into multiple non-nested records.
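To see whether a VCF violates this linearity assumption before construction, one could scan it for records that start inside the REF span of an earlier record. The following is a minimal sketch of such a check, not the logic GBWT or vg actually uses. The function name `find_nested_records` and the `(chrom, pos, ref)` tuple layout are hypothetical, and "nested" is taken conservatively to mean any overlap of REF spans, assuming the records are sorted by chromosome and position.

```python
def find_nested_records(records):
    """Report records that start inside the REF span of an earlier record.

    `records` is an iterable of (chrom, pos, ref) tuples with 1-based
    positions, sorted by chromosome and then position (an assumption).
    Returns a list of (record_index, spanning_record_index) pairs.
    """
    span_chrom, span_end, span_index = None, 0, None
    nested = []
    for i, (chrom, pos, ref) in enumerate(records):
        end = pos + len(ref) - 1  # last reference base covered by REF
        if chrom == span_chrom and pos <= span_end:
            nested.append((i, span_index))
        # Track the record whose REF span reaches furthest so far.
        if chrom != span_chrom or end > span_end:
            span_chrom, span_end, span_index = chrom, end, i
    return nested


if __name__ == "__main__":
    example = [("1", 2, "ACGT"), ("1", 4, "G"), ("1", 7, "C")]
    for inner, outer in find_nested_records(example):
        print(f"record {inner} overlaps record {outer}")
```

Records flagged this way are the candidates for merging into a single record or splitting into non-nested ones.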
It seems that vg construct does not handle the way VCF 4.3 uses the * to represent variants inside spanning deletions/insertions (see this for details). It seems that vg construct just inserts a node with a * as sequence. I guess this isn't really a bug, but just something that hasn't been implemented yet? I think the correct behaviour and a simple solution would be to just ignore such stars as alternate alleles when parsing the ALT information. Here is a small test case with a VCF that has a spanning deletion (first variant line) and where the following SNP line uses a * in the ALT field to represent the fact that some individuals will not be able to have that SNP:
PS: This is not critical for most VCFs, which don't use * in the ALT field, but I believe that more and more VCFs output by GATK will have this problem. Another solution would of course be to remove the stars from the ALT field (and adjust the genotype fields accordingly) before running vg, but I haven't found any tools able to do this.
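The preprocessing idea in the PS could be sketched roughly as below. This is an illustration under stated assumptions, not an existing tool. The function name `strip_star_alleles` is hypothetical, genotypes that pointed at the * allele are rewritten to missing (`.`), which is one possible choice, and per-allele INFO or FORMAT values such as AC, AF, AD or PL are left untouched and would need separate handling.

```python
import re


def strip_star_alleles(line):
    """Remove '*' from the ALT column of one tab-separated VCF data line.

    Genotype indices are renumbered to match the remaining ALT alleles.
    Genotypes that referred to '*' become '.' (missing); this is an
    assumption of the sketch. Returns None if no ALT allele remains.
    """
    fields = line.rstrip("\n").split("\t")
    alts = fields[4].split(",")
    if "*" not in alts:
        return "\t".join(fields)

    # Map old allele indices to new ones; index 0 is always REF.
    remap = {"0": "0"}
    kept = []
    for old_index, alt in enumerate(alts, start=1):
        if alt == "*":
            remap[str(old_index)] = "."
        else:
            kept.append(alt)
            remap[str(old_index)] = str(len(kept))
    if not kept:
        return None
    fields[4] = ",".join(kept)

    # GT, when present, is the first FORMAT key, so it is the first
    # colon-separated subfield of every sample column.
    if len(fields) > 9 and fields[8].split(":")[0] == "GT":
        for i in range(9, len(fields)):
            parts = fields[i].split(":")
            parts[0] = re.sub(r"\d+", lambda m: remap.get(m.group(), m.group()), parts[0])
            fields[i] = ":".join(parts)
    return "\t".join(fields)


if __name__ == "__main__":
    record = "1\t5\t.\tA\tT,*\t.\t.\t.\tGT\t0|1\t2|1"
    print(strip_star_alleles(record))
```

Lines for which the function returns None had * as their only ALT allele and can simply be dropped from the output.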
1. What were you trying to do?
variants.vcf:
reference.fasta:
Then running
vg construct -r reference.fasta -v variants.vcf | vg view -Vj -
shows that a node with * as sequence has been created:
2. What did you want to happen?
vg should have created the same graph, but without node 3 and the edges connected to it.
3. What actually happened?
See the JSON output of the graph above.
6. What does running vg version say?