vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

Handling of spanning deletions and the * allele in vcf 4.3 #2644

Open ivargr opened 4 years ago

ivargr commented 4 years ago

It seems that vg construct does not handle the way VCF 4.3 uses the * allele to represent variants inside spanning deletions/insertions (see this for details). Instead, vg construct just inserts a node with * as its sequence. I guess this isn't really a bug, but just something that hasn't been implemented yet? I think the correct behaviour, and a simple solution, would be to ignore such stars as alternate alleles when parsing the ALT information. Here is a small test case with a VCF that has a spanning deletion (first variant line) and where the following SNP line uses a * in the ALT field to represent the fact that some individuals will not be able to have that SNP:

PS: This is not critical for most VCFs, which don't use * in the ALT field, but I believe that more and more VCFs output by GATK will have this problem. Another solution would of course be to remove stars from the ALT field (and adjust the genotype fields accordingly) before running vg, but I haven't found any tools able to do this.
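The preprocessing workaround mentioned in the PS could look roughly like the sketch below. It is not an existing tool and not part of vg: the function name `strip_star_alleles` is hypothetical, and mapping a genotype that pointed at `*` to the missing value `.` is an assumption. It only rewrites ALT and a leading GT field, so per-allele INFO and FORMAT fields such as AC, AF, AD or PL would still need separate handling.

```python
import re


def strip_star_alleles(line):
    """Drop '*' from the ALT column of one tab-separated VCF data line.

    Genotype indices are renumbered to match the reduced ALT list, and
    indices that pointed at '*' become '.' (an assumption of this sketch).
    Returns the rewritten line, or None if no alternate allele remains.
    """
    fields = line.rstrip("\n").split("\t")
    alts = fields[4].split(",")
    if "*" not in alts:
        return "\t".join(fields)
    kept = [alt for alt in alts if alt != "*"]
    if not kept:
        return None

    # Map old allele indices to new ones; index 0 is always REF.
    remap = {"0": "0", ".": "."}
    new_index = 0
    for old_index, alt in enumerate(alts, start=1):
        if alt == "*":
            remap[str(old_index)] = "."
        else:
            new_index += 1
            remap[str(old_index)] = str(new_index)
    fields[4] = ",".join(kept)

    # Rewrite GT only when sample columns exist and GT is the first FORMAT key.
    if len(fields) > 9 and fields[8].split(":")[0] == "GT":
        for i in range(9, len(fields)):
            parts = fields[i].split(":")
            tokens = re.split(r"([/|])", parts[0])  # keeps the phase separators
            parts[0] = "".join(
                tok if tok in ("/", "|") else remap.get(tok, ".") for tok in tokens
            )
            fields[i] = ":".join(parts)
    return "\t".join(fields)
```

Header lines (those starting with `#`) would be passed through unchanged, and records for which the function returns None would be dropped.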

1. What were you trying to do?

variants.vcf:

##fileformat=VCFv4.3
##contig=<ID=1,length=100,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
#CHROM   POS ID  REF ALT QUAL    FILTER  INFO    FORMAT
1   5   .   AAAAA   A   29  PASS    .   .   
1   6   .   A   T,* 29  PASS    .   .

reference.fasta:

>1
TTTTAAAAACCCCC

Then running vg construct -r reference.fasta -v variants.vcf | vg view -Vj - shows that a node with * as sequence has been created:

{
  "edge": [
    {
      "from": "1",
      "to": "2"
    },
    {
      "from": "1",
      "to": "3"
    },
    {
      "from": "1",
      "to": "4"
    },
    {
      "from": "2",
      "to": "5"
    },
    {
      "from": "3",
      "to": "5"
    },
    {
      "from": "4",
      "to": "5"
    },
    {
      "from": "5",
      "to": "6"
    },
    {
      "from": "1",
      "to": "6"
    }
  ],
  "node": [
    {
      "id": "1",
      "sequence": "TTTTA"
    },
    {
      "id": "2",
      "sequence": "T"
    },
    {
      "id": "3",
      "sequence": "*"
    },
    {
      "id": "4",
      "sequence": "A"
    },
    {
      "id": "5",
      "sequence": "AAA"
    },
    {
      "id": "6",
      "sequence": "CCCCC"
    }
  ],
  ..................
}

2. What did you want to happen?

vg should have created the same graph, but without node 3 and the edges connected to it.

3. What actually happened?

See the JSON output of the graph above.

6. What does running vg version say?

vg version v1.21.0 "Fanano"
Compiled with g++ (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0 on Linux
Linked against libstd++ 20181206
Built by anovak@hex

adamnovak commented 4 years ago

I don't think we profess to support anything new in VCF 4.3 yet. But we also shouldn't be passing through "*" characters, which are in no way sensible DNA bases. We should be throwing errors on these files right now.
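As an illustration of the kind of check described above, here is a minimal Python sketch that rejects ALT alleles that are not sensible DNA. It is not vg's code: the function name `check_alt_alleles` and both patterns are hypothetical, and letting symbolic alleles of the form `<ID>` through is an assumption of this sketch. Breakend notation is not considered and would be rejected.

```python
import re

# Plain sequence alleles: only A, C, G, T, N in either case (assumed alphabet).
DNA_ALLELE = re.compile(r"[ACGTNacgtn]+")
# Symbolic alleles such as <DEL>; allowing these is an assumption of this sketch.
SYMBOLIC_ALLELE = re.compile(r"<[^<>]+>")


def check_alt_alleles(line):
    """Raise ValueError if any ALT allele of a tab-separated VCF data line
    is neither a plain DNA sequence nor a symbolic <ID> allele."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos = fields[0], fields[1]
    for alt in fields[4].split(","):
        if DNA_ALLELE.fullmatch(alt) or SYMBOLIC_ALLELE.fullmatch(alt):
            continue
        raise ValueError(
            f"{chrom}:{pos}: ALT allele {alt!r} is not a DNA sequence"
        )
```

Run over the two records in the test case, the first would pass and the second would raise because of the `*` allele.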

ekg commented 4 years ago

I thought that we handled SV alleles? Is anything special needed to do this, or is this just a compatibility issue with 4.3?

glennhickey commented 4 years ago

This is a bit different from the SV alleles. In terms of support, just ignoring these * alts in vg construct should produce the correct graph. But I suspect that the GBWT logic may need updating to support these nested events. @jltsiren?

If it's too much trouble to handle, we could also look into adding a normalization routine in vcflib that just collapses these into single records.

jltsiren commented 4 years ago

GBWT construction cannot handle nested variants. The entire logic is based on the assumption that the VCF represents a linear structure. Large-scale construction would otherwise be too slow. By this assumption, nested variants must either be normalized into a single record or split into multiple non-nested records.
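To see which records in a VCF would violate that assumption, one could scan for records that start inside the REF span of an earlier record on the same contig. The sketch below is only an illustration and is unrelated to the actual GBWT code: the function name `find_nested_records` is hypothetical, and it assumes the input is already sorted by contig and position.

```python
def find_nested_records(records):
    """Return (outer, inner) pairs where `inner` starts inside the REF span
    of an earlier record on the same contig.

    `records` is an iterable of (chrom, pos, ref) tuples with 1-based
    positions, assumed sorted by contig and position.
    """
    nested = []
    outer = None      # record whose REF span reaches furthest so far
    outer_end = 0     # last reference position covered by `outer`
    for chrom, pos, ref in records:
        end = pos + len(ref) - 1
        same_contig = outer is not None and chrom == outer[0]
        if same_contig and pos <= outer_end:
            nested.append((outer, (chrom, pos, ref)))
        if not same_contig or end > outer_end:
            outer, outer_end = (chrom, pos, ref), end
    return nested
```

For the test case in this issue, the deletion at position 5 has a REF span covering positions 5 to 9, so the SNP at position 6 is reported as lying inside it. Records flagged this way are the ones that would need to be merged into a single record or split into non-nested ones.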