rpetit3 / vcf-annotator

Add biological annotations to variants in a given VCF file.
MIT License
26 stars 7 forks

Multiple alternative variants #6

Open joanmarticarreras opened 3 years ago

joanmarticarreras commented 3 years ago

Hi Robert!

First, congrats on the project! It's pretty cool!

I decided to give it a try for my project (virus diversity) and realized that my multi-sample VCFs tend to have many genome positions with multiple possible variants (both nucleotides and deletions, "*"). vcf-annotator stops when the ALT column contains multiple symbols, as in "A,G,*".

As an enhancement, I think it would be great if it could handle this type of data.

Cheers

Joan

rpetit3 commented 3 years ago

Hi Joan!

Thank you very much for checking out vcf-annotator. I could look into what you've suggested. Do you think you could attach an example?

Thanks!

joanmarticarreras commented 3 years ago

Hi Robert,

Here is an example:

##fileformat=VCFv4.1
##contig=<ID=1,length=29903>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
NC_045512.2 25  .   T   *,G,C   .   .   .
NC_045512.2 241 .   C   T,Y .   .   .
NC_045512.2 512 .   C   T,* .   .   .
NC_045512.2 514 .   T   C,* .   .   .
NC_045512.2 520 .   G   T,* .   .   .
NC_045512.2 521 .   G   T,* .   .   .
NC_045512.2 710 .   C   T,* .   .   .
NC_045512.2 734 .   T   *,C .   .   .
NC_045512.2 739 .   A   *,G .   .   .
NC_045512.2 745 .   C   *,A .   .   .
NC_045512.2 784 .   C   *,T .   .   .
NC_045512.2 832 .   C   *,T .   .   .
NC_045512.2 835 .   C   *,T .   .   .
NC_045512.2 875 .   C   *,T .   .   .
NC_045512.2 878 .   C   *,T .   .   .
NC_045512.2 894 .   A   *,G .   .   .
NC_045512.2 913 .   C   T,* .   .   .
NC_045512.2 960 .   G   *,A .   .   .

It comes from the program snp-sites, which very nicely collects the mutations in an MSA into a multi-sample VCF file.

I use both of them in a couple of pipelines. If you have a citable source for your software, let us know!

Joan

rpetit3 commented 3 years ago

Hi Joan!

I apologize for the delay! I think I have a workaround in place for this, but first, do you know if the Y here is a typo?

NC_045512.2 241 .   C   T,Y .   .   .

I'm only asking because snp-sites seems to suggest anything not an A, T, G, or C is converted to '*' based on this https://github.com/sanger-pathogens/snp-sites/blob/52c98cb3e0ed0d336b24b27a5c0f3da4cbe44e71/src/vcf.c#L131-L134
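
A quick way to check for stray symbols like that Y is to scan the ALT column for anything outside A, C, G, T and '*'. This is only a minimal sketch: `unexpected_alt_alleles` is a hypothetical helper, not part of vcf-annotator or snp-sites, and it assumes whitespace-separated columns as in the pasted example.

```python
# Hypothetical helper for illustration; not part of vcf-annotator or snp-sites.
EXPECTED = set("ACGT*")


def unexpected_alt_alleles(vcf_lines):
    """Yield (pos, allele) for ALT alleles with symbols outside A, C, G, T, *."""
    for line in vcf_lines:
        # Skip blank lines and header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()  # assumes whitespace-separated columns
        pos, alt = fields[1], fields[4]
        for allele in alt.split(","):
            if allele == ".":  # '.' is the VCF missing value, not an allele
                continue
            if not set(allele.upper()) <= EXPECTED:
                yield int(pos), allele


records = [
    "NC_045512.2 25  .   T   *,G,C   .   .   .",
    "NC_045512.2 241 .   C   T,Y .   .   .",
    "NC_045512.2 512 .   C   T,* .   .   .",
]
flagged = list(unexpected_alt_alleles(records))
print(flagged)  # [(241, 'Y')]
```

On the three records taken from the example, only position 241 is reported.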

rpetit3 commented 3 years ago

Looping @marimaro into this, since this also seems related to * in the ALT field.

In this case I'm assuming * means everything that was not an A, T, G, C, or N. Or could it also mean missing?

I guess what I'm asking is, how would you want '*' to be dealt with? Treated as an N, ignored, etc...

Would love to get your thoughts

marimaro commented 3 years ago

Hey Robert,

I'm not sure, but I found this: https://www.biostars.org/p/279971/

To run vcf-annotator without this error I simply replaced the * with nothing, though this might not be the best approach.

Hope it helps and thanks for looking into this.
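
One possible reading of that workaround, as a rough sketch: drop '*' from the ALT list before running vcf-annotator. `drop_star_alleles` is a hypothetical name, the input is assumed to be tab-separated, and falling back to '.' when nothing is left is an assumption rather than what was actually done here.

```python
def drop_star_alleles(line):
    """Return a VCF line with '*' removed from ALT; header lines pass through."""
    line = line.rstrip("\n")
    if not line.strip() or line.startswith("#"):
        return line
    fields = line.split("\t")  # assumes tab-separated columns
    alleles = [a for a in fields[4].split(",") if a != "*"]
    # Assumption: use the VCF missing value when no allele remains.
    fields[4] = ",".join(alleles) if alleles else "."
    return "\t".join(fields)


record = "NC_045512.2\t512\t.\tC\tT,*\t.\t.\t."
cleaned = drop_star_alleles(record)
print(cleaned.split("\t")[4])  # T
```

Note that if the VCF has sample genotype columns, removing an ALT allele shifts the allele indices used in GT, and this sketch does not renumber them.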

rpetit3 commented 3 years ago

Was yours also from snp-sites or a VCF produced by a multiple sequence alignment?

Taken from the VCF spec docs

 The ‘*’ allele is reserved to indicate that the allele is missing due to a upstream deletion.
 If there are no alternative alleles, then the missing value should be used.

Did you have any variants where the ALT was just *? Like this:

NC_045512.2 512 .   C   *   .   .   .

marimaro commented 3 years ago

Nope, only as something like T,* or A,*

rpetit3 commented 3 years ago

I'm thinking for this I will annotate the T, but add a field like has_asterisk=True in the INFO column.

Do you think that would work for you?
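
Roughly, the idea could look like the sketch below. It is not the actual vcf-annotator code: `split_alt_and_flag` is a hypothetical helper, the input is assumed to be tab-separated, and leaving the ALT column itself unchanged is an assumption. Only the `has_asterisk=True` key comes from the proposal above.

```python
def split_alt_and_flag(line):
    """Return (alleles_to_annotate, updated_line) for one VCF data line."""
    fields = line.rstrip("\n").split("\t")  # assumes tab-separated columns
    alleles = fields[4].split(",")
    # Only the non-'*' alleles would be passed on for annotation.
    to_annotate = [a for a in alleles if a != "*"]
    if "*" in alleles:
        tag = "has_asterisk=True"
        # INFO is column 8; '.' means it was empty.
        fields[7] = tag if fields[7] == "." else f"{fields[7]};{tag}"
    return to_annotate, "\t".join(fields)


record = "NC_045512.2\t512\t.\tC\tT,*\t.\t.\t."
alleles, updated = split_alt_and_flag(record)
print(alleles)                  # ['T']
print(updated.split("\t")[7])   # has_asterisk=True
```

A new INFO key like this would normally also need a matching ##INFO line in the VCF header.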

marimaro commented 3 years ago

For me, yes. If you add an explanation somewhere, I guess it'll be fine.

rpetit3 commented 3 years ago

Another example with * in the column https://github.com/rpetit3/vcf-annotator/issues/9#issuecomment-918117810

Tagging @BioWilko - How would you like the * to be treated?

BioWilko commented 3 years ago

I'm honestly not sure. I think your suggestion above is probably the best option. My guess is that snp-sites is seeing a lower-case "n" and getting confused, but don't quote me on that.