Open joanmarticarreras opened 3 years ago
Hi Joan!
Thank you very much for checking out vcf-annotator. I could look into what you've suggested. Do you think you could attach an example?
Thanks!
Hi Robert,
Here is an example:
##fileformat=VCFv4.1
##contig=<ID=1,length=29903>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO
NC_045512.2 25 . T *,G,C . . .
NC_045512.2 241 . C T,Y . . .
NC_045512.2 512 . C T,* . . .
NC_045512.2 514 . T C,* . . .
NC_045512.2 520 . G T,* . . .
NC_045512.2 521 . G T,* . . .
NC_045512.2 710 . C T,* . . .
NC_045512.2 734 . T *,C . . .
NC_045512.2 739 . A *,G . . .
NC_045512.2 745 . C *,A . . .
NC_045512.2 784 . C *,T . . .
NC_045512.2 832 . C *,T . . .
NC_045512.2 835 . C *,T . . .
NC_045512.2 875 . C *,T . . .
NC_045512.2 878 . C *,T . . .
NC_045512.2 894 . A *,G . . .
NC_045512.2 913 . C T,* . . .
NC_045512.2 960 . G *,A . . .
It comes from the program snp-sites, which very nicely collects the mutations in an MSA into a multi-sample VCF file.
I use both of them in a couple of pipelines. If you have a citable source for your software, let us know!
Joan
Hi Joan!
I apologize for the delay! I think I have a workaround in place for this, but first, do you know if the Y here is a typo?
NC_045512.2 241 . C T,Y . . .
I'm only asking because snp-sites seems to suggest anything not an A, T, G, or C is converted to '*' based on this https://github.com/sanger-pathogens/snp-sites/blob/52c98cb3e0ed0d336b24b27a5c0f3da4cbe44e71/src/vcf.c#L131-L134
Looping @marimaro into this, since this also seems related to * in the ALT field.
In this case I'm assuming * means everything that was not an A, T, G, C, or N. Or could it also mean missing?
I guess what I'm asking is, how would you want '*' to be dealt with? Treated as an N, ignored, etc...
Would love to get your thoughts
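To make the options concrete, here is a minimal Python sketch that sorts the ALT alleles of a single record into plain bases, the '*' allele, and anything else (such as the Y above). The helper name classify_alt_alleles is hypothetical; it is not part of vcf-annotator or snp-sites, and treating N as "other" is an assumption made only for illustration.

```python
def classify_alt_alleles(alt_field):
    """Split a VCF ALT field into plain bases, '*' alleles, and other symbols.

    Hypothetical helper for illustration; not part of vcf-annotator.
    """
    groups = {"bases": [], "asterisk": [], "other": []}
    for allele in alt_field.split(","):
        if allele == "*":
            groups["asterisk"].append(allele)
        elif allele and set(allele.upper()) <= set("ACGT"):
            # Only unambiguous nucleotides count as plain bases here.
            groups["bases"].append(allele)
        else:
            # Ambiguity codes (Y, N, ...) and the missing value "." land here.
            groups["other"].append(allele)
    return groups


print(classify_alt_alleles("T,Y"))
print(classify_alt_alleles("*,G,C"))
```

Whichever policy is chosen for '*' (treat as N, ignore, or flag), it would only need to act on the "asterisk" and "other" groups.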
Hey Robert,
I'm not sure, but I found this: https://www.biostars.org/p/279971/
To run vcf-annotator without this error I simply replaced the * with nothing, though this might not be the best approach.
Hope it helps and thanks for looking into this.
Was yours also from snp-sites or a VCF produced by a multiple sequence alignment?
Taken from the VCF spec docs
The ‘*’ allele is reserved to indicate that the allele is missing due to a upstream deletion.
If there are no alternative alleles, then the missing value should be used.
Did you have any variants where the ALT was just *? Like this:
NC_045512.2 512 . C * . . .
Nope, only as something like T,* or A,*
I'm thinking for this I will annotate the T, but add a field like has_asterisk=True in the INFO column.
Do you think that would work for you?
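As a rough illustration of that idea, the sketch below takes one VCF data line, picks out the non-'*' ALT alleles for annotation, and appends has_asterisk=True to INFO when a '*' is present. It is not vcf-annotator's actual implementation: the function name flag_asterisk is hypothetical, tab-separated columns in the standard order are assumed, and leaving the ALT column itself unchanged is an assumption.

```python
def flag_asterisk(record_line):
    """Return (alleles_to_annotate, updated_record_line) for one VCF data line.

    Hypothetical sketch: assumes tab-separated columns in the standard order
    CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO.
    """
    fields = record_line.rstrip("\n").split("\t")
    alt_alleles = fields[4].split(",")
    # Only the non-'*' alleles would be passed on for annotation.
    to_annotate = [a for a in alt_alleles if a != "*"]
    if "*" in alt_alleles:
        info = fields[7]
        # "." is the VCF missing value, so replace it instead of appending.
        fields[7] = "has_asterisk=True" if info == "." else info + ";has_asterisk=True"
    return to_annotate, "\t".join(fields)


alleles, line = flag_asterisk("NC_045512.2\t512\t.\tC\tT,*\t.\t.\t.")
print(alleles)  # ['T']
print(line)
```

A record whose ALT is only * would yield an empty list of alleles to annotate, so the caller would still need to decide how to handle that case.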
For me, yes. If you add an explanation somewhere, I guess it'll be fine.
Another example with * in the column https://github.com/rpetit3/vcf-annotator/issues/9#issuecomment-918117810
Tagging @BioWilko - How would you like the * to be treated?
I'm honestly not sure, I think your suggestion above is probably the best option. I think it's a case of snp-sites seeing a lower case "n" and getting confused but I wouldn't quote me on that.
Hi Robert!
First, congrats on the project! It's pretty cool!
I've decided to give it a try for my project (virus diversity) and I realized that in my multi-sample VCFs I tend to have many genome positions with multiple possible variants (both nucleotides and deletions "*"). vcf-annotator stops when the ALT column contains multiple symbols, as in "A,G,*".
As an enhancement, I think it would be great if it could handle this type of data.
Cheers
Joan