Open jameshadfield opened 9 months ago
A few observations after looking into this today:
Minor errors in VCF genotype syntax
The write_VCF_translation
produces VCFs which look like this
##fileformat=VCFv4.2
##source=NextStrain_Protein_Translation
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT node_root node_AB sample_A sample_B sample_C
gene1 2 . P L . PASS . GT . 1/1 1/1 1/1 .
gene2 2 . V D . PASS . GT . 1/1 1/1 1/1 .
The genotype calls should be haploid (i.e. 1
not 1/1
) and the intention of genotype .
here is "no change from REF" but .
actually means "a call cannot be made at the given locus".
VCF files can't represent amino acid diversity
The VCF spec doesn't seem to say that it's for nucleotide sequences only, but the 4.3 spec defines genotype ALTs as matching ^([ACGTNacgtn]+|\*|\.)$
so the intention is certainly nuc-only. I don't see why we couldn't make a VCF-like file for amino acids, and it'd mostly work, but I don't know how we'd represent stop codons because ALT=*
is reserved for "allele missing due to overlapping deletion".
write_vcf
can't write these data currently
It only writes one chromosome, and here we use different chromosomes for different genes. It wouldn't be hard to refactor it to accept multi-chromsome inputs. It's currently only used in 1 place in Augur and 1 place in TreeTime. Alternatively we could keep things so that write_vcf
writes single chromosome VCFs and add a write_multi_chrom_vcf
. If we do this we should probably extend read_vcf
to read in multiple chromosomes.
But before embarking on this we need to decide whether we should be producing AA-VCFs. I know augur sequence-traits
uses them.
VCF files can't represent amino acid diversity
I may be completely and totally misremembering but I have some kind of vague memory of writing code that did output AAs in VCF format. This would very likely not at all have been official, but was my effort to produce a large number of gene translations in some kind of compressed format. As I remember, I also used genes in place of chromosomes as you reference in the 3rd part above.
If that's gone from the code now though, I can't really when where or when it disappeared...
I may be completely and totally misremembering but I have some kind of vague memory of writing code that did output AAs in VCF format. This would very likely not at all have been official, but was my effort to produce a large number of gene translations in some kind of compressed format. As I remember, I also used genes in place of chromosomes as you reference in the 3rd part above.
Yup! That's exactly the behaviour of write_VCF_translation
(now located within augur.io.vcf
). And I think it's ok to use VCF for this, but because it's not part of the spec we can't use TreeTime's write_VCF
and so we have to maintain parallel VCF-serialisation code, and parallel VCF-parsing code for the same reasons. The AA-VCF file is also only used by augur sequence-traits
as far as I know, so this is a minor concern.
augur.io.vcf
includeswrite_VCF_translation
which is used byaugur translate
. This functionality is better implemented (and now tested) in treetime so we should use that function instead.