nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0
268 stars 129 forks source link

[vcf] Replace `write_VCF_translation` with treetime's `write_vcf` #1356

Open jameshadfield opened 9 months ago

jameshadfield commented 9 months ago

augur.io.vcf includes write_VCF_translation which is used by augur translate. This functionality is better implemented (and now tested) in treetime so we should use that function instead.

jameshadfield commented 9 months ago

A few observations after looking into this today:

Minor errors in VCF genotype syntax The write_VCF_translation produces VCFs which look like this

##fileformat=VCFv4.2
##source=NextStrain_Protein_Translation
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  node_root   node_AB sample_A    sample_B    sample_C
gene1   2   .   P   L   .   PASS    .   GT  .   1/1 1/1 1/1 .
gene2   2   .   V   D   .   PASS    .   GT  .   1/1 1/1 1/1 .

The genotype calls should be haploid (i.e. 1 not 1/1) and the intention of genotype . here is "no change from REF" but . actually means "a call cannot be made at the given locus".

VCF files can't represent amino acid diversity

The VCF spec doesn't seem to say that it's for nucleotide sequences only, but the 4.3 spec defines genotype ALTs as matching ^([ACGTNacgtn]+|\*|\.)$ so the intention is certainly nuc-only. I don't see why we couldn't make a VCF-like file for amino acids, and it'd mostly work, but I don't know how we'd represent stop codons because ALT=* is reserved for "allele missing due to overlapping deletion".

write_vcf can't write these data currently

It only writes one chromosome, and here we use different chromosomes for different genes. It wouldn't be hard to refactor it to accept multi-chromsome inputs. It's currently only used in 1 place in Augur and 1 place in TreeTime. Alternatively we could keep things so that write_vcf writes single chromosome VCFs and add a write_multi_chrom_vcf. If we do this we should probably extend read_vcf to read in multiple chromosomes.

But before embarking on this we need to decide whether we should be producing AA-VCFs. I know augur sequence-traits uses them.

emmahodcroft commented 8 months ago

VCF files can't represent amino acid diversity

I may be completely and totally misremembering but I have some kind of vague memory of writing code that did output AAs in VCF format. This would very likely not at all have been official, but was my effort to produce a large number of gene translations in some kind of compressed format. As I remember, I also used genes in place of chromosomes as you reference in the 3rd part above.

If that's gone from the code now though, I can't really when where or when it disappeared...

jameshadfield commented 8 months ago

I may be completely and totally misremembering but I have some kind of vague memory of writing code that did output AAs in VCF format. This would very likely not at all have been official, but was my effort to produce a large number of gene translations in some kind of compressed format. As I remember, I also used genes in place of chromosomes as you reference in the 3rd part above.

Yup! That's exactly the behaviour of write_VCF_translation (now located within augur.io.vcf). And I think it's ok to use VCF for this, but because it's not part of the spec we can't use TreeTime's write_VCF and so we have to maintain parallel VCF-serialisation code, and parallel VCF-parsing code for the same reasons. The AA-VCF file is also only used by augur sequence-traits as far as I know, so this is a minor concern.