nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0

[vcf] replace VCF reading with TreeTime's `read_vcf` #1357

Open jameshadfield opened 8 months ago

jameshadfield commented 8 months ago

Augur uses a variety of functions to read VCF files, including TreeTime's read_vcf, which now includes a lot of validation and helpful error messages. We should shift all our VCF reading to use this function.

augur mask

Augur mask uses a function get_chrom_name(vcf_file) to read (part of) the VCF file. See #507 for more context.
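For readers unfamiliar with what reading "(part of) the VCF file" can look like, here is a minimal sketch that stops at the first data line and returns its CHROM column. It is not Augur's actual get_chrom_name implementation; the function name first_chrom_name is hypothetical, and the gzip handling by file extension is an assumption.

```python
import gzip
from typing import Optional


def first_chrom_name(vcf_path: str) -> Optional[str]:
    """Return the CHROM value of the first data line, or None if there is none.

    Hypothetical helper for illustration only; not Augur's get_chrom_name.
    """
    # Assumption: compressed inputs are identified by a ".gz" suffix.
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, mode="rt") as handle:
        for line in handle:
            # Skip meta-information lines ("##..."), the "#CHROM" header, and blanks.
            if line.startswith("#") or not line.strip():
                continue
            # CHROM is the first tab-separated column of a data line.
            return line.split("\t", 1)[0]
    return None
```

Because it returns at the first data line, a sketch like this never sees (and so never validates) the rest of the file.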

augur sequence-traits

Augur sequence-traits defines its own read_in_translate_vcf function, but see this comment for context.

augur.io.vcf

Update: removed in #1366

augur.io.vcf includes read_vcf, which is used only by augur index. The name is a bit of a misnomer, as it actually returns only the strain names (from the VCF header). TreeTime's read_vcf is more thorough, as it (now) checks the validity of the VCF file; however, it doesn't actually return a list of the strain names. It could very easily be extended to do so, and then I think we should replace Augur's function with TreeTime's.

(TreeTime's function will be slower because it parses the entire VCF, but (a) I think this difference will be negligible and (b) this will flag up errors in the VCF data lines, which I think is appropriate.)
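To make the header-only approach concrete, here is a minimal sketch of pulling strain names from the #CHROM header line without touching any data lines. It is neither Augur's nor TreeTime's code; the function name strain_names_from_header is hypothetical, and the sketch assumes the file carries genotype data, so that sample names begin after the nine fixed columns (CHROM through FORMAT).

```python
import gzip
from typing import List

# Assumption: genotype data is present, so the header has 8 mandatory
# columns plus FORMAT before the sample (strain) names begin.
N_FIXED_COLUMNS = 9


def strain_names_from_header(vcf_path: str) -> List[str]:
    """Return the sample names listed on the '#CHROM' header line.

    Hypothetical helper for illustration only. It stops reading at the
    header, so errors in the data lines go unnoticed.
    """
    # Assumption: compressed inputs are identified by a ".gz" suffix.
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, mode="rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                columns = line.rstrip("\r\n").split("\t")
                return columns[N_FIXED_COLUMNS:]
            if not line.startswith("#"):
                # Reached a data line without seeing the header line.
                break
    raise ValueError(f"No '#CHROM' header line found in {vcf_path!r}")
```

This shows the trade-off in question: reading only the header is cheap, but a malformed data line further down the file would pass through silently.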

jameshadfield commented 7 months ago

(I'm going to re-open this issue and generalise it.)