neherlab / treetime

Maximum likelihood inference of time stamped phylogenies and ancestral reconstruction
MIT License
224 stars 55 forks source link

Improve VCF parsing #263

Closed jameshadfield closed 8 months ago

jameshadfield commented 10 months ago

This isn't complete yet, but if people wanted to take a look and see if they are happy with the direction this is going in that'd be great!

I still want to take a close look at how insertions are being encoded in the "sequences" output, and is treetime/augur is ok with that (i.e. data['sequences'][<sample>][<pos>] is more than one character). And ideally I'd sort out the writing of VCFs as well to both preserve the ploidy representation of alleles and also use the correct chromosome name!

jameshadfield commented 10 months ago

@rneher could you suggest someone to review it (or look at it yourself?) 🙏

There are a number of bugfixes related to parsing haploid (or polyploid homozygous) VCFs and I'm now reasonably confident the parsing is correct. The encoding of insertions has the potential to double count the base before the insertion (see comments added in code here) but I don't know what treetime does with these; for the purposes of augur ancestral, which is what motivated this work, insertions are not considered.

I haven't touched the parsing of heterozygous genotypes.

I've fixed a number of bugs in the writing of VCFs, but I haven't thoroughly reviewed that code and the test VCF I've added here is pretty simple.

rneher commented 9 months ago

thanks, @jameshadfield! I'll have a look. This code hasn't really been looked at in a long time and was written for rather specific needs in our early MTb work, so really appreciate some attention!

rneher commented 9 months ago

this all looks good to me. happy to merge it in and make a release.

jameshadfield commented 9 months ago

this all looks good to me. happy to merge it in and make a release.

Thanks! I'm just doing some final checks comparing augur ancestral & augur translate using FASTA vs VCF inputs on the Cholera dataset so let's hold off on this for a few days just in case there are any more little fixes I need to add.

jameshadfield commented 9 months ago

@rneher I'm happy with the state of this now and don't plan to do any more work here. There were a few more commits added since you last saw it. Once this is merged and released I'll update https://github.com/nextstrain/augur/pull/1355 to use the newest version.