roblanf / sarscov2phylo

Global phylogenies of SARS-CoV-2 sequences
GNU General Public License v3.0

Global masking VCF #6

Closed tseemann closed 4 years ago

tseemann commented 4 years ago

https://github.com/W-L/ProblematicSites_SARS-CoV2

This study suggests sites that should be masked. Could you support this? Does it coincide with your existing filtering?

roblanf commented 4 years ago

Yep, working on it now. I'm not 100% convinced that all of those sites should be masked, but I'm also keen to make sure that decisions are all evidence-based. So my feeling for now is that I should mask all of the sites they suggest, and that anyone (including me) who thinks the masking should differ should present the evidence on their repo. As long as they are receptive to discussion and new evidence, this won't be a problem.

roblanf commented 4 years ago

Also see https://github.com/roblanf/sarscov2phylo/issues/2

roblanf commented 4 years ago

Now done. Latest tree is currently running, so I'll close this once it's finished in ~24h.

But just to note, I mask every site in this file:

https://raw.githubusercontent.com/W-L/ProblematicSites_SARS-CoV2/master/subset_vcf/problematic_sites_sarsCov2.mask.vcf

The masking step also requires the latest pre-release of goalign, which I just download with wget (nice solution eh?)
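For anyone wanting to see what masking from that VCF amounts to, here is a minimal Python sketch. It is not the goalign-based step the pipeline actually runs. It assumes the alignment is in reference coordinates (column 1 is reference position 1), that the VCF data lines are tab-separated with the 1-based position in the second column, and that masked sites are replaced with `N`. The function names and the toy sequences are hypothetical.

```python
def read_mask_positions(vcf_text):
    """Collect the 1-based POS values from the data lines of a VCF.

    Header lines (starting with '#') and blank lines are skipped.
    """
    positions = set()
    for line in vcf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        positions.add(int(fields[1]))  # column 2 of a VCF is POS
    return positions


def mask_sequence(seq, positions, mask_char="N"):
    """Replace the characters at the given 1-based positions with mask_char."""
    chars = list(seq)
    for pos in positions:
        if 1 <= pos <= len(chars):  # ignore positions beyond the sequence
            chars[pos - 1] = mask_char
    return "".join(chars)


def mask_alignment(records, positions):
    """Mask every sequence in a {name: aligned_sequence} dictionary."""
    return {name: mask_sequence(seq, positions) for name, seq in records.items()}


if __name__ == "__main__":
    # Toy VCF and alignment, for illustration only.
    toy_vcf = (
        "##fileformat=VCFv4.2\n"
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
        "ref\t2\t.\tC\t.\t.\tmask\t.\n"
        "ref\t5\t.\tA\t.\t.\tmask\t.\n"
    )
    toy_alignment = {"seq1": "ACGTAC", "seq2": "ACGTTC"}
    masked = mask_alignment(toy_alignment, read_mask_positions(toy_vcf))
    for name, seq in masked.items():
        print(name, seq)
```

Masking with `N` rather than deleting the columns keeps the alignment coordinates unchanged, so later position-based steps still line up with the reference.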

roblanf commented 4 years ago

Done. Tree from filtered sequences is in this release: https://github.com/roblanf/sarscov2phylo/releases/tag/6-6-20

roblanf commented 4 years ago

Perhaps of interest @tseemann, I thought I'd check to see if masking the additional ~30 sites impacts the branch support. Short answer: it really doesn't seem to.

Here's the distribution of branch support in the tree from 3rd of June (without masking sites)

[image: distribution of branch support, 3rd of June tree, without masking sites]

And here's the same distribution from 6th of June (after masking sites)

[image: distribution of branch support, 6th of June tree, after masking sites]

Before masking sites, 34.6% of TBE values are >0.5. After masking sites, this changes to 34.7%. Similarly, before masking sites, 2.6% of TBE values are >0.9. After masking sites, this changes to 2.5%.

So, as expected, we appear to be masking sites that do little or nothing to help infer the tree.
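The percentages above are simple threshold counts over the branch support values. A minimal Python sketch of that kind of summary follows. It is not the script used for the numbers in this comment. It assumes the TBE values are stored as numeric internal node labels directly after each closing parenthesis in a Newick string, and the function names and the toy tree are hypothetical.

```python
import re

# Numeric label immediately after a closing parenthesis, e.g. ")0.95:0.3".
# Assumption: supports are written as plain or scientific-notation numbers.
SUPPORT_PATTERN = re.compile(r"\)(\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)")


def extract_supports(newick):
    """Return the internal node support values found in a Newick string."""
    return [float(label) for label in SUPPORT_PATTERN.findall(newick)]


def percent_above(values, threshold):
    """Percentage of values strictly greater than the threshold."""
    if not values:
        raise ValueError("no support values to summarise")
    return 100.0 * sum(v > threshold for v in values) / len(values)


if __name__ == "__main__":
    # Toy tree with three supported internal branches, for illustration only.
    toy_tree = "((A:0.1,B:0.2)0.95:0.3,((C:0.1,D:0.1)0.40:0.2,E:0.1)0.60:0.1);"
    supports = extract_supports(toy_tree)
    for threshold in (0.5, 0.9):
        print(f"TBE > {threshold}: {percent_above(supports, threshold):.1f}%")
```

Running the same summary on the trees from before and after masking gives a like-for-like comparison, provided both trees store their supports in the same way.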