mbhall88 / head_to_head_pipeline

Snakemake pipelines to run the analysis for the Illumina vs. Nanopore comparison.
GNU General Public License v3.0

Produce distance matrix from multi-sample VCF #60

Closed mbhall88 closed 3 years ago

mbhall88 commented 3 years ago

One thing we spoke about when we originally discussed this, @iqbal-lab, was that when computing the pairwise distances from the compare VCF, it might be best to encode the GT field in a matrix and calculate the distances from that.

One (potential) problem I see with this approach (although this may be a feature?) is that if, say, sample A has GT 1 and sample B has GT 2, we give them a distance of 1. However, if the GT 1 and GT 2 alleles are 2 SNPs different from each other, their distance, in the conventional sense, would be 2.
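To make the idea concrete, here is a minimal sketch of a genotype distance, not the pipeline's actual code. The sample names, GT values, the `-1` encoding for a missing call, and the choice to skip sites where either sample is missing are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical GT calls: one list per sample, one entry per VCF record.
# 0 is the reference allele, 1 and 2 are ALT alleles, -1 is a missing call (assumed encoding).
genotypes = {
    "A": [0, 1, 1, -1],
    "B": [0, 2, 1, 0],
    "C": [1, 1, 0, 0],
}


def genotype_distance(gt_a, gt_b):
    """Count sites where both samples are called and their GT values differ."""
    return sum(1 for a, b in zip(gt_a, gt_b) if a >= 0 and b >= 0 and a != b)


def distance_matrix(gts):
    """Build a symmetric sample-by-sample matrix of genotype distances."""
    samples = sorted(gts)
    matrix = {s: {t: 0 for t in samples} for s in samples}
    for s, t in combinations(samples, 2):
        d = genotype_distance(gts[s], gts[t])
        matrix[s][t] = matrix[t][s] = d
    return matrix


matrix = distance_matrix(genotypes)
```

At the second record, sample A has GT 1 and sample B has GT 2, and that site adds exactly 1 to their distance no matter how many bases separate the two ALT alleles, which is the behaviour described above.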

iqbal-lab commented 3 years ago

Does this "if GT 1 and GT 2 are 2 SNPs different from each other" mean the edit distance between alleles 1 and 2 is 2?

Or do you mean we have a triallelic site, so GT 1 and GT 2 are both SNPs that differ from the ref?

mbhall88 commented 3 years ago

Yes, the edit distance between them is 2.

iqbal-lab commented 3 years ago

Right, so I'd ignore any non-SNP variant to match PHE and the field. We're doing a SNP distance of clockwise SNPs, and the edit-distance-2 stuff might have occurred in one event.
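One possible reading of "ignore any non-SNP variant", sketched under assumptions: keep a record only when REF and every ALT allele are a single base. The `is_snp_record` helper and the example records are hypothetical and not taken from the pipeline.

```python
BASES = {"A", "C", "G", "T"}


def is_snp_record(ref, alts):
    """True when REF and every ALT allele are exactly one nucleotide."""
    return ref in BASES and all(alt in BASES for alt in alts)


# Hypothetical (REF, ALTs) pairs standing in for VCF records.
records = [
    ("A", ["C"]),       # biallelic SNP: kept
    ("A", ["C", "G"]),  # triallelic SNP site: kept
    ("AT", ["A"]),      # deletion: dropped
    ("ACG", ["TCA"]),   # multi-base substitution (edit distance 2): dropped
]

snp_records = [rec for rec in records if is_snp_record(*rec)]
```

With a filter like this, alleles that differ by more than one base never reach the distance calculation, so every remaining GT difference corresponds to a single SNP.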

mbhall88 commented 3 years ago

So the decision from the meeting is to try genotype distance and see how that works.

mbhall88 commented 3 years ago

#7 documents the results of this matrix. Closing as the matrix is produced; performance will be tracked on #7.