mbhall88 / head_to_head_pipeline

Snakemake pipelines to run the analysis for the Illumina vs. Nanopore comparison.
GNU General Public License v3.0

SNP distance between Illumina and Nanopore calls #41

Closed mbhall88 closed 4 years ago

mbhall88 commented 4 years ago

One of the main figures for the paper will show a scatter plot of pairwise SNP distances. The two axes will be the Illumina SNP distance and the Nanopore SNP distance for each pair of samples. If clustering with Nanopore is interchangeable with Illumina clustering, we would hope to see a perfect diagonal line from the bottom left to the top right, indicating that the distance between samples is the same for both technologies' SNP calls. Failing this, we would hope for a linear relationship of some sort, which would allow us to provide a recommendation for SNP thresholds when using Nanopore.

mbhall88 commented 4 years ago

Interactive heatmaps for each caller's distance matrix and the interactive dotplot of both matrices against each other are now in the report [41dc67c]. Some static screenshots are below.

dotplot

image

bcftools heatmap

image

compass heatmap

image

iqbal-lab commented 4 years ago

Are the samples ordered the same in the two heatmaps?

iqbal-lab commented 4 years ago

order looks same to me

iqbal-lab commented 4 years ago

interesting that compass SNP distance is consistently higher than bcftools. also, if you replot the scatter plot zoomed in for distances <50bp, looks like everything is on y=x. This zoomed in bit is what determines whether things are called as a cluster

mbhall88 commented 4 years ago

> Are the samples ordered the same in the two heatmaps?

Yes

> also, if you replot the scatter plot zoomed in for distances <50bp, looks like everything is on y=x. This zoomed in bit is what determines whether things are called as a cluster

Yes, I noticed this and had the same thought. It's easy enough to create an equation for data points below a distance of N. I will do this today.

mbhall88 commented 4 years ago

Ok, so I have added a plot for "close" samples (as deemed by compass) and changed the behaviour of both plots slightly. The change is just the removal of the identity pairs, that is, the diagonal of the distance matrix (in addition to the lower triangle, which I was already removing). These identity pairs should not contribute to the calculation of the line of best fit, as they skew the parameters. The report has been updated to reflect this in https://github.com/mbhall88/head_to_head_pipeline/commit/5566b8a854bf1c947978ce2270be13d910e350a1
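For readers following along, here is a minimal sketch of that filtering step followed by an ordinary least-squares fit. It is not the code from the commit above: the helper names, the toy matrices, and the choice of which caller goes on which axis are all assumptions made for illustration.

```python
from itertools import combinations


def pairwise_distances(matrix):
    """Return the strict upper-triangle entries of a square distance matrix.

    Skips the diagonal (identity pairs) and the lower triangle (mirror pairs).
    """
    n = len(matrix)
    return [matrix[i][j] for i, j in combinations(range(n), 2)]


def least_squares_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up symmetric matrices for three samples (not data from this issue).
compass = [[0, 10, 40], [10, 0, 30], [40, 30, 0]]
bcftools = [[0, 8, 32], [8, 0, 24], [32, 24, 0]]

x = pairwise_distances(compass)   # one value per unordered sample pair
y = pairwise_distances(bcftools)
slope, intercept = least_squares_line(x, y)
print(f"y = {slope:.3f}x + {intercept:.3f}")
```

With three samples this keeps three pairs instead of nine matrix cells, so the zero-distance diagonal no longer pulls the fit towards the origin.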

Full dataset

image

Close dotplot

Close is defined as compass SNP distance of 100 or less.

image
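The "close" subset can be sketched as a simple filter on the compass distance. The function name and the example distances below are hypothetical; only the threshold of 100 comes from the definition above.

```python
def close_pairs(xs, ys, threshold=100):
    """Keep the (x, y) pairs whose x distance is at or below the threshold."""
    kept = [(x, y) for x, y in zip(xs, ys) if x <= threshold]
    return [x for x, _ in kept], [y for _, y in kept]


# Made-up pairwise distances (not data from this issue).
compass_dist = [5, 12, 100, 101, 850]
bcftools_dist = [4, 11, 90, 95, 700]

close_x, close_y = close_pairs(compass_dist, bcftools_dist)
print(close_x)  # [5, 12, 100]
print(close_y)  # [4, 11, 90]
```

The line of best fit for the close plot would then be computed on `close_x` and `close_y` only.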

Conclusion

So for defining clusters, it seems better to use the equation fitted to the "close" samples.

iqbal-lab commented 4 years ago

Possibly I am misunderstanding your last post. Here's a thought experiment. Suppose there are a million points. Suppose all dots lie on y=x except one, which lies on the x axis. What should the line of best fit be? If we ignore the 999999 points on y=x, we conclude it is the x axis.
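A rough numerical version of the thought experiment, scaled down to 1,000 points. As an assumption for this sketch only, the line is forced through the origin so that a single point is enough to define it; this is not necessarily the fit used in the report.

```python
def slope_through_origin(points):
    """Least-squares slope of y = slope * x, a line forced through the origin."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / sxx


# Scaled down from a million points to 1,000 so it runs instantly.
on_diagonal = [(d, d) for d in range(1, 1000)]  # 999 points on y = x
outlier = (500, 0)                              # one point on the x axis

slope_all = slope_through_origin(on_diagonal + [outlier])
slope_outlier_only = slope_through_origin([outlier])

print(slope_all)           # just below 1: the fit stays close to y = x
print(slope_outlier_only)  # 0.0: the fit collapses onto the x axis
```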

mbhall88 commented 4 years ago

I'm not sure I understand what you're getting at with the thought experiment?

iqbal-lab commented 4 years ago

OK, easier to chat in 20 mins and clarify here after