Closed by mbhall88 4 years ago
Interactive heatmaps for each caller's distance matrix and the interactive dotplot of both matrices against each other are now in the report [41dc67c]. Some static screenshots below.
Are the samples ordered the same in the two heatmaps?
order looks same to me
interesting that compass SNP distance is consistently higher than bcftools. also, if you replot the scatter plot zoomed in for distances <50bp, looks like everything is on y=x. This zoomed in bit is what determines whether things are called as a cluster
> Are the samples ordered the same in the two heatmaps?
Yes
> also, if you replot the scatter plot zoomed in for distances <50bp, looks like everything is on y=x. This zoomed in bit is what determines whether things are called as a cluster
Yes, I noticed this and had the same thought. It's easy enough to create an equation for data points below a distance of N. I will do this today.
Ok, so I have added a plot for "close" samples (as deemed by compass) and changed the behaviour of both plots slightly. The change is the removal of the identity pairs, that is, the diagonal of the distance matrix (in addition to the lower triangle, which I was already excluding). These identity pairs should not contribute to the calculation of the line of best fit, as they skew its parameters. The report has been updated to reflect this in https://github.com/mbhall88/head_to_head_pipeline/commit/5566b8a854bf1c947978ce2270be13d910e350a1
Close is defined as compass SNP distance of 100 or less.
So for defining clusters, it seems it is better to use the "closer" equation.
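For anyone following along, here is a minimal sketch of the filtering described above. It is not the code in the pipeline. It assumes two symmetric distance matrices with samples in the same order, compass on the x axis, and an ordinary least-squares fit. The function names and the toy numbers are made up for illustration.

```python
def upper_triangle_pairs(x_matrix, y_matrix):
    """Pair up distances from the strict upper triangle of two square matrices."""
    n = len(x_matrix)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):  # j > i skips the diagonal and the lower triangle
            pairs.append((x_matrix[i][j], y_matrix[i][j]))
    return pairs


def fit_line(pairs):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two pairs to fit a line")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    if sxx == 0:
        raise ValueError("all x values are identical, slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up toy distances for four samples, not real data.
compass = [
    [0, 10, 20, 300],
    [10, 0, 30, 310],
    [20, 30, 0, 320],
    [300, 310, 320, 0],
]
bcftools = [
    [0, 10, 20, 250],
    [10, 0, 30, 260],
    [20, 30, 0, 270],
    [250, 260, 270, 0],
]

CLOSE_THRESHOLD = 100  # "close" = compass SNP distance of 100 or less

all_pairs = upper_triangle_pairs(compass, bcftools)
close_pairs = [(x, y) for x, y in all_pairs if x <= CLOSE_THRESHOLD]

all_slope, all_intercept = fit_line(all_pairs)
close_slope, close_intercept = fit_line(close_pairs)
print("all pairs:  ", all_slope, all_intercept)
print("close pairs:", close_slope, close_intercept)
```

In this toy data the close pairs sit exactly on y=x while the distant pairs do not, so the two fits come out different. That is the reason for reporting a separate equation for the close subset.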
Possibly I am misunderstanding your last post. Here's a thought experiment. Suppose there are a million points. Suppose all dots lie on y=x, except one, which lies on the x axis. What should the line of best fit be? If we ignore the 999999 points on y=x, we conclude it is the x axis.
I'm not sure I understand what you're getting at with the thought experiment?
OK, easier to chat in 20 mins and clarify here after
One of the main figures for the paper will show a scatter plot of pairwise SNP distances. The two axes will be Illumina SNP distance and Nanopore SNP distance for each pair of samples. If clustering with Nanopore is interchangeable with Illumina clustering, we would hope to see a perfect diagonal line from the bottom left to the top right, indicating that the distance between samples is the same for both technologies' SNP calls. Failing this, we would hope for a linear relationship of some sort, which would allow us to provide a recommendation for SNP thresholds when using Nanopore.
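As a rough sketch of how such a recommendation could work, assume the fit has the form nanopore = slope * illumina + intercept. An Illumina threshold can then be mapped onto the Nanopore axis as below. The function name, the fit parameters, and the threshold of 12 are all hypothetical placeholders, not results from this analysis.

```python
def nanopore_threshold(illumina_threshold, slope, intercept):
    """Map an Illumina SNP threshold onto the Nanopore axis via y = slope * x + intercept."""
    return slope * illumina_threshold + intercept


# Hypothetical fit parameters and threshold, for illustration only.
example_slope = 0.8
example_intercept = 1.5
example_threshold = nanopore_threshold(12, example_slope, example_intercept)
print(round(example_threshold, 1))

# With a perfect diagonal (slope 1, intercept 0) the threshold is unchanged.
identity_threshold = nanopore_threshold(12, 1.0, 0.0)
```

If the points really do fall on y=x for close samples, this reduces to using the same threshold for both technologies.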