mbhall88 / head_to_head_pipeline

Snakemake pipelines to run the analysis for the Illumina vs. Nanopore comparison.

Quantify cluster similarities #65

Closed mbhall88 closed 3 years ago

mbhall88 commented 3 years ago

i.e. bcftools clustering is X% (or whatever metric) similar to the clustering produced by compass

mbhall88 commented 3 years ago

I think next week I will look at using this to compare each of the nanopore graphs' clusters to COMPASS: https://en.wikipedia.org/wiki/Fowlkes%E2%80%93Mallows_index

Can also try the Jaccard Index.
If I understand correctly, the Jaccard index is more like accuracy ((TP+TN)/total population), whereas the Fowlkes-Mallows index is the geometric mean of precision and recall (effectively the F-measure).

I guess it depends on what we are aiming for....

iqbal-lab commented 3 years ago

I really really want to see raw data before shrinking to summary indexes. Can we just show the networks/clusters first? We need to look at this and gain intuition before crushing to a summary number. Even just lists of identifiers which are clusters would have utility.

iqbal-lab commented 3 years ago

I like this FM index of yours btw

mbhall88 commented 3 years ago

A lot of the metrics for evaluating the performance of clustering rely on confusion-matrix-like classifications (i.e. true positive, false negative, etc.). I figured defining these classifications explicitly here and discussing some of the metrics would be beneficial for us later down the line.

Classifications are generally done w.r.t. pairs of samples: for every pair of samples (without replacement), is sample A clustered with sample B? In this context, we can define the classifications as:

- TP: the pair is clustered together in both the truth (COMPASS) clustering and the test clustering
- FP: the pair is clustered together in the test clustering but not in the truth clustering
- FN: the pair is clustered together in the truth clustering but not in the test clustering
- TN: the pair is not clustered together in either clustering
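
As a minimal sketch (not the notebook's actual code), assuming the cluster assignments are stored as dicts mapping sample name to cluster ID, with singletons given their own unique ID, the pairwise classifications could be tallied like so:

```python
from itertools import combinations

def pairwise_confusion(truth: dict, test: dict) -> dict:
    """Tally TP/FP/FN/TN over all unordered pairs of samples.

    truth and test map sample name -> cluster ID (singletons get their own
    unique ID); both are assumed to cover the same set of samples.
    """
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, b in combinations(sorted(truth), 2):
        together_truth = truth[a] == truth[b]
        together_test = test[a] == test[b]
        if together_truth and together_test:
            counts["TP"] += 1
        elif together_test:  # clustered by the caller but not by the truth
            counts["FP"] += 1
        elif together_truth:  # clustered by the truth but missed by the caller
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts
```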

The first part of this analysis will just be to look at the confusion matrices these definitions produce for each caller. This will give us a good handle on whether we're missing clusters or overcalling them.

The next step will be to see what threshold maximises a given cost function. So we will sweep thresholds for each caller and see the best clustering we can possibly achieve with Nanopore data.
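
Purely as a rough sketch of what such a sweep could look like (the column names, the single-linkage clustering via connected components, and the choice of the Fowlkes-Mallows score as the cost function are all assumptions here, not the pipeline's actual code):

```python
import networkx as nx
import pandas as pd
from sklearn.metrics import fowlkes_mallows_score

def clusters_at_threshold(distances: pd.DataFrame, threshold: int) -> dict:
    """Single-linkage clustering: join samples whose SNP distance is <= threshold."""
    graph = nx.Graph()
    graph.add_nodes_from(set(distances["sample1"]) | set(distances["sample2"]))
    close = distances[distances["dist"] <= threshold]
    graph.add_edges_from(zip(close["sample1"], close["sample2"]))
    # every connected component (including singletons) becomes one cluster
    return {s: i for i, comp in enumerate(nx.connected_components(graph)) for s in comp}

def sweep(nanopore_distances: pd.DataFrame, truth: dict, thresholds=range(0, 21)) -> dict:
    """Score each candidate threshold against the fixed truth clustering."""
    samples = sorted(truth)  # assumes both clusterings cover the same samples
    scores = {}
    for t in thresholds:
        test = clusters_at_threshold(nanopore_distances, t)
        scores[t] = fowlkes_mallows_score([truth[s] for s in samples],
                                          [test[s] for s in samples])
    return scores
```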

Proposed cost functions

Firstly, let's define precision and recall

Precision

The fraction of the clusterings we make that are correct, i.e. TP/(TP+FP).

Recall

The fraction of the true clusterings that we find, i.e. TP/(TP+FN).

Fowlkes-Mallows Index (G-measure; FMI)

The geometric mean of precision and recall: FMI = √(precision × recall).

Since the index is directly proportional to the number of true positives, a higher index means greater similarity between the two clusterings used to determine the index.

One disadvantage of the FMI is that it doesn't take into account TN - although we may decide this is not a concern.

An example of its use can be found in the scikit-learn user guide.
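
For reference, a minimal example of calling the scikit-learn function (made-up flat cluster labels, one per sample; only the groupings matter, not the label values):

```python
from sklearn.metrics import fowlkes_mallows_score

labels_compass = [0, 0, 1, 1, 2]   # "truth" cluster label per sample
labels_nanopore = [0, 0, 1, 2, 2]  # "test" cluster label per sample

fmi = fowlkes_mallows_score(labels_compass, labels_nanopore)  # sqrt of pairwise precision * recall
```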

Matthew's Correlation Coefficient (MCC)

This is effectively a "chance-normalised" FMI. It takes TN into account, whereas FMI does not.

An example of its use can be found in the scikit-learn user guide.

F-measure

The harmonic mean of precision and recall. The weighted version is called the F-beta measure.

When beta is set to 0, you get precision. When beta is set to 1, you get the F-measure, which is the harmonic mean of precision and recall. Setting beta to 2 weights recall twice as much as precision; setting beta to 0.5 weights precision twice as much as recall.
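
To make the relationships concrete, here is a small sketch computing these summary metrics directly from the pairwise TP/FP/FN/TN counts using the standard textbook formulas (not code from the pipeline; it assumes none of the denominators are zero):

```python
import math

def summary_metrics(tp: int, fp: int, fn: int, tn: int, beta: float = 1.0) -> dict:
    """Precision, recall, FMI, F-beta and MCC from pairwise confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        # geometric mean of precision and recall
        "FMI": math.sqrt(precision * recall),
        # weighted harmonic mean: recall counts beta times as much as precision
        "F-beta": (1 + beta**2) * precision * recall / (beta**2 * precision + recall),
        # chance-corrected correlation; the only metric here that uses TN
        "MCC": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }
```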


Discussion

I guess this boils down to whether we care much about TNs. My feeling is we probably don't want to weight them equally. In that case, I think the FMI seems the best choice. Although the F-measure is good to keep in mind in case we decide we actually care more about finding clusters than getting them right, or vice versa.

mbhall88 commented 3 years ago

You can view the exploration of this at https://nbviewer.jupyter.org/github/mbhall88/head_to_head_pipeline/blob/master/analysis/transmission_clustering/eda/clustering.ipynb

That notebook currently has plots of all clustering, along with confusion matrices and the metrics defined above.

To summarise,

I will do a threshold sweep and see how high/low we can push these metrics.

mbhall88 commented 3 years ago

The notebook has now been updated with the threshold sweep. We will probably need to discuss whether we are happy to balance precision and recall or if we care more about one than the other. This might be a question best put to the collaborators too as they might have more insight into this.

iqbal-lab commented 3 years ago

I'll take a look. My perception is that the epi folk don't look at errors at this level, between pairs of samples, but at the cluster level. So, for the 3 true illumina clusters (or however many it is): a) did we find 3 clusters (main criterion); b) for each cluster, do we add or lose samples? And i think our performance is literally evaluated looking at those. Do we overcall clusters and find too many people in each cluster (which is safe, we don't miss people, but does cost more contact tracing, which matters in some places and not others - and this is not as simple as high/low income), or do we miss people from clusters (which is bad, as we don't follow them up with contact tracing)? This is why i was keen on the visualisation.

mbhall88 commented 3 years ago

Just added the f-beta score to the notebook (https://github.com/mbhall88/head_to_head_pipeline/commit/3a649c21828111b8b9057bf3061529d5f0c7fdde) - weighting recall as twice as important as precision. Interestingly, we get nearly the same threshold as the one implied by the linear regression.

mbhall88 commented 3 years ago

Just putting this here in case we want to come back to it: Clustering.pdf

In the above document, Zam defines two metrics for a scenario where you have a sample, and you want to identify its neighbours (samples within N SNPs). How well does your method work at this?

These two metrics are (note I have changed the word neighbour to cluster to avoid confusion):

Scenario 2 was: you have a set of outbreaks (clusters), and you want to know how often you wrongly rule people in or out.

I am going to lay out a way of approaching this that I think covers both scenarios.

Let's begin by looking at the Tversky Index.
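
For reference, the standard definition of the index for two sets X and Y is:

S(X, Y) = |X ∩ Y| / (|X ∩ Y| + alpha*|X - Y| + beta*|Y - X|)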

It is an asymmetric similarity measure on sets. We can use properties of this asymmetry to define something like precision (positive predictive value; PPV) and recall (sensitivity; hit rate; true positive rate; TPR). (Sorry for the excess definitions, I just want to try and make sure we can define these terms in all their forms, as the discussion with Tim showed that precision and recall are known by other names to other groups.)
If we set alpha and beta to 1 then we get the Jaccard Index. If we set alpha to 1 and beta to 0 we get something like recall. If we set alpha to 0 and beta to 1 we get something like precision. The thinking here is that you can think of the intersection as TPs. Now if we say X is "truth" (COMPASS) and Y is "test" (Nanopore), then X-Y is analogous to FNs - i.e. how many things in X are not in Y. It then follows that Y-X is analogous to FPs - i.e. how many things in Y are not in X.
Just to go back to the Jaccard Index for a moment: it can be thought of as the critical success index (also known as the threat score), i.e. TP/(TP+FP+FN).


So my plan/suggestion is to say for each non-singleton sample in the COMPASS clustering, define X as its cluster in COMPASS and Y as its cluster in bcftools and compute the precision and recall as above. We then have two ways of aggregating these scores. We can take the average for a cluster and we can also take the average for all samples.
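
A minimal sketch of that suggestion, assuming each clustering is available as a dict mapping every sample to the set of samples in its cluster (including itself); the names and toy data below are hypothetical, not taken from the notebook:

```python
from statistics import mean

def per_sample_scores(compass: dict, nanopore: dict) -> dict:
    """Set-based precision (|X∩Y|/|Y|) and recall (|X∩Y|/|X|) for every
    non-singleton sample in the COMPASS clustering."""
    scores = {}
    for sample, X in compass.items():
        if len(X) < 2:  # skip COMPASS singletons
            continue
        Y = nanopore.get(sample, {sample})  # a sample absent from nanopore is its own singleton
        overlap = len(X & Y)
        scores[sample] = {"precision": overlap / len(Y), "recall": overlap / len(X)}
    return scores

# toy clusterings purely for illustration
compass_clusters = {s: {"s1", "s2", "s3"} for s in ("s1", "s2", "s3")} | {"s4": {"s4"}}
nanopore_clusters = {"s1": {"s1", "s2"}, "s2": {"s1", "s2"}, "s3": {"s3", "s4"}, "s4": {"s3", "s4"}}

scores = per_sample_scores(compass_clusters, nanopore_clusters)
# averaging over all samples; the per-cluster average would instead group
# the scores by each sample's COMPASS cluster before taking the mean
avg_recall = mean(s["recall"] for s in scores.values())
avg_precision = mean(s["precision"] for s in scores.values())
```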

One final thing we might like to think about is how we communicate these metrics. As there is a strong focus on reducing FNs (not excluding samples from clusters), we could present recall as the false negative rate instead, which is just 1-recall. We might also be interested in the false omission rate, which is FN/(FN+TN), but it isn't clear to me how we would define TNs in the above approach.

Note: the Wikipedia entry on confusion matrices is very useful for remembering all the different metrics defined above, how they are calculated, and all their alternative names.

mbhall88 commented 3 years ago

Using the above definitions for precision and recall, let's take a look at some parameter sweeps.

We have threshold on the x-axis. The line styles and colours represent the different metrics TPR and PPV and their value is encoded on the y-axis. The different rows indicate the distance. The bands around the lines are the 95% confidence interval and each point on the line is the mean of the metric over all nodes in the COMPASS graph. The red dashed, vertical line in each row is the threshold that would be chosen using the linear regression equation.

For example, in the first-row plot, follow the x-axis point for threshold=11 upwards until it intersects the PPV (blue) line. This point is the mean of PPV values over all nodes in the COMPASS graph for that distance. (Let me know if this still isn't clear and I'll have another go at explaining it)

(Sorry for the small font sizes)

image

mbhall88 commented 3 years ago

We can now look at the clusters in two different ways. We can focus on the individual nodes and style them in such a way to represent their precision and recall. Or we can style all the nodes in a cluster the same way as a method for encoding the information for each cluster. Below are the two approaches for clusters defined with COMPASS threshold 12 and bcftools threshold 11

Individual cluster metrics

The plots for this can be seen at https://nbviewer.jupyter.org/github/mbhall88/head_to_head_pipeline/blob/master/analysis/transmission_clustering/eda/clustering.ipynb#Individual-cluster-metrics

If we average over all nodes in the COMPASS clusters we get

Average Recall: 0.962
Average Precision: 0.829

Average cluster metrics

The plots can be seen at https://nbviewer.jupyter.org/github/mbhall88/head_to_head_pipeline/blob/master/analysis/transmission_clustering/eda/clustering.ipynb#Average-cluster-metrics

For the COMPASS clusters, we get the following (Size = cluster size; ACR12/ACP12 = average cluster recall/precision at distance threshold 12):

Size ACR12 ACP12
7 1.0 0.70
6 1.0 0.75
6 1.0 0.86
3 1.0 1.0
3 1.0 1.0
2 1.0 1.0
2 1.0 1.0
2 1.0 1.0
2 1.0 1.0
2 1.0 1.0
2 1.0 1.0
2 1.0 1.0
2 1.0 1.0
2 1.0 1.0
2 1.0 1.0
2 1.0 1.0
2 0.5 0.22
2 0.5 0.22
2 1.0 0.25
mbhall88 commented 3 years ago

After fixing the sample swap from #68 we have updated results.

Distance 12

Truth clusters annotated with similarity metrics from bcftools

bokeh_plot

Reminder: the inner colouring is recall and the outer colouring is precision

The non-perfect precision is due to the merging of two truth clusters, two cases of a single node being added to a truth cluster, and one case of two clusters being joined with the addition of another random node.

Distance 5

bokeh_plot

The only "truth" errors were adding two nodes to a truth cluster. bcftools also found 2 additional two-sample clusters.

Distance 2

bokeh_plot

The only "truth" errors were adding one node to a truth cluster and creating a two-sample cluster that doesn't exist in the truth set.

Distance 0

bokeh_plot

bcftools has 1 two-sample cluster that doesn't exist in the truth set - sample pair 17_616026 x 17_616156.


One last plot is the parameter sweep from above, but with an extra line/metric called N_diff - which is the number of truth clusters divided by the number of bcftools clusters.

image


Conclusion

I am going to tentatively say that for the following distances, the corresponding bcftools threshold should be:

However, I don't think this is final. There is a slight possibility that the mixed distances analysis (#70) may bring some other sample swap/outliers to our attention.

iqbal-lab commented 3 years ago

@mbhall88 i'm very sorry - this has been a long week with grant and then external review, so this response has been slow. I'm going to give multiple response comments. Headline is i'm quite excited

For this comment

https://github.com/mbhall88/head_to_head_pipeline/issues/65#issuecomment-797895910

I found this very clear, great.

iqbal-lab commented 3 years ago

However for this comment with the parameter sweeps, i don't understand the plots, sorry

https://github.com/mbhall88/head_to_head_pipeline/issues/65#issuecomment-798892173

i understand all the definitions, all v clear, but i think a key confusion for me is: what is a row?

"We have threshold on the x-axis. The line styles and colours represent the different metrics TPR and PPV and their value is encoded on the y-axis. The different rows indicate the distance. The bands around the lines are the 95% confidence interval and each point on the line is the mean of the metric over all nodes in the COMPASS graph. The red dashed, vertical line in each row is the threshold that would be chosen using the linear regression equation.

For example, in the first-row plot, follow the x-axis point for threshold=11 upwards until it intersects the PPV (blue) line. "

iqbal-lab commented 3 years ago

You sort of deserve a PhD just for this

https://github.com/mbhall88/head_to_head_pipeline/issues/65#issuecomment-802569268

This is great. Couple of things

  1. I really feel this is very close to being completely convincing, and much better than i expected
  2. In every PhD, i hit a point where i have a bit of a panic that i need to do a better job at teaching the student in question how to manage their manager (and their audience, more generally). In this case, i'm thinking about the epi audience, and how we describe our results to them. That comment linked above was fantastic, and indeed all of the recent work on this issue. Great great work, and v glad these are good results.
iqbal-lab commented 3 years ago

One open issue - buried in a comment above, you talk about whether to average over clusters or samples. Which are you doing?

mbhall88 commented 3 years ago

However for this comment with the parameter sweeps, i don't understand the plots, sorry

#65 (comment)

i understand all the definitions, all v clear, but i think a key confusion for me is: what is a row?

Sorry, I used the word "row", which is just terminology from the plotting library. Basically, the whole figure is a "single" plot and the "rows" are just the individual sub-plots within it (one per distance)... I hope that makes more sense?

mbhall88 commented 3 years ago

One open issue - buried in a comment above, you talk about whether to average over clusters or samples. Which are you doing?

The single figures I presented (i.e. ACR5 etc.) are averaged over all individual nodes. The colours in the cluster plots represent the average for the cluster.

iqbal-lab commented 3 years ago

Great thanks. Will ponder a bit more, but this is a great way to end the week

mbhall88 commented 3 years ago

A question from @iqbal-lab was how the pandora clusters look after fixing the sample swaps. I'll use the parameter sweep plots from above to illustrate this succinctly.

Pandora map

image

Pandora compare

image

The compare results are actually quite decent. I will rerun the clustering for compare and select the thresholds that look best from those sweeps.

mbhall88 commented 3 years ago

Pandora compare clustering

I've also added the confusion matrices for each, as they help identify whether there are lots of extra clusters - something that can't be told from the clustering plots alone. Reminder: the confusion matrices are based on the pairwise clustering metrics.

Distance 12

bokeh_plot

image

Distance 5

bokeh_plot

image

Distance 2

bokeh_plot

image

Distance 0

bokeh_plot

image


No false negatives at any of those thresholds is really good! We do have a lot more FPs than bcftools though (except for at distance 0 where we have none).

mbhall88 commented 3 years ago

@iqbal-lab, I have a question about the current methodology. SACP and SACR cannot - with the current way of running things - account for cases where Nanopore adds new clusters made of samples that are not in the Illumina clustering. For example, if the Nanopore clustering has a cluster containing samples C and D, but C and D do not occur in any cluster in the Illumina graph, then SACP and SACR will not see it, so long as neither sample shares a cluster with a sample that does appear in the Illumina graph.

Example

G - Illumina toy graph

H - Nanopore toy graph

In this example, the cluster containing samples L and M does not factor into the calculation of SACR and SACP.

The reason for this is I calculate SACP and SACR for each sample in the Illumina graph only. I did this partly because the visualisation is based on the Illumina clusters.
There are a few ways forward I guess:

  1. It doesn't matter - carry on
  2. Calculate SACP and SACR for all samples in either graph - not sure what to do about the visualisation then? I guess I would have to create a plot for each technology...
  3. Calculate SACP and SACR for all samples. I see there being two problems with this: i) the same visualisation problem as 2; ii) singletons will potentially overpower the metrics
iqbal-lab commented 3 years ago

Argh! Good spot. OK, i think, well, SACR should not change; it is trying to measure (roughly) what % of true clusters are found. SACP... Well, we could add a new metric, which we don't plot, called XCR (excess clustering rate), asking what % of true singletons are clustered. That keeps everything clearly defining concepts we care about. What do you think?

mbhall88 commented 3 years ago

Yes, I realised last night that, given we are assuming Illumina is truth, there isn't actually a problem with SACR and SACP. But I really like the XCR metric. I'll get that value for each threshold and also replace the CNR line in the simulation plots.

mbhall88 commented 3 years ago

Right, so we define the excess clustering rate (XCR) as

XCR = |A - B| / |A|

where A is defined as the set of Illumina singletons and B is defined as the set of Nanopore singletons - i.e. XCR is the fraction of true singletons that end up in a cluster in the Nanopore graph.
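
A minimal sketch of how this could be computed, assuming each clustering is held as a networkx graph in which singletons are isolated nodes (the names here are hypothetical):

```python
import networkx as nx

def singletons(graph: nx.Graph) -> set:
    """Samples in no cluster, i.e. isolated nodes of the clustering graph."""
    return set(nx.isolates(graph))

def excess_clustering_rate(illumina: nx.Graph, nanopore: nx.Graph) -> float:
    """XCR = |A - B| / |A|: the fraction of true (Illumina) singletons that
    end up clustered in the Nanopore graph."""
    A = singletons(illumina)
    B = singletons(nanopore)
    return len(A - B) / len(A) if A else 0.0
```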

With this definition, we get the following XCR values for bcftools