jokergoo / ComplexHeatmap

Make Complex Heatmaps
https://jokergoo.github.io/ComplexHeatmap-reference/book/
Other
1.31k stars 230 forks source link

clustering with mahalanobis distance #60

Closed crazyhottommy closed 8 years ago

crazyhottommy commented 8 years ago

Hi,

According to this post https://liorpachter.wordpress.com/2014/01/19/why-do-you-look-at-the-speck-in-your-sisters-quilt-plot-and-pay-no-attention-to-the-plank-in-your-own-heat-map/

mahalanobis distance is good for high variance high expressed genes. How can I use it in ComplexHeatmap?

I googled around:

The manhattan distance and the Mahalanobis distances are quite different. One of the main differences is that a covariance matrix is necessary to calculate the Mahalanobis distance, so it's not easily accomodated by dist. There is a function in base R which does calculate the Mahalanobis distance -- mahalanobis(). So if you pass a distance matrix calculated by mahalanobis()

From the documentation
There are three ways to specify distance metric for clustering:

specify distance as a pre-defined option. The valid values are the supported methods in dist() function and within pearson, spearman and kendall. NA values are ignored for pre-defined clustering but with giving warnings (see example in Colors section).

a self-defined function which calculates distance from a matrix. The function should only contain one argument. Please note for clustering on columns, the matrix will be transposed automatically.

a self-defined function which calculates distance from two vectors. The function should only contain two arguments.

mahalanobis() has to specify at least 2 arguments. m matrix itself and a covariance matrix cov. It will be better to let ComplexHeatmap receives any number of arguments, so I can put center=FALSE as well.

Heatmap(Y[genes.3PC, ], name = "log2 RNAseq\ncounts scaled", 
        col = colorRamp2(c(-10, 0, 10), c("Darkblue", "white", "red")),
        show_row_names = FALSE,
        show_column_names = FALSE, 
        row_dend_reorder = TRUE, column_dend_reorder = FALSE, 
        clustering_distance_rows = function(m) mahalanobis(m, center =FALSE, cov = cov(m)),
        clustering_distance_columns = function(m) mahalanobis(m, center =FALSE, cov = cov(m)),
        clustering_method_rows = "complete",
        clustering_method_columns = "complete",
        top_annotation = ha.brain,
        gap = unit(0.5, "mm"))
Error in solve.default(cov, ...) : 
  system is computationally singular: reciprocal condition number = 5.56982e-19 

Thanks! Tommy

jokergoo commented 8 years ago

This line:

clustering_distance_rows = function(m) mahalanobis(m, center = FALSE, cov = cov(m)),

only needs m to be defined, thus one argument for this function is enough. You don't need to pass center and cov like function(m, center, cov), you can just put these two in mahalanobis(), as you wrote. Or you can understand in this way: the value of clustering_distance_rows is not directly mahalanobis while is a modified version of mahalanobis which has a fixed value of center and cov.

mahalanobis2 = function(m) {
    cov = cov(m)
    mahalanobis(m, center = FALSE, cov = cov)
}
Heatmap(...,
    clustering_distance_rows = mahalanobis2
)

On the other hand, you can perform clustering on rows/columns first, then pass the clustering object (e.g. a hclust or dendrogram object) to cluster_rows or cluster_columns:

row_dist = mahalanobis(Y[genes.3PC, ], center =FALSE, cov = cov(Y[genes.3PC, ]))
row_hclust = hclust(row_dist)
Heamtap(..., cluster_rows = row_hclust)

The error you shown is not because the ComplexHeatmap package, it is because the matrix which is involved in mahalanobis distance calculation is singular.

crazyhottommy commented 8 years ago

Thanks very much for your detailed answer. I understand it now. Maybe off topic of ComplexHeatmap, but do you have any idea on the singular error? Given a gene expression matrix, how to choose genes for heatmap, and do bi-clustering to avoid this singular error?

I googled around and found http://stackoverflow.com/questions/21451664/system-is-computationally-singular-error

http://stats.stackexchange.com/questions/76488/error-system-is-computationally-singular-when-running-a-glm

Thanks!

jokergoo commented 8 years ago

Sorry I cannot help you with this. I have very limited knowledge of matrix algebra. But in the case of gene expression, I think mahalanobis method can only be used on columns because normally number of rows in a gene expression matrix is larger than the number of samples. mahalanobis() use solve() to calculate the inverse of the covariance matrix. You can check the help page of solve() function.

crazyhottommy commented 8 years ago

thanks for the suggestion. I should only calculate for columns (samples).