Clusters with unrelated gene sets (overlap coefficient = 0)

Hi @jokergoo,

I came across this issue while trying to create clusters of pathways from msigdb:

When clustering gene sets there some clusters whose members very low correlation, include pairwise correlations of zero. Details and reproducible example below.

It might be an expected consequence of the methods themselves, but it is odd. My question is if you can explain it this behavior. It might be related to this issue

I am also aware of this caveat ;)

But note, as we benchmarked in the manuscript, the clustering on the gene overlap similarity performs much worse than on the semantic similarity.

library("msigdbr")
library("simplifyEnrichment")
set.seed(42)

To reproduce the issue I collected some random pathways from collection which are relevant for my real life example, and added one particular pathways which caused me to note the issue.

gene_sets <- msigdbr(species = "Homo sapiens") %>%
    subset(gs_subcat %in% c("CP:WIKIPATHWAYS", "CP:REACTOME", "GO:BP", "CP:BIOCARTA")) %>%
    .[["gs_name"]] %>%
    unique()

gene_sets_mall <- c(
    sample(gene_sets, 499),
    "WP_MIRNA_BIOGENESIS"
)

Following the package instructions, the similarity matrix is calculated using overlap because one of my ideas was to reduce the sets which are contained in others. I have not tested other similarity metrics.

sim_ov <- term_similarity_from_MSigDB(
  unique(gene_sets_mall),
  method = "overlap"
)

I followed with the clustering itself, here using dynamicTreeCut but other methods showed the same issues (bin_cut and apcluster). For data inspection I created a new table with the pairwise correlations

clusters_dtc <- simplifyEnrichment(
  sim_ov,
  method = "dynamicTreeCut"
) %>%
  `names<-`(replace(names(.), 1, c("gs_name"))) %>%
  plyr::arrange(cluster)

## Cluster 500 terms by 'dynamicTreeCut'... 42 clusters, used 0.223052 secs.

dynamicTreeCut-1

ov_df <- sim_ov %>%
    as.data.frame() %>%
    tibble::rownames_to_column() %>% 
    tidyr::pivot_longer(-rowname) %>%
    `names<-`(replace(names(.), 1:3, c("gs1", "gs2", "overlap")))

head(ov_df)

## # A tibble: 6 × 3
##   gs1                                                              gs2   overlap
##   <chr>                                                            <chr>   <dbl>
## 1 GOBP_INTRINSIC_APOPTOTIC_SIGNALING_PATHWAY_BY_P53_CLASS_MEDIATOR GOBP…  1     
## 2 GOBP_INTRINSIC_APOPTOTIC_SIGNALING_PATHWAY_BY_P53_CLASS_MEDIATOR GOBP…  0     
## 3 GOBP_INTRINSIC_APOPTOTIC_SIGNALING_PATHWAY_BY_P53_CLASS_MEDIATOR REAC…  0     
## 4 GOBP_INTRINSIC_APOPTOTIC_SIGNALING_PATHWAY_BY_P53_CLASS_MEDIATOR GOBP…  0.0127
## 5 GOBP_INTRINSIC_APOPTOTIC_SIGNALING_PATHWAY_BY_P53_CLASS_MEDIATOR REAC…  0.0380
## 6 GOBP_INTRINSIC_APOPTOTIC_SIGNALING_PATHWAY_BY_P53_CLASS_MEDIATOR GOBP…  0

Now let’s find the offending cluster and pick a biologically unrelated gene set:

subset(clusters_dtc, gs_name == "WP_MIRNA_BIOGENESIS")

##                 gs_name cluster
## 478 WP_MIRNA_BIOGENESIS      41

clust <- 29
subset(clusters_dtc, cluster == clust) %>%
    head()

##                                                                    gs_name
## 229 WP_ENVELOPE_PROTEINS_AND_THEIR_POTENTIAL_ROLES_IN_EDMD_PHYSIOPATHOLOGY
## 230                                            REACTOME_SIGNALLING_TO_ERKS
## 231                                                  BIOCARTA_CDK5_PATHWAY
## 232           GOBP_REGULATION_OF_EARLY_ENDOSOME_TO_LATE_ENDOSOME_TRANSPORT
## 233                                 WP_INTERLEUKIN1_IL1_STRUCTURAL_PATHWAY
## 234                 GOBP_REGULATION_OF_RECEPTOR_SIGNALING_PATHWAY_VIA_STAT
##     cluster
## 229      29
## 230      29
## 231      29
## 232      29
## 233      29
## 234      29

subset(ov_df, gs1 == "WP_MIRNA_BIOGENESIS" & gs2 == "REACTOME_TYROSINE_CATABOLISM")

## # A tibble: 0 × 3
## # ℹ 3 variables: gs1 <chr>, gs2 <chr>, overlap <dbl>

As we can see, those two gene sets, which are in the same cluster have no overlap.

Since this could simply be becasue there are stronger overlaps with the other gene sets in the cluster, I checked all pairwise similarity values for WP_MIRNA_BIOGENESIS in that cluster vs all clusters:

subset(ov_df, gs1 == "WP_MIRNA_BIOGENESIS" & gs2 %in% subset(clusters_dtc, cluster == 29)$gs_name)$overlap %>%
    summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0       0       0       0

subset(ov_df, gs1 == "WP_MIRNA_BIOGENESIS")$overlap %>%
    summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0055  0.0000  1.0000

The mean similarity values is higher in the cluster, but still quite low. But how low?

So I compared the mean similarity per cluster, and it seems like this cluster as the lowest mean similarity score of all clusters by some margin:

avg_scores <- sapply(1:max(clusters_dtc$cluster), function(x){
subset(ov_df, gs2 %in% subset(clusters_dtc, cluster == x)$gs_name)$overlap %>%
    mean()
    }
)

cl_score_df <- data.frame(
    cluster = 1:max(clusters_dtc$cluster),
    avg_scores
)
cl_score_df[order(cl_score_df$avg_scores), ] %>%
    knitr::kable()

	cluster	avg_scores
15	15	0.0080055
41	41	0.0096109
37	37	0.0103383
23	23	0.0126614
14	14	0.0129533
13	13	0.0133325
32	32	0.0145801
16	16	0.0149111
24	24	0.0149260
42	42	0.0151883
38	38	0.0157267
22	22	0.0158478
27	27	0.0161900
40	40	0.0162276
34	34	0.0163880
25	25	0.0164008
21	21	0.0167318
30	30	0.0177726
31	31	0.0179852
26	26	0.0199100
33	33	0.0201146
19	19	0.0202515
39	39	0.0204305
10	10	0.0204674
35	35	0.0214680
11	11	0.0227303
36	36	0.0231867
18	18	0.0235248
3	3	0.0241443
28	28	0.0243209
29	29	0.0251034
5	5	0.0253350
2	2	0.0257065
8	8	0.0280559
20	20	0.0291000
7	7	0.0293294
4	4	0.0320534
12	12	0.0331101
9	9	0.0363704
17	17	0.0396287
6	6	0.0675694
1	1	0.1610511

So it seems like the clustering methods, at least those tested end up putting a number of what should be singletons in a single cluster. Is there a way to alleviate this issue, or a clustering method (or similarity metric) which is less prone to this issue?

Cheers!

jokergoo / simplifyEnrichment

Clusters with unrelated gene sets (overlap coefficient = 0) #106