jokergoo / simplifyEnrichment

Simplify functional enrichment results
https://jokergoo.github.io/simplifyEnrichment
Other
108 stars 16 forks source link

Clusters with unrelated gene sets (overlap coefficient = 0) #106

Open adomingues opened 5 months ago

adomingues commented 5 months ago

Hi @jokergoo,

I came across this issue while trying to create clusters of pathways from msigdb:

When clustering gene sets there some clusters whose members very low correlation, include pairwise correlations of zero. Details and reproducible example below.

It might be an expected consequence of the methods themselves, but it is odd. My question is if you can explain it this behavior. It might be related to this issue

I am also aware of this caveat ;)

But note, as we benchmarked in the manuscript, the clustering on the gene overlap similarity performs much worse than on the semantic similarity.

library("msigdbr")
library("simplifyEnrichment")
set.seed(42)

To reproduce the issue I collected some random pathways from collection which are relevant for my real life example, and added one particular pathways which caused me to note the issue.

gene_sets <- msigdbr(species = "Homo sapiens") %>%
    subset(gs_subcat %in% c("CP:WIKIPATHWAYS", "CP:REACTOME", "GO:BP", "CP:BIOCARTA")) %>%
    .[["gs_name"]] %>%
    unique()

gene_sets_mall <- c(
    sample(gene_sets, 499),
    "WP_MIRNA_BIOGENESIS"
)

Following the package instructions, the similarity matrix is calculated using overlap because one of my ideas was to reduce the sets which are contained in others. I have not tested other similarity metrics.

sim_ov <- term_similarity_from_MSigDB(
  unique(gene_sets_mall),
  method = "overlap"
)

I followed with the clustering itself, here using dynamicTreeCut but other methods showed the same issues (bin_cut and apcluster). For data inspection I created a new table with the pairwise correlations

clusters_dtc <- simplifyEnrichment(
  sim_ov,
  method = "dynamicTreeCut"
) %>%
  `names<-`(replace(names(.), 1, c("gs_name"))) %>%
  plyr::arrange(cluster)
## Cluster 500 terms by 'dynamicTreeCut'... 42 clusters, used 0.223052 secs.

dynamicTreeCut-1

ov_df <- sim_ov %>%
    as.data.frame() %>%
    tibble::rownames_to_column() %>% 
    tidyr::pivot_longer(-rowname) %>%
    `names<-`(replace(names(.), 1:3, c("gs1", "gs2", "overlap")))

head(ov_df)
## # A tibble: 6 × 3
##   gs1                                                              gs2   overlap
##   <chr>                                                            <chr>   <dbl>
## 1 GOBP_INTRINSIC_APOPTOTIC_SIGNALING_PATHWAY_BY_P53_CLASS_MEDIATOR GOBP…  1     
## 2 GOBP_INTRINSIC_APOPTOTIC_SIGNALING_PATHWAY_BY_P53_CLASS_MEDIATOR GOBP…  0     
## 3 GOBP_INTRINSIC_APOPTOTIC_SIGNALING_PATHWAY_BY_P53_CLASS_MEDIATOR REAC…  0     
## 4 GOBP_INTRINSIC_APOPTOTIC_SIGNALING_PATHWAY_BY_P53_CLASS_MEDIATOR GOBP…  0.0127
## 5 GOBP_INTRINSIC_APOPTOTIC_SIGNALING_PATHWAY_BY_P53_CLASS_MEDIATOR REAC…  0.0380
## 6 GOBP_INTRINSIC_APOPTOTIC_SIGNALING_PATHWAY_BY_P53_CLASS_MEDIATOR GOBP…  0

Now let’s find the offending cluster and pick a biologically unrelated gene set:

subset(clusters_dtc, gs_name == "WP_MIRNA_BIOGENESIS")
##                 gs_name cluster
## 478 WP_MIRNA_BIOGENESIS      41
clust <- 29
subset(clusters_dtc, cluster == clust) %>%
    head()
##                                                                    gs_name
## 229 WP_ENVELOPE_PROTEINS_AND_THEIR_POTENTIAL_ROLES_IN_EDMD_PHYSIOPATHOLOGY
## 230                                            REACTOME_SIGNALLING_TO_ERKS
## 231                                                  BIOCARTA_CDK5_PATHWAY
## 232           GOBP_REGULATION_OF_EARLY_ENDOSOME_TO_LATE_ENDOSOME_TRANSPORT
## 233                                 WP_INTERLEUKIN1_IL1_STRUCTURAL_PATHWAY
## 234                 GOBP_REGULATION_OF_RECEPTOR_SIGNALING_PATHWAY_VIA_STAT
##     cluster
## 229      29
## 230      29
## 231      29
## 232      29
## 233      29
## 234      29
subset(ov_df, gs1 == "WP_MIRNA_BIOGENESIS" & gs2 == "REACTOME_TYROSINE_CATABOLISM")
## # A tibble: 0 × 3
## # ℹ 3 variables: gs1 <chr>, gs2 <chr>, overlap <dbl>

As we can see, those two gene sets, which are in the same cluster have no overlap.

Since this could simply be becasue there are stronger overlaps with the other gene sets in the cluster, I checked all pairwise similarity values for WP_MIRNA_BIOGENESIS in that cluster vs all clusters:

subset(ov_df, gs1 == "WP_MIRNA_BIOGENESIS" & gs2 %in% subset(clusters_dtc, cluster == 29)$gs_name)$overlap %>%
    summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0       0       0       0
subset(ov_df, gs1 == "WP_MIRNA_BIOGENESIS")$overlap %>%
    summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0055  0.0000  1.0000

The mean similarity values is higher in the cluster, but still quite low. But how low?

So I compared the mean similarity per cluster, and it seems like this cluster as the lowest mean similarity score of all clusters by some margin:

avg_scores <- sapply(1:max(clusters_dtc$cluster), function(x){
subset(ov_df, gs2 %in% subset(clusters_dtc, cluster == x)$gs_name)$overlap %>%
    mean()
    }
)

cl_score_df <- data.frame(
    cluster = 1:max(clusters_dtc$cluster),
    avg_scores
)
cl_score_df[order(cl_score_df$avg_scores), ] %>%
    knitr::kable()
cluster avg_scores
15 15 0.0080055
41 41 0.0096109
37 37 0.0103383
23 23 0.0126614
14 14 0.0129533
13 13 0.0133325
32 32 0.0145801
16 16 0.0149111
24 24 0.0149260
42 42 0.0151883
38 38 0.0157267
22 22 0.0158478
27 27 0.0161900
40 40 0.0162276
34 34 0.0163880
25 25 0.0164008
21 21 0.0167318
30 30 0.0177726
31 31 0.0179852
26 26 0.0199100
33 33 0.0201146
19 19 0.0202515
39 39 0.0204305
10 10 0.0204674
35 35 0.0214680
11 11 0.0227303
36 36 0.0231867
18 18 0.0235248
3 3 0.0241443
28 28 0.0243209
29 29 0.0251034
5 5 0.0253350
2 2 0.0257065
8 8 0.0280559
20 20 0.0291000
7 7 0.0293294
4 4 0.0320534
12 12 0.0331101
9 9 0.0363704
17 17 0.0396287
6 6 0.0675694
1 1 0.1610511

So it seems like the clustering methods, at least those tested end up putting a number of what should be singletons in a single cluster. Is there a way to alleviate this issue, or a clustering method (or similarity metric) which is less prone to this issue?

Cheers!