Validate that correlation modules are biologically relevant

greenelab / core-accessory-interactome

Investigating the functional relationship between P. aeruginosa core and accessory genes.

BSD 3-Clause "New" or "Revised" License

This PR updates the validation of the correlation modules generated in a few ways:

We examined the coverage of co-operonic/co-regulonic genes across modules. We would expect the probability that a pair of genes is in the same module, given that they are in the same regulon, to be higher with the true module labels than with shuffled labels, which is what we see below.
We compared the composition of modules across the array and RNA-seq compendia. We would expect the modules to be consistent, and they are, based on the distribution of p-values that indicate how well modules from the array compendium map to the RNA-seq compendium: the p-values are lower with the true module labels than with shuffled labels.
We looked for enrichment of modules in KEGG pathways. We would expect some modules to correlate with KEGG pathways. The result isn't what I would expect, though this may not be the best way to assess it.
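For readers who want to see the first check concretely, here is a minimal sketch of the "same module given same regulon" comparison against shuffled labels. It is not the code from this PR: the function names, the toy gene/regulon names, and the choice to shuffle labels while preserving module sizes are all assumptions made for illustration.

```python
import random
from itertools import combinations


def same_module_given_same_regulon(module_of, regulons):
    """Estimate P(same module | same regulon) over co-regulon gene pairs.

    A pair that shares more than one regulon is counted once per regulon.
    """
    same = 0
    total = 0
    for genes in regulons.values():
        # Only genes that were assigned a module can be scored.
        members = sorted(g for g in genes if g in module_of)
        for a, b in combinations(members, 2):
            total += 1
            same += module_of[a] == module_of[b]
    return same / total if total else float("nan")


def shuffled_labels(module_of, rng):
    """Permute module labels across genes, preserving module sizes."""
    genes = list(module_of)
    labels = [module_of[g] for g in genes]
    rng.shuffle(labels)
    return dict(zip(genes, labels))


# Toy data (hypothetical gene and regulon names).
module_of = {"g1": 0, "g2": 0, "g3": 0, "g4": 1, "g5": 1, "g6": 1}
regulons = {"regA": ["g1", "g2", "g3"], "regB": ["g4", "g5"]}

rng = random.Random(0)
true_p = same_module_given_same_regulon(module_of, regulons)
null_p = [
    same_module_given_same_regulon(shuffled_labels(module_of, rng), regulons)
    for _ in range(1000)
]
mean_null = sum(null_p) / len(null_p)
print(f"true: {true_p:.2f}, shuffled mean: {mean_null:.2f}")
```

In this toy case every co-regulon pair shares a module under the true labels, so the true estimate sits well above the shuffled baseline. On real data the gap would be smaller and is best read against the full shuffled distribution rather than its mean alone.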

Cool results! Sorry for the late review, I totally missed this notification yesterday.

A few comments:

To what extent does clustering reflect the biology of PA gene expression? Does the assumption that each gene should live in one module hold? (This isn't a criticism, I legitimately have no idea what the structure of bacterial gene expression regulation looks like/how strong regulons are as compared to operons).

Very good question! There is another method that weights a gene's contribution to different modules, so that genes do not belong exclusively to a single cluster. I think exclusive membership is a simplifying assumption that people make for these types of analyses. https://arxiv.org/abs/2106.00657

We were considering using this method for our work, but it currently requires integer values as input. It looks like they are planning to extend it soon.

If you're looking for another clustering method at some point, spectral clustering might be a good fit. It treats clustering as cutting up a similarity network, which at least aesthetically is similar to trying to partition a gene coexpression network.
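To make the "cutting up a similarity network" idea concrete, here is a rough sketch of the simplest case: a two-way spectral split using the sign of the Fiedler vector of the unnormalized graph Laplacian. The function name and the toy similarity matrix are made up for illustration. For real expression data, how to turn correlations into non-negative similarities (for example, absolute correlation) is a separate choice that this sketch doesn't address.

```python
import numpy as np


def spectral_bipartition(similarity):
    """Split nodes into two groups using the Fiedler vector of the Laplacian."""
    w = np.asarray(similarity, dtype=float).copy()
    np.fill_diagonal(w, 0.0)  # ignore self-similarity
    degree = w.sum(axis=1)
    laplacian = np.diag(degree) - w
    # eigh returns eigenvalues in ascending order for symmetric matrices,
    # so column 1 is the eigenvector of the second-smallest eigenvalue.
    _, eigvecs = np.linalg.eigh(laplacian)
    fiedler = eigvecs[:, 1]
    return (fiedler > 0).astype(int)


# Toy similarity matrix: two tight blocks of genes joined by weak edges.
sim = np.full((6, 6), 0.05)
sim[:3, :3] = 0.9
sim[3:, 3:] = 0.9
labels = spectral_bipartition(sim)
print(labels)
```

Which block gets label 0 or 1 is arbitrary, since an eigenvector's sign is not determined. For more than two modules, scikit-learn's `SpectralClustering` with `affinity="precomputed"` accepts a similarity matrix directly.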

Thanks! I'll make a note of that

Validate that correlation modules are biologically relevant #37