kmeans clustering of gene expression

annashcherbina commented 7 years ago

K-means clustering yields some interesting results:

Plotting centroids of asinh(tpm) - sva for k=5,10,15,20,25, we see some interesting patterns emerging for k>=10, especially for the lateG1 timepoint:

asinhtpm_k5 asinhtpm_k10 asinhtpm_k15 asinhtpm_k20 asinhtpm_25
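A minimal sketch of this sweep over k, for anyone wanting to reproduce the idea. It assumes a genes x samples matrix of TPM values and uses scikit-learn's `KMeans`. The synthetic matrix, the helper name `kmeans_centroids`, and the seed are illustrative assumptions, and the SVA correction step is not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans


def kmeans_centroids(expr, ks=(5, 10, 15, 20, 25), seed=0):
    """Fit k-means once per k and return {k: (labels, centroids)}."""
    results = {}
    for k in ks:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(expr)
        results[k] = (km.labels_, km.cluster_centers_)
    return results


# Synthetic stand-in for real data: 300 genes x 8 samples of non-negative "TPM".
rng = np.random.default_rng(0)
tpm = rng.gamma(shape=2.0, scale=50.0, size=(300, 8))

# asinh transform as in the comment above; SVA correction is omitted in this sketch.
expr = np.arcsinh(tpm)

results = kmeans_centroids(expr)
labels_25, centroids_25 = results[25]  # one centroid row per cluster, one column per sample
```

Each row of a centroid matrix can then be plotted as a line across samples to get the centroid plots shown here.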

Looking at k=25, I performed (new) DAVID analysis of some interesting clusters:

genes up in all timepoints with DMSO treatment

cluster22 cluster22 david

genes up in DMSO treatment in late SG2M

cluster4 cluster4 david

the converse -- genes up in Control in late SG2M

cluster23 cluster23 david

genes up in DMSO treatment in lateG1

cluster15 cluster15 david

the converse -- genes down in DMSO treatment in lateG1

cluster24 cluster24 david

None of the other clusters had significant DAVID associations. I think this is very exciting -- I tried to run the differential genes from limma analysis on asinh(tpm)-sva through DAVID and did not get significant associations. It is encouraging to see these in the k-means clustering.
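For reference, per-cluster gene lists of the kind submitted to DAVID can be assembled from the k-means labels roughly as follows. The helper name `genes_by_cluster` and the gene identifiers are hypothetical placeholders, not results from this analysis.

```python
from collections import defaultdict


def genes_by_cluster(gene_ids, labels):
    """Group gene identifiers by their cluster label."""
    groups = defaultdict(list)
    for gene, label in zip(gene_ids, labels):
        groups[int(label)].append(gene)
    return dict(groups)


# Hypothetical identifiers and labels, for illustration only.
gene_ids = ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"]
labels = [22, 4, 22, 15, 4]

clusters = genes_by_cluster(gene_ids, labels)

# One identifier per line is a convenient format for pasting a list into a web tool.
cluster22_text = "\n".join(clusters[22])
```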

annashcherbina commented 7 years ago

However, trying the same approach for rlogTransform(counts) did not yield interesting clusters: rlog_k5 rlog_k15 rlog_k25

I think the problem is that the transform seems to be "squishing" all the expression values towards the mean. This would explain the volcano plots and the low number of differential genes from the limma analysis when rlogTransform(counts) was used as the input.
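One rough way to check the "squishing" hypothesis is to compare the per-gene spread across samples under each transform. The sketch below is only a diagnostic on synthetic counts. It does not run rlog itself: `log2(counts + 500)` is used purely as a stand-in for a strongly compressing transform, and the helper name `median_gene_range` is an assumption.

```python
import numpy as np


def median_gene_range(mat):
    """Median over genes of (max - min) across samples."""
    return float(np.median(mat.max(axis=1) - mat.min(axis=1)))


# Synthetic counts: 500 genes x 8 samples, each gene with its own mean.
rng = np.random.default_rng(2)
gene_means = rng.gamma(shape=2.0, scale=20.0, size=(500, 1))
counts = rng.poisson(lam=gene_means, size=(500, 8)).astype(float)

mild = np.arcsinh(counts)

# Stand-in for a heavily shrinking transform; this is NOT DESeq2's rlog.
shrunk = np.log2(counts + 500.0)

spread_mild = median_gene_range(mild)
spread_shrunk = median_gene_range(shrunk)
```

Running the same comparison on the real asinh and rlog matrices would show whether the rlog values really do have a much narrower per-gene range.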

annashcherbina commented 7 years ago

Adding more clusters (k=30) highlights some interesting details -- we see 37 MIRs and SNORAs strongly up in lateG1 in the controls but not in DMSO treatment:

mir_snora
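A quick way to tally gene families within a cluster is to count symbols by prefix. This is a minimal sketch. The gene list is a hypothetical example rather than the actual cluster membership, and a bare prefix match will also catch symbols such as host genes that merely start with `MIR`.

```python
from collections import Counter


def count_prefixes(genes, prefixes=("MIR", "SNORA")):
    """Count gene symbols that start with each of the given prefixes."""
    counts = Counter()
    for gene in genes:
        for prefix in prefixes:
            if gene.upper().startswith(prefix):
                counts[prefix] += 1
                break  # count each gene at most once
    return counts


# Hypothetical cluster membership, for illustration only.
cluster_genes = ["MIR21", "SNORA70", "GAPDH", "MIR155", "SNORA12"]
family_counts = count_prefixes(cluster_genes)
```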

Also, a highly significant (FDR ~ 1e-8) phosphoprotein group cluster:

cluster36 david

@akundaje -- it seems that the k-means clustering approach is proving a lot more fruitful than the various tweaks on limma/DESeq2 -- should I focus on the k-means analysis to identify differential genes/pathways?

akundaje commented 7 years ago

You need both. If you cluster the differential effects, you will likely see interesting things pop out as well.
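One possible reading of "cluster the differential effects" is to cluster per-gene treatment-minus-control differences at matched timepoints rather than the expression values themselves. The sketch below takes that interpretation as an assumption (fitted limma coefficients would be another option) and uses synthetic matrices with four hypothetical timepoints.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_differences(control, treated, k=10, seed=0):
    """Cluster genes on (treated - control) at matched timepoints."""
    diff = treated - control  # genes x timepoints
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(diff)
    return diff, km.labels_, km.cluster_centers_


# Synthetic stand-in: 200 genes x 4 matched timepoints, already asinh-transformed.
rng = np.random.default_rng(1)
control = np.arcsinh(rng.gamma(shape=2.0, scale=50.0, size=(200, 4)))
treated = control + rng.normal(loc=0.0, scale=0.5, size=(200, 4))

diff, labels, centroids = cluster_differences(control, treated, k=10)
```

Centroids far from zero at a given timepoint would then point to groups of genes with a shared treatment effect at that timepoint.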

Btw, what you should do is run k-means with a large k (e.g. 100) and then use hierarchical clustering on the centroids with a tight similarity threshold to merge them into a non-redundant set.
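A rough sketch of that two-stage procedure, using scikit-learn for k-means and SciPy for the hierarchical step. The choice of average linkage, correlation distance, the 0.1 cut-off, and the helper name `merged_kmeans` are all assumptions made for illustration. The suggestion above does not specify them.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans


def merged_kmeans(expr, k=100, max_dist=0.1, seed=0):
    """Over-cluster with k-means, then merge near-duplicate centroids."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(expr)

    # Average-linkage tree over the centroids, using 1 - Pearson correlation.
    tree = linkage(km.cluster_centers_, method="average", metric="correlation")

    # Cut the tree at a tight threshold; ids are 1-based, one per centroid.
    groups = fcluster(tree, t=max_dist, criterion="distance")

    # Map every gene from its k-means cluster to its merged group.
    gene_groups = groups[km.labels_]

    # Recompute one centroid per merged group (rows follow sorted group ids).
    merged = np.vstack(
        [expr[gene_groups == g].mean(axis=0) for g in np.unique(groups)]
    )
    return gene_groups, merged


# Synthetic stand-in: 600 genes drawn from 6 underlying patterns over 8 samples.
rng = np.random.default_rng(3)
patterns = rng.normal(loc=0.0, scale=1.0, size=(6, 8))
expr = np.repeat(patterns, 100, axis=0) + rng.normal(0.0, 0.1, size=(600, 8))

gene_groups, merged = merged_kmeans(expr, k=100, max_dist=0.1)
```

The threshold controls how aggressively centroids are merged, so it is worth inspecting the merged centroid plots at a few values before settling on one.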

-Anshul.

On Thu, Apr 27, 2017 at 10:00 PM, annashcherbina notifications@github.com wrote:
