kundajelab / DMSO


kmeans clustering of gene expression #7

Open annashcherbina opened 7 years ago

annashcherbina commented 7 years ago

K-means clustering yields some interesting results:

Plotting centroids of asinh(tpm) - sva for k=5,10,15,20,25, we see some interesting patterns emerging for k>=10, especially for the lateG1 timepoint:

asinhtpm_k5 asinhtpm_k10 asinhtpm_k15 asinhtpm_k20 asinhtpm_25
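For anyone wanting to reproduce the general idea, here is a minimal sketch of asinh-transforming a genes x samples TPM matrix and running k-means for several values of k. It is not the script used for the plots above: the toy matrix, the function names, and the farthest-first seeding are all assumptions made for illustration, and the SVA correction step is left out entirely.

```python
import math
import random

def asinh_transform(tpm_rows):
    """Apply asinh element-wise to a genes x samples matrix of TPM values."""
    return [[math.asinh(v) for v in row] for row in tpm_rows]

def sq_dist(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_farthest_first(rows, k, rng):
    """Pick one random row, then repeatedly add the row farthest from all chosen centroids."""
    centroids = [list(rng.choice(rows))]
    while len(centroids) < k:
        far = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(list(far))
    return centroids

def kmeans(rows, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = init_farthest_first(rows, k, rng)
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(r, centroids[c])) for r in rows]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Hypothetical toy data: 6 genes x 4 samples (2 control, then 2 DMSO).
toy_tpm = [
    [1.0, 1.2, 50.0, 55.0],
    [0.8, 1.1, 48.0, 60.0],
    [1.5, 0.9, 52.0, 47.0],
    [40.0, 45.0, 1.0, 0.7],
    [42.0, 39.0, 1.3, 1.1],
    [38.0, 44.0, 0.9, 1.4],
]
transformed = asinh_transform(toy_tpm)
results = {k: kmeans(transformed, k) for k in (2, 3)}
for k, (centroids, labels) in results.items():
    print(k, labels)
```

On a real matrix, a library implementation (for example scikit-learn's `KMeans`) would be the more sensible choice; the hand-rolled version here only keeps the sketch dependency-free.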

Looking at k=25, I performed (new) DAVID analysis of some interesting clusters:

genes up in all timepoints with DMSO treatment

cluster22 cluster22 david

genes up in DMSO treatment in late SG2M

cluster4 cluster4 david

the converse -- genes up in Control in late SG2M

cluster23 cluster23 david

genes up in DMSO treatment in lateG1

cluster15 cluster15 david

the converse -- genes down in DMSO treatment in lateG1

cluster24 cluster24 david

None of the other clusters had significant DAVID associations. I think this is very exciting -- I tried to run the differential genes from limma analysis on asinh(tpm)-sva through DAVID and did not get significant associations. It is encouraging to see these in the k-means clustering.

annashcherbina commented 7 years ago

However, trying the same approach for rlogTransform(counts) did not yield interesting clusters: rlog_k5 rlog_k15 rlog_k25

I think the problem is that the transform seems to be "squishing" all the expression values towards the mean. This explains the Volcano plots & low numbers of differential genes from the limma analysis when rlogTransform(counts) was used as the input.
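One way to check the "squishing" hypothesis is to compare the per-gene spread across samples under each transform. The sketch below is only a diagnostic outline: the toy values are made up, and the `squished` matrix is a synthetic stand-in that pulls each value halfway toward its gene mean. It is not rlog output; in practice the actual rlog matrix exported from DESeq2 would be passed in its place.

```python
import math
import statistics

def per_gene_sd(matrix):
    """Population standard deviation of each gene (row) across samples (columns)."""
    return [statistics.pstdev(row) for row in matrix]

def shrink_toward_row_mean(matrix, factor):
    """Synthetic stand-in for a compressing transform (NOT rlog): scale deviations from the row mean."""
    out = []
    for row in matrix:
        m = statistics.mean(row)
        out.append([m + factor * (v - m) for v in row])
    return out

# Hypothetical toy data: 3 genes x 4 samples.
toy_tpm = [
    [1.0, 1.2, 50.0, 55.0],
    [40.0, 45.0, 1.0, 0.7],
    [10.0, 12.0, 11.0, 9.0],
]
asinh_mat = [[math.asinh(v) for v in row] for row in toy_tpm]
squished = shrink_toward_row_mean(asinh_mat, 0.5)

# A much smaller median per-gene SD points to values being compressed toward the mean.
spread = {
    "asinh": statistics.median(per_gene_sd(asinh_mat)),
    "squished": statistics.median(per_gene_sd(squished)),
}
print(spread)
```

If the real rlog matrix shows a markedly lower median per-gene SD than asinh(tpm) on the same genes, that would be consistent with the explanation above, though it would not prove it.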

annashcherbina commented 7 years ago

Adding more clusters (k=30) highlights some interesting details -- we see 37 MIRs and SNORAs strongly up in lateG1 in the controls but not with DMSO treatment:

10 mir_snora

Also, a highly significant (FDR ~ 1e-8) phosphoprotein group cluster: 36

cluster36 david

@akundaje -- it seems that the k-means clustering approach is proving a lot more fruitful than the various tweaks on limma/DESeq2 -- should I focus on the k-means analysis to identify differential genes/pathways?

akundaje commented 7 years ago

You need both. If you cluster the differential effects you will likely see interesting things pop as well.

Btw, what you should do is run k-means with a large k (e.g. 100) and then use hierarchical clustering on the centroids, with a tight similarity threshold, to merge them into a non-redundant set.

-Anshul.
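The over-cluster-then-merge suggestion above can be sketched roughly as follows. This assumes the centroids from a large-k run are already in hand, and it makes choices the comment does not specify: correlation distance (1 - Pearson r), average linkage, a cutoff of 0.05, and an unweighted mean as the merged centroid. The toy centroids and all function names are hypothetical.

```python
import math

def corr_dist(a, b):
    """Correlation distance, 1 - Pearson r; flat profiles are treated as uncorrelated."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 1.0
    return 1.0 - cov / math.sqrt(va * vb)

def merge_centroids(centroids, max_dist=0.05):
    """Average-linkage agglomeration that stops once the closest pair exceeds max_dist."""
    groups = [[i] for i in range(len(centroids))]

    def linkage(a, b):
        total = sum(corr_dist(centroids[i], centroids[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(groups) > 1:
        d, i, j = min(
            (linkage(groups[i], groups[j]), i, j)
            for i in range(len(groups))
            for j in range(i + 1, len(groups))
        )
        if d > max_dist:
            break  # remaining groups are all distinct at this threshold
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return groups

def group_means(centroids, groups):
    """Unweighted mean profile of each merged group (weighting by cluster size is another option)."""
    return [
        [sum(centroids[i][s] for i in g) / len(g) for s in range(len(centroids[0]))]
        for g in groups
    ]

# Hypothetical centroids: the first two are near-duplicates, the other two are distinct.
toy_centroids = [
    [0.0, 0.1, 2.0, 2.1],
    [0.1, 0.0, 2.1, 2.0],
    [2.0, 2.1, 0.0, 0.1],
    [0.0, 2.0, 0.0, 2.0],
]
groups = merge_centroids(toy_centroids, max_dist=0.05)
merged = group_means(toy_centroids, groups)
print(groups)
```

With real data, `scipy.cluster.hierarchy.linkage` followed by `fcluster` with `criterion='distance'` does the same job far more efficiently; the naive loop here is only meant to make the merge rule explicit.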



annashcherbina commented 7 years ago

Ok, sounds good. Going forward, I will use the limma(asinh(tpm)-sva) and kmeans(asinh(tpm)-sva) approaches for the analysis.

annashcherbina commented 7 years ago

Analysis of differential genes from limma, with and without subsequent clustering:

early G1 down in DMSO treated samples

earlyg1 down

earlyg1 down 3

early G1 Up in DMSO treated

No clustering: earlyg1 up

After clustering:

earlyg1 up earlyg1 up 1

lateG1 down with DMSO treatment

lateg1 down

after clustering: lateg1 down

lateg1 down 2

lateg1 down 3

lateG1 up with DMSO treatment

lateg1 up

after clustering: lateg1 up

SG2M down with DMSO treatment

sg2m down

sg2m down 3

SG2M up with DMSO treatment

sg2m up

after clustering: sg2m up sg2m up 1 sg2m up 2 sg2m up 3

Note: the outliers on the dendrograms are all RNAs, the same ones noted in the heatmap above.