Open annashcherbina opened 7 years ago
However, trying the same approach for rlogTransform(counts) did not yield interesting clusters:
I think the problem is that the transform "squishes" all the expression values towards the mean. This would explain the volcano plots and the low number of differential genes from the limma analysis when rlogTransform(counts) was used as the input.
Adding more clusters (k=30) highlights some interesting details -- we see 37 MIRs and SNORAs up strongly in lateG1 in the controls but not in DMSO treatment:
Also, a highly significant (FDR ~ 1e-8) phosphoprotein group cluster:
@akundaje -- it seems that the k-means clustering approach is proving a lot more fruitful than the various tweaks on limma/DESeq2 -- should I focus on the k-means analysis to identify differential genes/pathways?
You need both. If you cluster the differential effects you will likely see interesting things pop up as well.
Btw, what you should do is run k-means with a large k (e.g. 100) and then use hierarchical clustering on the centroids, with a tight similarity threshold, to merge them into a non-redundant set.
-Anshul.
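A minimal sketch of the suggestion above (over-cluster with k-means, then merge near-duplicate centroids hierarchically), written in Python with NumPy and SciPy. The thread does not say which tools, linkage method, or distance were actually used, so everything here is an assumption for illustration: the helper names `make_toy_profiles` and `overcluster_and_merge`, the correlation distance, average linkage, the `max_corr_dist=0.05` cutoff, and the toy data. The `seed` keyword of `kmeans2` assumes SciPy 1.7 or newer.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.cluster.vq import kmeans2


def make_toy_profiles(n_per_pattern=300, noise_sd=0.05, seed=0):
    """Hypothetical toy data: three expression patterns over six conditions."""
    rng = np.random.default_rng(seed)
    patterns = np.array(
        [
            [0, 0, 0, 2, 2, 2],
            [2, 2, 2, 0, 0, 0],
            [0, 2, 0, 2, 0, 2],
        ],
        dtype=float,
    )
    X = np.repeat(patterns, n_per_pattern, axis=0)
    return X + rng.normal(scale=noise_sd, size=X.shape)


def overcluster_and_merge(X, k=100, max_corr_dist=0.05, seed=0):
    """Run k-means with a large k, then merge similar centroids.

    Returns one merged-group label per row of X and the number of groups.
    Assumes at least two k-means clusters end up non-empty.
    """
    # Step 1: deliberately over-cluster with k-means (k-means++ initialisation).
    centroids, labels = kmeans2(X, k, minit="++", seed=seed)

    # Ignore centroids that ended up with no members.
    used = np.unique(labels)

    # Step 2: hierarchical clustering of the centroids. Correlation distance
    # and average linkage are assumptions, not taken from the thread.
    Z = linkage(centroids[used], method="average", metric="correlation")

    # Step 3: cut the tree at a tight threshold so that only near-identical
    # centroid profiles are merged.
    groups = fcluster(Z, t=max_corr_dist, criterion="distance")

    # Map every original k-means label to its merged group.
    lookup = np.zeros(k, dtype=int)
    lookup[used] = groups
    return lookup[labels], len(np.unique(groups))


X = make_toy_profiles()
merged_labels, n_groups = overcluster_and_merge(X, k=12)
print(n_groups)
```

A tighter `max_corr_dist` keeps more distinct groups; a looser one merges more aggressively. On real expression data the cutoff would need to be chosen by inspecting the centroid dendrogram.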
Ok, sounds good. Going forward, I will use the limma(asinh(tpm)-sva) and kmeans(asinh(tpm)-sva) approaches for the analysis.
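For reference, a small sketch of the asinh transform on TPM values, using only the Python standard library. It only illustrates the transform itself: the surrogate-variable correction (the "-sva" part of the notation above) is not reproduced, and the function name `asinh_transform` and the example values are hypothetical.

```python
import math


def asinh_transform(tpm_values):
    """Apply the inverse hyperbolic sine to each TPM value."""
    return [math.asinh(v) for v in tpm_values]


# Hypothetical TPM values spanning zero to highly expressed.
tpm = [0.0, 0.5, 10.0, 1000.0]
for raw, transformed in zip(tpm, asinh_transform(tpm)):
    # asinh(x) is close to x near zero and close to log(2x) for large x,
    # so zero counts need no pseudocount.
    print(raw, round(transformed, 3))
```

Because asinh is defined at zero and grows logarithmically for large inputs, it compresses highly expressed genes without requiring a pseudocount for unexpressed ones.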
Analysis of differential genes from limma, with and without subsequent clustering:
No clustering:
After clustering:
After clustering:
After clustering:
After clustering:
Note: the outliers on the dendrograms are all RNAs, the same ones as noted in the heatmap above.
K-means clustering yields some interesting results:
Plotting centroids of asinh(tpm) - sva for k=5,10,15,20,25, we see some interesting patterns emerging for k>=10, especially for the lateG1 timepoint:
Looking at k=25, I performed (new) DAVID analysis of some interesting clusters:
genes up in all timepoints with DMSO treatment
genes up in DMSO treatment in late SG2M
the converse -- genes up in Control in late SG2M
genes up in DMSO treatment in lateG1
the converse -- genes down in DMSO treatment in lateG1
None of the other clusters had significant DAVID associations. I think this is very exciting -- when I ran the differential genes from the limma analysis on asinh(tpm)-sva through DAVID, I did not get significant associations, so it is encouraging to see them in the k-means clustering.