Clustering - Githubissues

Based on your R code, you are clustering raw count data; please use normalized data instead, since normalization can have a large impact on how objects cluster. E.g., a common normalization step, which limma voom performs, is to scale every count in a sample by that sample's total library size. Suppose that in sample 1 CLCA1 had 3000 counts and the total library size was 1M counts, and that in sample 2 CLCA1 had 6000 counts and the total library size was 2M. Essentially, sample 2 was sequenced more deeply, so all of its genes have about twice as many reads as in sample 1. After normalization, CLCA1 is therefore equally expressed in both samples (3000 counts per million in each) --> normalization is very important.
NOTE: normalize your entire dataset first, then extract the expression data for CLCA1, SERPINB2 and periostin (don’t normalize them independently).
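To make the arithmetic concrete, here is a minimal sketch in plain Python (not your R pipeline). The sample names and the "OTHER" placeholder gene are made up so that the library totals match the example above; voom itself works on the log2 scale and does more than this.

```python
# Hypothetical raw counts per sample: gene -> count. "OTHER" stands in for
# the rest of the transcriptome so the totals match the example (1M and 2M).
counts = {
    "sample1": {"CLCA1": 3000, "OTHER": 997_000},
    "sample2": {"CLCA1": 6000, "OTHER": 1_994_000},
}

def counts_per_million(sample_counts):
    """Scale each gene's count by the sample's total library size."""
    library_size = sum(sample_counts.values())
    return {gene: c * 1_000_000 / library_size for gene, c in sample_counts.items()}

# Normalize the whole dataset first...
cpm = {sample: counts_per_million(genes) for sample, genes in counts.items()}

# ...and only then pull out the gene(s) of interest.
clca1_cpm = {sample: genes["CLCA1"] for sample, genes in cpm.items()}
print(clca1_cpm)
```

If you normalized the three genes on their own, the "library size" would be the sum of just those three counts, which is not what you want.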
Only a few of your samples have very high counts, which is why your Th2-high group has very few subjects. Take log2 of your data (limma voom also does this: look into the voom function from limma).
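A quick illustration of why the log helps, again as a plain Python sketch. The values are invented, and the pseudocount of 1 is an assumption made here to avoid log2(0), not necessarily the offset voom uses.

```python
import math

# Hypothetical normalized expression values for one gene across six samples;
# the last two samples are extreme, like the few very high samples in your data.
values = [40, 55, 60, 70, 4000, 9000]

# Pseudocount of 1 (an assumption) keeps log2 defined for zero counts.
log_values = [math.log2(v + 1) for v in values]

raw_spread = max(values) - min(values)
log_spread = max(log_values) - min(log_values)
print(f"raw spread: {raw_spread}, log2 spread: {log_spread:.2f}")
```

The ordering of the samples is unchanged, but the extreme samples no longer dominate the distances that k-means relies on.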
The scale of each gene also has a large impact on how objects cluster. After you normalize your data --> standardize your genes (center and scale).
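Centering and scaling just means turning each gene into z-scores. A minimal Python sketch with invented log-scale values follows; in R, the base function scale() does this per column of a matrix.

```python
import statistics

def standardize(values):
    """Center to mean 0 and scale to unit sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 denominator)
    return [(v - mean) / sd for v in values]

# Hypothetical log2 expression values: one list per gene, one entry per sample.
expr = {
    "CLCA1": [5.1, 6.0, 5.5, 9.8, 10.2],
    "SERPINB2": [2.0, 2.4, 2.1, 4.0, 4.3],
    "periostin": [11.0, 11.5, 11.2, 13.9, 14.4],
}

# Standardize each gene separately so no single gene dominates the distances.
z = {gene: standardize(vals) for gene, vals in expr.items()}
for gene, vals in z.items():
    print(gene, [round(v, 2) for v in vals])
```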
Then perform k-means clustering on the standardized values.
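For reference, this is roughly what k-means does under the hood: a bare-bones Lloyd's algorithm in plain Python, run on invented standardized values for the three genes. It is only a sketch; in practice you would call kmeans() in R, ideally with several random starts.

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Bare-bones Lloyd's algorithm; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k distinct points as initial centers
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:  # assignments stable, so we have converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

# Hypothetical standardized (CLCA1, SERPINB2, periostin) values, one tuple per subject.
points = [
    (-0.9, -0.8, -1.0), (-1.1, -0.7, -0.9), (-0.8, -1.0, -0.7),
    (1.0, 0.9, 1.1), (0.9, 1.2, 0.8), (1.1, 0.8, 1.0),
]

labels, centers = kmeans(points, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is which subjects end up together.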
Lastly, since you only have 3 genes, try making a 3D plot of your clusters where each axis is a gene.

santina / team_Undecided