peterawe / CMScaller

GNU General Public License v3.0
Problem with loading data #4

Status: Closed (amphun closed this issue 4 years ago)

amphun commented 4 years ago

Hi, I have two main problems.

First, reading IDs. I tried several ID types (Entrez, Ensembl and HGNC). Here are some of the error messages:

"136/748 rownames(emat) failed to match to human gene identifiers"

res <- CMScaller(X_TestCMS_nEnsembl, rowNames = "ensg")
770/770 rownames [NA.number] (no valid translation)
0/770 rownames [id.number] (translation gives duplicates)
Error in `.rowNamesDF<-`(x, value = value) : missing values in 'row.names' are not allowed

The second issue is how to prepare data in the same format as the example crcTCGA file. My data are in a CSV or Excel file, with gene IDs in the first column and one experiment in each remaining column. I have attached the file to this message. I tried rounding the values and using data without replicates, but nothing worked for me. TestCMS.xlsx

Thank you in advance for your advice.

peterawe commented 4 years ago

Hi,

I tested your data using the code below. You should see two plots, and res holds the predictions. If you're unable to get it working, let me know. PS: Please be advised that with fewer than 30-40 samples the predictions can be unstable, as stated in our paper.

Best, Peter

## load data, ensure unique, non-NA rownames and make expression matrix with ids as rownames
# install.packages("readxl") # install if needed
data <- readxl::read_xlsx("TestCMS.xlsx")
ensg <- data$`Ensembl Gene ID`
keep <- !(is.na(ensg) | duplicated(ensg))
emat <- as.matrix(data[keep,-1])
rownames(emat) <- ensg[keep]
## see how input should be formatted
head(emat[,1:5])
## make prediction
res <- CMScaller::CMScaller(emat, rowNames = 'ensg', RNAseq=TRUE)
## see gene set associations
emat_entrez <- CMScaller::replaceGeneId(emat, id.in='ensg', id.out="entrez")
CMScaller::CMSgsa(emat_entrez, class = res$prediction)
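
For readers without the attached file, the filtering step above can be tried on toy data. This is only a minimal sketch in base R: the data frame, its column names (gene_id, s1, s2) and its values are made up for illustration, and the file name in the read.csv comment is hypothetical. It shows how rows with missing or repeated IDs are dropped before the IDs become row names, which is one way to avoid the "missing values in 'row.names'" error.

```r
# Toy data standing in for a CSV/Excel import (all values are made up).
# For a real CSV one might use something like:
#   df <- read.csv("TestCMS.csv", check.names = FALSE)  # hypothetical file name
df <- data.frame(
  gene_id = c("ENSG00000000003", "ENSG00000000005", NA, "ENSG00000000003"),
  s1 = c(10, 0, 5, 7),
  s2 = c(12, 1, 4, 9),
  stringsAsFactors = FALSE
)

ids <- df$gene_id

# Keep only rows whose ID is present and is the first occurrence of that ID.
keep <- !(is.na(ids) | duplicated(ids))

# Drop the ID column, convert to a numeric matrix, and use the IDs as rownames.
emat <- as.matrix(df[keep, -1])
rownames(emat) <- ids[keep]

# Sanity checks: rownames must be non-missing and unique.
stopifnot(!anyNA(rownames(emat)), anyDuplicated(rownames(emat)) == 0)

dim(emat)
```

Note that duplicated() keeps the first row for each repeated ID and discards the rest; if repeated IDs should instead be averaged or summed, that would need a separate aggregation step.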
amphun commented 4 years ago

Thanks so much. It's working. This is just a test set. We will add more samples later. Thanks again for your support and suggestion.

Kind regards, Amphun