sunduanchen / Scissor

Scissor package
GNU General Public License v3.0
168 stars 29 forks source link

The percentage of selected cell is: 0.000% #42

Open hyjforesight opened 1 year ago

hyjforesight commented 1 year ago

Hello Scissor, I'm running Scissor, but no matter what alpha I used, it always came out 0 selected cells. Could you please help me with this issue? Thanks! Best, Yuanjian

infos <- Scissor_change(bulk_dataset=GeneMatrix_HGNC_input, sc_dataset=Merge2.combined, phenotype=phenotype, tag = tag, alpha = seq(1,5,1)/10000000, cutoff = 0.01, family = "binomial", Save_file = "Merge2_Scissor.RData")
[1] "|**************************************************|"
[1] "Performing quality-check for the correlations"
[1] "The five-number summary of correlations:"
         0%         25%         50%         75%        100% 
-0.04986036  0.14333164  0.23372394  0.35075314  0.79686469 
[1] "|**************************************************|"
[1] "Current phenotype contains 48 Wild type and 17 Mutated samples."
[1] "Perform logistic regression on the given phenotypes:"
[1] "alpha = 1e-07"
[1] "Scissor identified 0 Scissor+ cells and 0 Scissor- cells."
[1] "The percentage of selected cell is: 0.000%"
sunduanchen commented 1 year ago

Hi Yuanjian,

I may need your data to debug it. Could you please send me your data to @.***? For data privacy, you can send me a subset of data with this problem.

Thanks, Duanchen

On Tue, 13 Dec 2022 at 09:46, Yuanjian Huang @.***> wrote:

Hello Scissor, I'm running Scissor, but no matter what alpha I used, it always came out 0 selected cells. Could you please help me with this issue? Thanks! Best, Yuanjian

infos <- Scissor_change(bulk_dataset=GeneMatrix_HGNC_input, sc_dataset=Merge2.combined, phenotype=phenotype, tag = tag, alpha = seq(1,5,1)/10000000, cutoff = 0.01, family = "binomial", Save_file = "Merge2_Scissor.RData") [1] "|**|" [1] "Performing quality-check for the correlations" [1] "The five-number summary of correlations:" 0% 25% 50% 75% 100% -0.04986036 0.14333164 0.23372394 0.35075314 0.79686469 [1] "|**|" [1] "Current phenotype contains 48 Wild type and 17 Mutated samples." [1] "Perform logistic regression on the given phenotypes:" [1] "alpha = 1e-07" [1] "Scissor identified 0 Scissor+ cells and 0 Scissor- cells." [1] "The percentage of selected cell is: 0.000%"

— Reply to this email directly, view it on GitHub https://github.com/sunduanchen/Scissor/issues/42, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEYQWDEAUKAV7XE5PMTAUILWM7IQJANCNFSM6AAAAAAS4T6KVU . You are receiving this because you are subscribed to this thread.Message ID: @.***>

hyjforesight commented 1 year ago

Hello Duanchen, Thanks for the response. However, your email address was not shown in the reply. Now, I found 2 genes that cannot be computed Scissor selected cells in my case. They're TP53 and MSH6. I want a cutoff of 0.2 for both. For MSH6, a=0.00001, Scissor selected 82.336% cells; a=0.000011, 81.302%; a=0.000012-0.00005, 0. For TP53, a>0.0000001, 0. Here I shared the link for downloading my data and the code I used. Thanks for the debug!

Data on Google drive: https://drive.google.com/file/d/1ZWuyQD7LVIfS1DBHDjdoP5-ZoAo16NBY/view?usp=sharing (2.1GB) 如果你在国内的话,我可以上传到百度网盘。

Sys.setenv(LANG = "en_US")
library(Seurat)
library(ggplot2)
library(sctransform)
library(dplyr)
library(patchwork)
library(SeuratDisk)
library(TCGAbiolinks)
library(SummarizedExperiment)
library(DT)
library(biomaRt)
library(Scissor)
memory.limit(512000)
options(future.globals.maxSize = 256000*1024^2)

# 1. Get the patient list with the interested gene mutated
# download TCGA CNV information
query_1 <- GDCquery(project = "TCGA-COAD", data.category = "Simple Nucleotide Variation", data.type = "Masked Somatic Mutation", workflow.type = "Aliquot Ensemble Somatic Variant Merging and Masking", legacy = FALSE, access= 'open', data.format='TXT')
GDCdownload(query_1)
CNV <- GDCprepare(query=query_1, save=TRUE, save.filename='C:/Users/Park_Lab/Documents/TCGA_COAD_CNV.rda')
write.csv(CNV, file="C:/Users/Park_Lab/Documents/TCGA_COAD_CNV.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# retrieve the patient list with the interested gene mutated
mutation_df <- CNV[CNV$Hugo_Symbol == 'MSH6',]    # subset CNV table into a table with only interested gene mutated
write.csv(mutation_df, file="C:/Users/Park_Lab/Documents/mutation_df.csv", sep = ",", row.names=TRUE, col.names = TRUE)

mutation_patient=as.data.frame(mutation_df$Tumor_Sample_Barcode)    # use column `Tumor_Sample_Barcode` to build a new data frame
colnames(mutation_patient) <- c("Patient_ID")    # rename the name of column 0 to Patient_ID
mutation_patient$Patient_ID <- substr(mutation_patient$Patient_ID, 0, 12)    # only keep the first 12 character as Patient_ID, these patients are with interested gene mutated, note that this patient list includes all cancer types
mutation_patient$Status <- 1    # set 1 as the mutation status
write.csv(mutation_patient, file="C:/Users/Park_Lab/Documents/mutation_patient_list.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# 2. Get the patient list with interested pathological type
# download TCGA-COAD clinical index data
clinical_index <- GDCquery_clinic(project = "TCGA-COAD", type = "Clinical", save.csv = TRUE)    # will generate TCGA-COAD_clinical.csv automatically

# retrieve the patient list with the interested pathological type
disease_clinical <- clinical_index %>% filter(primary_diagnosis=="Mucinous adenocarcinoma")    # only keep the mucinous cancer patients clinical data
disease_patient=as.data.frame(disease_clinical$submitter_id)    # use column `submitter_id` to build a new data frame
colnames(disease_patient) <- c("Patient_ID")    # rename the name of column 0 to Patient_ID, note that this patient list includes all mucinous cancer with or without interested gene mutated 
write.csv(disease_patient, file="C:/Users/Park_Lab/Documents/disease_patient_list1.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# 3. Prepare phenotype list (For patient diagnosed with one pathological type, set mutated ones as 1, wild type as 0)
disease_patient$Status=mutation_patient$Status[match(disease_patient$Patient_ID, mutation_patient$Patient_ID)]    # copy mutation_patient status (1) to disease_patient if disease_patient also exist in mutation_patient data frame
disease_patient[is.na(disease_patient)] = 0    # set left patient status as 0, indicating wild type
write.csv(disease_patient, file="C:/Users/Park_Lab/Documents/disease_patient_list2.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# 4. Prepare human gene symbol annotated TCGA-COAD RNA-seq gene matrix 
# download TCGA-COAD gene matrix
query_2 <- GDCquery(project = "TCGA-COAD", data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", workflow.type = "STAR - Counts", legacy = FALSE, access= 'open', data.format='TXT', experimental.strategy='RNA-Seq', sample.type=c('Primary Tumor'))    # didn't download 'Solid Tissue Normal' data, because this script is for tumor CNV
GDCdownload(query_2)
RNA_seq <- GDCprepare(query=query_2, save=TRUE, save.filename='C:/Users/Park_Lab/Documents/TCGA_COAD_RNA_seq.rda')
GeneMatrix <- assay(RNA_seq)
write.csv(GeneMatrix, file="C:/Users/Park_Lab/Documents/TCGA_COAD_GeneMatrix.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# convert human ensembl_id of gene matrix to gene_symbol
human <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")
ensembl_id <- rownames(GeneMatrix)
gene_symbol <- getBM(attributes = c('ensembl_gene_id_version', 'hgnc_symbol'), filters = 'ensembl_gene_id_version', values = ensembl_id, mart = human)
head(gene_symbol)

gene_symbol <- gene_symbol[!(gene_symbol$hgnc_symbol==""),]    # remove rows with empty human gene
gene_symbol <- gene_symbol[!duplicated(gene_symbol$ensembl_gene_id_version),]    # remove rows with duplicated human ensembl
gene_symbol <- gene_symbol[!duplicated(gene_symbol$hgnc_symbol),]    # remove rows with duplicated human gene

class(GeneMatrix)
GeneMatrix_df<- as.data.frame(GeneMatrix)
class(GeneMatrix_df)

GeneMatrix_df$genes <- gene_symbol$hgnc_symbol[match(rownames(GeneMatrix_df), gene_symbol$ensembl_gene_id_version)]    # copy gene symbols to gene expression matrix, matching human ensembl id
GeneMatrix_df <- na.omit(GeneMatrix_df)    # remove rows with 'NA' in 'genes' column
rownames(GeneMatrix_df) <- GeneMatrix_df$genes    # set 'genes' as rownames
GeneMatrix_HGNC = GeneMatrix_df[,!(colnames(GeneMatrix_df) %in% c('genes'))]    # remove 'genes' column
write.csv(GeneMatrix_HGNC, file="C:/Users/Park_Lab/Documents/GeneMatrix_HGNC.csv", sep = ",", row.names=TRUE, col.names = TRUE)    # gene matrix with gene_symbol

# 5. Match gene matrix and phenotype list
# subset gene matrix into the patients that only exist in the phenotype list
colnames(GeneMatrix_HGNC) <- substr(colnames(GeneMatrix_HGNC), 0, 12)    # only keep the first 12 character of sample ID to for matching with Patient_ID later
GeneMatrix_HGNC <- GeneMatrix_HGNC[, !duplicated(colnames(GeneMatrix_HGNC))]    # 1 patient may have >1 samples, after simplify sample ID, there will be duplicates, but phenotype list has no duplicates, so we have to remove these sample duplicates
GeneMatrix_HGNC <- subset(GeneMatrix_HGNC, select = colnames(GeneMatrix_HGNC) %in% disease_patient$Patient_ID)    # only include patients that also exist in the phenotype list
write.csv(GeneMatrix_HGNC, file="C:/Users/Park_Lab/Documents/GeneMatrix_HGNC_disease.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# make gene matrix and phenotype list the same order
GeneMatrix_HGNC_sorted <- GeneMatrix_HGNC[ , order(match(colnames(GeneMatrix_HGNC), disease_patient$Patient_ID))]    # sort gene matrix the same order as patient ids of phenotype list
all(colnames(GeneMatrix_HGNC_sorted) == disease_patient$Patient_ID)    # check whether the orders of sample ids in gene matrix and phenotype list are the same
write.csv(GeneMatrix_HGNC_sorted, file="C:/Users/Park_Lab/Documents/GeneMatrix_HGNC_disease_sorted.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# make the sample id annotated numeric phenotype list
class(disease_patient)
rownames(disease_patient) <- disease_patient$Patient_ID    # set Patient_ID as data frame index
disease_patient_input = disease_patient[,!(colnames(disease_patient) %in% c('Patient_ID'))]    # remove column 'Patient_ID'
disease_patient_input    # disease_patient becomes numeric, only the values of column 'Status' will show
names(disease_patient_input) = rownames(disease_patient)    # copy the rownames back
disease_patient_input
class(disease_patient_input)
table(disease_patient_input)    # show statistics of status 0 and 1
write.csv(disease_patient_input, file="C:/Users/Park_Lab/Documents/disease_patient_input.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# 6. Prepare scRNA-seq data
Merge2.combined <- readRDS(file = "C:/Users/Park_Lab/Documents/Merge2.combined_all.rds")

# 7. Run Scissor
phenotype <- disease_patient_input
tag <- c('Wild type', 'Mutated')    # name the phenotype 0 (Wild type) first, then phenotype 1 (Mutated)
GeneMatrix_HGNC_input  <- as.matrix(GeneMatrix_HGNC_sorted)    # convert gene matrix (data.frame) to 'matrix'
# data is too big, the original Scissor() got errors. To fix this error, run the code of Scissor_change.R and as_matrix.R first, then run below steps
infos <- Scissor_change(bulk_dataset=GeneMatrix_HGNC_input, sc_dataset=Merge2.combined, phenotype=phenotype, tag = tag, alpha = 0.000012, cutoff = 0.01, family = "binomial", Save_file = "Merge2_Scissor.RData")
# MSH6 0.00001, 82.336%; 0.000011, 81.302%; 0.000012-0.00005, 0;
# TP53 >0.0000001, 0
Expectolive commented 9 months ago

Hello Duanchen, Thanks for the response. However, your email address was not shown in the reply. Now, I found 2 genes that cannot be computed Scissor selected cells in my case. They're TP53 and MSH6. I want a cutoff of 0.2 for both. For MSH6, a=0.00001, Scissor selected 82.336% cells; a=0.000011, 81.302%; a=0.000012-0.00005, 0. For TP53, a>0.0000001, 0. Here I shared the link for downloading my data and the code I used. Thanks for the debug!

Data on Google drive: https://drive.google.com/file/d/1ZWuyQD7LVIfS1DBHDjdoP5-ZoAo16NBY/view?usp=sharing (2.1GB) 如果你在国内的话,我可以上传到百度网盘。

Sys.setenv(LANG = "en_US")
library(Seurat)
library(ggplot2)
library(sctransform)
library(dplyr)
library(patchwork)
library(SeuratDisk)
library(TCGAbiolinks)
library(SummarizedExperiment)
library(DT)
library(biomaRt)
library(Scissor)
memory.limit(512000)
options(future.globals.maxSize = 256000*1024^2)

# 1. Get the patient list with the interested gene mutated
# download TCGA CNV information
query_1 <- GDCquery(project = "TCGA-COAD", data.category = "Simple Nucleotide Variation", data.type = "Masked Somatic Mutation", workflow.type = "Aliquot Ensemble Somatic Variant Merging and Masking", legacy = FALSE, access= 'open', data.format='TXT')
GDCdownload(query_1)
CNV <- GDCprepare(query=query_1, save=TRUE, save.filename='C:/Users/Park_Lab/Documents/TCGA_COAD_CNV.rda')
write.csv(CNV, file="C:/Users/Park_Lab/Documents/TCGA_COAD_CNV.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# retrieve the patient list with the interested gene mutated
mutation_df <- CNV[CNV$Hugo_Symbol == 'MSH6',]    # subset CNV table into a table with only interested gene mutated
write.csv(mutation_df, file="C:/Users/Park_Lab/Documents/mutation_df.csv", sep = ",", row.names=TRUE, col.names = TRUE)

mutation_patient=as.data.frame(mutation_df$Tumor_Sample_Barcode)    # use column `Tumor_Sample_Barcode` to build a new data frame
colnames(mutation_patient) <- c("Patient_ID")    # rename the name of column 0 to Patient_ID
mutation_patient$Patient_ID <- substr(mutation_patient$Patient_ID, 0, 12)    # only keep the first 12 character as Patient_ID, these patients are with interested gene mutated, note that this patient list includes all cancer types
mutation_patient$Status <- 1    # set 1 as the mutation status
write.csv(mutation_patient, file="C:/Users/Park_Lab/Documents/mutation_patient_list.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# 2. Get the patient list with interested pathological type
# download TCGA-COAD clinical index data
clinical_index <- GDCquery_clinic(project = "TCGA-COAD", type = "Clinical", save.csv = TRUE)    # will generate TCGA-COAD_clinical.csv automatically

# retrieve the patient list with the interested pathological type
disease_clinical <- clinical_index %>% filter(primary_diagnosis=="Mucinous adenocarcinoma")    # only keep the mucinous cancer patients clinical data
disease_patient=as.data.frame(disease_clinical$submitter_id)    # use column `submitter_id` to build a new data frame
colnames(disease_patient) <- c("Patient_ID")    # rename the name of column 0 to Patient_ID, note that this patient list includes all mucinous cancer with or without interested gene mutated 
write.csv(disease_patient, file="C:/Users/Park_Lab/Documents/disease_patient_list1.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# 3. Prepare phenotype list (For patient diagnosed with one pathological type, set mutated ones as 1, wild type as 0)
disease_patient$Status=mutation_patient$Status[match(disease_patient$Patient_ID, mutation_patient$Patient_ID)]    # copy mutation_patient status (1) to disease_patient if disease_patient also exist in mutation_patient data frame
disease_patient[is.na(disease_patient)] = 0    # set left patient status as 0, indicating wild type
write.csv(disease_patient, file="C:/Users/Park_Lab/Documents/disease_patient_list2.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# 4. Prepare human gene symbol annotated TCGA-COAD RNA-seq gene matrix 
# download TCGA-COAD gene matrix
query_2 <- GDCquery(project = "TCGA-COAD", data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", workflow.type = "STAR - Counts", legacy = FALSE, access= 'open', data.format='TXT', experimental.strategy='RNA-Seq', sample.type=c('Primary Tumor'))    # didn't download 'Solid Tissue Normal' data, because this script is for tumor CNV
GDCdownload(query_2)
RNA_seq <- GDCprepare(query=query_2, save=TRUE, save.filename='C:/Users/Park_Lab/Documents/TCGA_COAD_RNA_seq.rda')
GeneMatrix <- assay(RNA_seq)
write.csv(GeneMatrix, file="C:/Users/Park_Lab/Documents/TCGA_COAD_GeneMatrix.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# convert human ensembl_id of gene matrix to gene_symbol
human <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")
ensembl_id <- rownames(GeneMatrix)
gene_symbol <- getBM(attributes = c('ensembl_gene_id_version', 'hgnc_symbol'), filters = 'ensembl_gene_id_version', values = ensembl_id, mart = human)
head(gene_symbol)

gene_symbol <- gene_symbol[!(gene_symbol$hgnc_symbol==""),]    # remove rows with empty human gene
gene_symbol <- gene_symbol[!duplicated(gene_symbol$ensembl_gene_id_version),]    # remove rows with duplicated human ensembl
gene_symbol <- gene_symbol[!duplicated(gene_symbol$hgnc_symbol),]    # remove rows with duplicated human gene

class(GeneMatrix)
GeneMatrix_df<- as.data.frame(GeneMatrix)
class(GeneMatrix_df)

GeneMatrix_df$genes <- gene_symbol$hgnc_symbol[match(rownames(GeneMatrix_df), gene_symbol$ensembl_gene_id_version)]    # copy gene symbols to gene expression matrix, matching human ensembl id
GeneMatrix_df <- na.omit(GeneMatrix_df)    # remove rows with 'NA' in 'genes' column
rownames(GeneMatrix_df) <- GeneMatrix_df$genes    # set 'genes' as rownames
GeneMatrix_HGNC = GeneMatrix_df[,!(colnames(GeneMatrix_df) %in% c('genes'))]    # remove 'genes' column
write.csv(GeneMatrix_HGNC, file="C:/Users/Park_Lab/Documents/GeneMatrix_HGNC.csv", sep = ",", row.names=TRUE, col.names = TRUE)    # gene matrix with gene_symbol

# 5. Match gene matrix and phenotype list
# subset gene matrix into the patients that only exist in the phenotype list
colnames(GeneMatrix_HGNC) <- substr(colnames(GeneMatrix_HGNC), 0, 12)    # only keep the first 12 character of sample ID to for matching with Patient_ID later
GeneMatrix_HGNC <- GeneMatrix_HGNC[, !duplicated(colnames(GeneMatrix_HGNC))]    # 1 patient may have >1 samples, after simplify sample ID, there will be duplicates, but phenotype list has no duplicates, so we have to remove these sample duplicates
GeneMatrix_HGNC <- subset(GeneMatrix_HGNC, select = colnames(GeneMatrix_HGNC) %in% disease_patient$Patient_ID)    # only include patients that also exist in the phenotype list
write.csv(GeneMatrix_HGNC, file="C:/Users/Park_Lab/Documents/GeneMatrix_HGNC_disease.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# make gene matrix and phenotype list the same order
GeneMatrix_HGNC_sorted <- GeneMatrix_HGNC[ , order(match(colnames(GeneMatrix_HGNC), disease_patient$Patient_ID))]    # sort gene matrix the same order as patient ids of phenotype list
all(colnames(GeneMatrix_HGNC_sorted) == disease_patient$Patient_ID)    # check whether the orders of sample ids in gene matrix and phenotype list are the same
write.csv(GeneMatrix_HGNC_sorted, file="C:/Users/Park_Lab/Documents/GeneMatrix_HGNC_disease_sorted.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# make the sample id annotated numeric phenotype list
class(disease_patient)
rownames(disease_patient) <- disease_patient$Patient_ID    # set Patient_ID as data frame index
disease_patient_input = disease_patient[,!(colnames(disease_patient) %in% c('Patient_ID'))]    # remove column 'Patient_ID'
disease_patient_input    # disease_patient becomes numeric, only the values of column 'Status' will show
names(disease_patient_input) = rownames(disease_patient)    # copy the rownames back
disease_patient_input
class(disease_patient_input)
table(disease_patient_input)    # show statistics of status 0 and 1
write.csv(disease_patient_input, file="C:/Users/Park_Lab/Documents/disease_patient_input.csv", sep = ",", row.names=TRUE, col.names = TRUE)

# 6. Prepare scRNA-seq data
Merge2.combined <- readRDS(file = "C:/Users/Park_Lab/Documents/Merge2.combined_all.rds")

# 7. Run Scissor
phenotype <- disease_patient_input
tag <- c('Wild type', 'Mutated')    # name the phenotype 0 (Wild type) first, then phenotype 1 (Mutated)
GeneMatrix_HGNC_input  <- as.matrix(GeneMatrix_HGNC_sorted)    # convert gene matrix (data.frame) to 'matrix'
# data is too big, the original Scissor() got errors. To fix this error, run the code of Scissor_change.R and as_matrix.R first, then run below steps
infos <- Scissor_change(bulk_dataset=GeneMatrix_HGNC_input, sc_dataset=Merge2.combined, phenotype=phenotype, tag = tag, alpha = 0.000012, cutoff = 0.01, family = "binomial", Save_file = "Merge2_Scissor.RData")
# MSH6 0.00001, 82.336%; 0.000011, 81.302%; 0.000012-0.00005, 0;
# TP53 >0.0000001, 0

Hi, I met the same problem that Scissor turned out to select 0.000%. Could I ask you how you work this out?