tjhwangxiong / TCGAplot

A number of functions were generated to perform pan-cancer DEG analysis, correlation analysis between gene expression and TMB, MSI, TIME, and promoter methylation. Methods for visualization were provided in order to easily perform integrative pan-cancer multi-omics analysis.

Question about patient number #6

Closed Knight1995 closed 7 months ago

Knight1995 commented 8 months ago

Thanks for your great work! I am confused about the number of patients. For example, in LIHC your database includes 335 patients, but GEPIA has 364. What is the reason for the difference? Thanks.

tjhwangxiong commented 8 months ago

Please refer to https://github.com/tjhwangxiong/TCGAplot/blob/main/rawdata_code/tpm.R and https://github.com/tjhwangxiong/TCGAplot/blob/main/rawdata_code/meta.R to see the data processing procedure.

tjhwangxiong commented 8 months ago

We have removed some duplicated samples.

Knight1995 commented 8 months ago

I double-checked my downloaded TCGA portal data, cBioPortal data, and GEPIA data. They all show about 360–370 patients rather than about 335. Maybe you should double-check your downloaded LIHC data? Anyway, thanks for your reply.

tjhwangxiong commented 8 months ago

meta=meta[!duplicated(meta$bcr_patient_barcode),]
k1 = meta$time>=0.1
k2 = !(is.na(meta$time)|is.na(meta$event))
meta = meta[k1&k2,]

exprSet=dplyr::filter(tpm,Group=="Tumor")%>%
  tibble::add_column(ID = stringr::str_sub(rownames(.),1,12), .before="Cancer") %>%
  dplyr::filter(!duplicated(ID)) %>%
  tibble::remove_rownames(.) %>%
  tibble::column_to_rownames("ID")%>%
  dplyr::select(-(1:2))

s = intersect(rownames(meta),rownames(exprSet));length(s)
meta = meta[s,]

We removed duplicated patient barcodes and patients whose survival time or event status was NA; the code above also drops patients with a survival time below 0.1. We then intersected the remaining patient IDs with those in the tumor expression matrix. As a result, the number of samples with clinical information may differ from that reported by other online tools.
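To make the effect of each filter easier to see, here is a minimal base-R sketch on toy data. It is not the package's actual processing script: the barcodes, survival values, and the `expr_ids` vector are hypothetical, and only the three filtering steps (deduplicate, require valid survival data, intersect with expression IDs) mirror the code above.

```r
# Toy clinical table with hypothetical barcodes and values.
meta <- data.frame(
  bcr_patient_barcode = c("TCGA-AA-0001", "TCGA-AA-0001", "TCGA-AA-0002",
                          "TCGA-AA-0003", "TCGA-AA-0004", "TCGA-AA-0005"),
  time  = c(12.0, 12.0, 0.0, NA, 30.5, 8.2),
  event = c(1, 1, 0, 1, NA, 0)
)

# Hypothetical patient IDs present in the tumor expression matrix.
expr_ids <- c("TCGA-AA-0001", "TCGA-AA-0002", "TCGA-AA-0004", "TCGA-AA-0006")

n_start <- nrow(meta)

# Step 1: keep one row per patient barcode.
meta <- meta[!duplicated(meta$bcr_patient_barcode), ]
n_dedup <- nrow(meta)

# Step 2: keep patients with non-missing time and event, and time >= 0.1.
keep <- !is.na(meta$time) & !is.na(meta$event) & meta$time >= 0.1
meta <- meta[keep, ]
n_surv <- nrow(meta)

# Step 3: keep only patients that also have expression data.
rownames(meta) <- meta$bcr_patient_barcode
shared <- intersect(rownames(meta), expr_ids)
meta <- meta[shared, ]
n_final <- nrow(meta)

# Number of rows remaining after each step.
counts <- c(start = n_start, deduplicated = n_dedup,
            survival_ok = n_surv, with_expression = n_final)
print(counts)
```

Tallying the row count after each step in this way on the real LIHC tables should show which filter accounts for most of the gap between the tumor samples in the expression matrix and the patients kept in `meta`.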

When you run get_cancers(), you will find that 50 normal and 374 tumor samples are included in the tpm expression matrix for LIHC. When you run dim(get_meta("LIHC")), you will see that only the 335 samples with sufficient clinical information are included in meta.

I hope this answers your question.