Closed Knight1995 closed 7 months ago
Please refer to https://github.com/tjhwangxiong/TCGAplot/blob/main/rawdata_code/tpm.R and https://github.com/tjhwangxiong/TCGAplot/blob/main/rawdata_code/meta.R to see the data processing procedure.
We have removed some duplicated samples.
I double-checked my downloaded TCGA portal data, cbioportal data, and Gepia data. They show about 360–370 patients rather than about 335. Maybe you should double-check your downloaded LIHC data? Anyway, thanks for your reply.
meta = meta[!duplicated(meta$bcr_patient_barcode),]
k1 = meta$time>=0.1
k2 = !(is.na(meta$time)|is.na(meta$event))
meta = meta[k1&k2,]

exprSet = dplyr::filter(tpm,Group=="Tumor") %>%
  tibble::add_column(ID = stringr::str_sub(rownames(.),1,12), .before="Cancer") %>%
  dplyr::filter(!duplicated(ID)) %>%
  tibble::remove_rownames(.) %>%
  tibble::column_to_rownames("ID") %>%
  dplyr::select(-(1:2))

s = intersect(rownames(meta),rownames(exprSet));length(s)
meta = meta[s,]
We removed duplicated barcodes and patients whose survival time or event was NA or whose survival time was below 0.1. We then intersected the sample names with the expression matrix. As a result, the number of samples with clinical information may differ from other online tools.
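To make the effect of these filters concrete, here is a minimal sketch in base R on a made-up table. The barcodes, times, events, and the `expr_ids` vector are all hypothetical and only mirror the column names used in the snippet above; this is not the actual TCGAplot processing script.

```r
# Toy clinical table; every value here is invented for illustration.
meta <- data.frame(
  bcr_patient_barcode = c("P1", "P1", "P2", "P3", "P4", "P5"),
  time  = c(12.0, 12.0, 0.0, NA, 30.5, 8.2),
  event = c(1, 1, 0, 1, NA, 0),
  stringsAsFactors = FALSE
)

# 1. Keep only the first row for each patient barcode.
meta <- meta[!duplicated(meta$bcr_patient_barcode), ]
rownames(meta) <- meta$bcr_patient_barcode

# 2. Drop patients with a missing time or event, or with time below 0.1.
keep <- !is.na(meta$time) & !is.na(meta$event) & meta$time >= 0.1
meta <- meta[keep, ]

# 3. Keep only patients that also appear in the expression matrix.
#    expr_ids stands in for rownames(exprSet) and is hypothetical.
expr_ids <- c("P1", "P2", "P4", "P5")
shared <- intersect(rownames(meta), expr_ids)
meta <- meta[shared, ]

# Number of patients left after all three filters.
print(nrow(meta))
```

Each step can only shrink the table, so a tool that skips any of these filters would be expected to report more patients for the same cohort.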
When you run get_cancers(), you will find that 50 normal and 374 tumor samples are included in the tpm expression matrix. When you run dim(get_meta("LIHC")), you will see that only 335 samples with sufficient clinical information are included in meta.
I hope this answers your inquiry.
Thanks for your great work! I am confused about the patient numbers. For example, in LIHC, your database includes 335 patients, but in Gepia there are 364 patients. What is the reason? Thanks.