songw01 / MEGENA

Multiscale embedded gene co-expression network analysis
GNU General Public License v3.0

How to get the corrected data using lm()? #11

Closed WHOISSIILVIA closed 3 years ago

WHOISSIILVIA commented 3 years ago

Gene expression data processing. Illumina HiSeq RNA sequencing data from TCGA (i.e. RNASeqV2), processed by the reads per kilobase per million (RPKM) method, were downloaded, and comprehensive data quality control was performed. Primary and metastatic tumor samples were processed together by log2(RPKM + 1) transformation, followed by quantile normalization. We then split the data into primary- and metastatic-specific gene expression matrices, corrected for batch effects using the center, platform, and tissue source site (TSS) IDs from TCGA sample barcodes, and corrected for confounding factors including race, age, and gender by capturing residuals with intercepts from a linear regression model fitted with the lm() function in R (version 3.4.2). This resulted in 103/353 annotated primary/metastatic tumor tissue samples across 19,047 genes.

What code should I use to get the corrected data with lm()? Is the following correct?

    fit <- lm(value ~ as.factor(center) + as.factor(tissue_source_site) + as.numeric(age_at_index) + as.factor(race) + as.numeric(gender), data = Data)
    fit$residuals
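For reference, here is a minimal sketch of one way to read "capturing residuals with intercepts": fit one lm() per gene and add the fitted intercept back to the residuals. This is an assumption about what the description means, not the authors' confirmed code. The toy data, the column names, and the helper `correct_gene` are all hypothetical.

```r
# Hypothetical toy data: 3 genes x 40 samples (all names are illustrative only).
set.seed(1)
n <- 40
covars <- data.frame(
  center             = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  tissue_source_site = factor(sample(c("S1", "S2"), n, replace = TRUE)),
  age_at_index       = round(runif(n, 30, 80)),
  race               = factor(sample(c("r1", "r2"), n, replace = TRUE)),
  gender             = factor(sample(c("female", "male"), n, replace = TRUE))
)
expr <- matrix(rnorm(3 * n, mean = 8), nrow = 3,
               dimnames = list(paste0("gene", 1:3), paste0("s", 1:n)))

# Assumed interpretation: corrected value = residual + intercept,
# so the corrected data stay on the original expression scale.
correct_gene <- function(value, covars) {
  fit <- lm(value ~ center + tissue_source_site + age_at_index + race + gender,
            data = data.frame(value = value, covars))
  residuals(fit) + coef(fit)[["(Intercept)"]]
}

# apply() returns samples x genes, so transpose back to genes x samples.
corrected <- t(apply(expr, 1, correct_gene, covars = covars))
dimnames(corrected) <- dimnames(expr)
```

Two notes on the snippet in the question. If `gender` is stored as character ("male"/"female"), `as.numeric(gender)` returns NA with a warning, so treating it as a factor is safer. And `fit$residuals` alone gives values centered on zero; whether to add the intercept back depends on how "residuals with intercepts" is meant, which only the authors can confirm.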