Gene expression data processing. Illumina HiSeq RNA Sequencing data, processed by reads per kilobase per million (RPKM) method from TCGA (i.e. RNASeqV2), has been downloaded and comprehensive data quality control has been performed.
Primary and metastatic tumor samples were collectively processed by log2(RPKM + 1) transform, followed by quantile-normalization. From then, we split the data into primary-/metastatic-specific gene expression matrix, corrected for batch
effects by the center, platform, and tissue source site (transcription starting site (TSS)) ids from TCGA sample barcodes, and corrected for confounding factors including race, age, and gender by capturing residuals with intercepts from linear
regression model by lm() function from R software (version 3.4.2). This resulted in 103/353 annotated primary/metastatic tumor tissue samples across 19,047 genes.
What is the code to get the corrected data using lm()? This is my attempt:
fit <- lm(value ~ as.factor(center) + as.factor(tissue_source_site) + as.numeric(age_at_index) + as.factor(race) + as.numeric(gender), data = Data)
fit$residuals
Is it correct?
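For comparison, here is a minimal sketch of one way the description could be read. It is not the authors' actual code. It assumes that "residuals with intercepts" means the fitted intercept is added back to the residuals, and that the model is fitted separately for each gene. The data are simulated, and `expr`, `pheno`, and `correct_gene` are hypothetical names; only the covariate column names are taken from the snippet above. Two details differ from that snippet: gender is treated as a factor, because as.numeric() on a character column returns NA, and na.exclude is used so the residual vector keeps one entry per sample even when a covariate is missing.

```r
# Simulated stand-in data: all values here are hypothetical, for illustration only.
set.seed(1)
n_samples <- 40
n_genes <- 5

# Sample annotation; column names follow the snippet in the question.
pheno <- data.frame(
  center = factor(sample(c("C1", "C2"), n_samples, replace = TRUE)),
  tissue_source_site = factor(sample(c("S1", "S2", "S3"), n_samples, replace = TRUE)),
  age_at_index = round(runif(n_samples, 30, 80)),
  race = factor(sample(c("white", "asian", "black"), n_samples, replace = TRUE)),
  gender = factor(sample(c("female", "male"), n_samples, replace = TRUE))
)

# Expression matrix (already log2-transformed and normalized): genes x samples.
expr <- matrix(
  rnorm(n_genes * n_samples, mean = 8),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)), paste0("s", seq_len(n_samples)))
)

# Fit one linear model for a single gene and return intercept + residuals.
correct_gene <- function(value, pheno) {
  dat <- data.frame(value = value, pheno)
  fit <- lm(value ~ center + tissue_source_site + age_at_index + race + gender,
            data = dat, na.action = na.exclude)
  # residuals() pads with NA for samples dropped because of missing covariates
  residuals(fit) + coef(fit)[["(Intercept)"]]
}

# Apply gene by gene; apply() returns samples x genes, so transpose back.
corrected <- t(apply(expr, 1, correct_gene, pheno = pheno))
colnames(corrected) <- colnames(expr)
```

`corrected[1, ]` then holds the adjusted values for the first gene. The quoted text also lists platform as a batch variable; it is left out here because a factor with a single level cannot be used in lm(), which would be the case if every sample came from the same platform.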