Open rialc13 opened 3 days ago
Hi @rialc13 ,
Thanks for the question - I can see it’s a bit confusing. When I ran the initial code for the competition, I ran a batch correction on data that was already batch corrected. But of course this is not the correct way to do it (I kept the code as-is for reproducibility).
As you point out, the correct way would be to run it on the raw data. counts = dat$pbmc_gene_expression$raw_data
is not raw count data, it is TPM-corrected data, so you would not need to apply another transform such as vst
.
In summary, to do this step properly, if you want to run your own batch correction instead of using the batch-corrected data provided, there are 2 possible ways:
1) start with TPM data tpm=dat$pbmc_gene_expression$raw_data
(make sure it is in log space, I can't remember if it is), then run the batch correction limma::removeBatchEffect(tpm, batch=as.vector(dds$dataset))
. This should give you pretty similar results to the dat$pbmc_gene_expression$batchCorrected_data
that is already there.
2) start with the original raw count data (this may still be available through the website), run a normalization other than TPM-normalisation, such as vst
, then run batch correction limma::removeBatchEffect(assay(vsd), batch=as.vector(dds$dataset))
.
I hope this is a bit clearer!
Best,
Nicky
Hi Nicky,
Thanks for creating this repo with so much organization & thorough documentation!
I am trying to replicate your work & found that in line no. 45 of pre-processing/prep_train_data2020+21.R script, you mentioned that
The subsequent code uses
counts = dat$pbmc_gene_expression$batchCorrected_data
& also performs batch correction using limmaassay(vsd) <- limma::removeBatchEffect(assay(vsd), batch=as.vector(dds$dataset)).
Hence, my confusion & I was wondering, did you mean to use
counts = dat$pbmc_gene_expression$raw_data
?