nixstix / CMI-PB2

Predictive modeling of multi-modal assay data from subjects vaccinated against Pertussis
MIT License
1 stars 0 forks source link

Use of batchCorrected data in DESeq2 #1

Open rialc13 opened 3 days ago

rialc13 commented 3 days ago

Hi Nicky,

Thanks for creating this repo with so much organization & thorough documentation!

I am trying to replicate your work & found that in line no. 45 of pre-processing/prep_train_data2020+21.R script, you mentioned that

the input for this section should not be batch-corrected or TPM data, but raw count data

The subsequent code uses counts = dat$pbmc_gene_expression$batchCorrected_data & also performs batch correction using limma assay(vsd) <- limma::removeBatchEffect(assay(vsd), batch=as.vector(dds$dataset)).

Hence, my confusion & I was wondering, did you mean to use counts = dat$pbmc_gene_expression$raw_data?

nixstix commented 1 day ago

Hi @rialc13 ,

Thanks for the question - I can see it’s a bit confusing. When I ran the initial code for the competition, I ran a batch correction on data that was already batch corrected. But of course this is not the correct way to do it (I kept the code as-is for reproducibility).

As you point out, the correct way would be to run it on the raw data. counts = dat$pbmc_gene_expression$raw_data is not raw count data, it is TPM-corrected data, so you would not need to apply another transform such as vst.

In summary, to do this step properly, if you want to run your own batch correction instead of using the batch-corrected data provided, there are 2 possible ways: 1) start with TPM data tpm=dat$pbmc_gene_expression$raw_data (make sure it is in log space, I can't remember if it is), then run the batch correction limma::removeBatchEffect(tpm, batch=as.vector(dds$dataset)) . This should give you pretty similar results to the dat$pbmc_gene_expression$batchCorrected_data that is already there. 2) start with the original raw count data (this may still be available through the website), run a normalization other than TPM-normalisation, such as vst , then run batch correction limma::removeBatchEffect(assay(vsd), batch=as.vector(dds$dataset)).

I hope this is a bit clearer!

Best,

Nicky