loosolab / TOBIAS

Transcription factor Occupancy prediction By Investigation of ATAC-seq Signal
MIT License

batch effect #241

Open drosop opened 8 months ago

drosop commented 8 months ago

Hi,

I'm working on bulk ATAC-seq data. The experiment was run in two batches, and in the PCA plot the samples separate by batch, which indicates a strong batch effect.

Can I use these samples with the batch effect for running TOBIAS?

Is there any way to remove the batch effect prior to TOBIAS? I used limma to correct for the batch effect in the differential accessibility analysis.

Thank you,

msbentsen commented 8 months ago

Hi @drosop ,

Unfortunately no, there is no way to add batch effect correction to TOBIAS.

If your samples are separated like this:

batch1: condition1-rep1, condition2-rep1
batch2: condition1-rep2, condition2-rep2

the effect might cancel out if you merge the .bam-files of each condition.

Another option might be to run every sample individually with TOBIAS, and then try to correct the footprint-scores manually afterwards. But all in all, there is no direct way to do this in TOBIAS, sorry!
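
As a toy illustration of why merging can cancel the effect in a balanced design, here is a minimal sketch. It assumes the batch effect is a purely additive offset on the signal, which real ATAC-seq data may not follow. All names and numbers are hypothetical, and nothing here is TOBIAS code.

```python
# Toy model (assumption): observed signal = condition signal + additive batch offset.
condition_signal = {"condition1": 10.0, "condition2": 14.0}  # hypothetical values
batch_offset = {"batch1": 0.0, "batch2": 5.0}                # hypothetical values


def merged_signal(samples):
    """Mean signal over (condition, batch) samples, standing in for a merged .bam."""
    values = [condition_signal[c] + batch_offset[b] for c, b in samples]
    return sum(values) / len(values)


# Balanced design: each condition has one replicate in each batch.
balanced_diff = (
    merged_signal([("condition2", "batch1"), ("condition2", "batch2")])
    - merged_signal([("condition1", "batch1"), ("condition1", "batch2")])
)

# Confounded design: condition1 only in batch1, condition2 only in batch2.
confounded_diff = (
    merged_signal([("condition2", "batch2"), ("condition2", "batch2")])
    - merged_signal([("condition1", "batch1"), ("condition1", "batch1")])
)

true_diff = condition_signal["condition2"] - condition_signal["condition1"]
print(true_diff, balanced_diff, confounded_diff)
```

In the balanced case both merged conditions carry the same average offset, so the difference between them equals the true difference. In the confounded case the offset is added to the difference and cannot be separated from the condition effect.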

DossenaCarolina commented 7 months ago

Hi @msbentsen,

thanks for creating this fantastic tool! I have a follow-up question on this topic. When you mention:

If your samples are separated like this:

batch1: condition1-rep1, condition2-rep1
batch2: condition1-rep2, condition2-rep2

the effect might cancel out if you merge the .bam-files of each condition.

does this imply that overcoming batch effects might be feasible if the dataset is paired for all conditions? My ATAC-seq samples are from tumor-infiltrating (T), normal-tissue (N) and peripheral blood (PB) immune cells from distinct patients and they are clearly separated by subject. For only one patient I lack the PB sample, but currently I've included all samples in TOBIAS, merging the bam files by tissue condition (i.e. 5 T, 5 N and 4 PB). Do you think that I should remove the patient without PB to achieve a more balanced dataset (i.e. paired samples: 4 T, 4 N, 4 PB) in order to better mitigate the batch effect? Or, since my main comparison is actually T vs N, would it be better to run an analysis with only the paired T and N samples (i.e. 5 T and 5 N)?

Regarding the second option proposed in the previous answer: if I run the analysis individually for each patient, what method do you recommend for manually correcting the footprint scores afterward?

Thanks again!

Carolina

msbentsen commented 6 months ago

Hi @DossenaCarolina ,

Sorry for the late reply. Regarding the paired samples, I will say that in theory the batch effects should cancel out if each condition contains paired data. So if you have something like this, with millions of reads per sample:

Patient  T     N     PB
P1       3**6  4**6  -
P2       3**6  4**6  4**6
P3       3**6  4**6  4**6
etc.     ...   ...   ...

The percentage influence of each patient on the T/N comparison should be equal when the same patients are used on both sides. So I would agree with running it paired, as 4T-4N-4PB or 5T-5N, rather than as 5T-5N-4PB.
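
The arithmetic behind this can be sketched as follows. The read counts are hypothetical round numbers (3 million per T sample, 4 million per N and PB sample, five patients), chosen only to illustrate the idea. The helper `patient_shares` is made up for this example.

```python
def patient_shares(reads_by_patient):
    """Fraction of the merged reads that each patient contributes."""
    total = sum(reads_by_patient.values())
    return {patient: reads / total for patient, reads in reads_by_patient.items()}


patients = ["P1", "P2", "P3", "P4", "P5"]

# Hypothetical read counts per sample; P1 has no PB sample.
reads = {
    "T": {p: 3_000_000 for p in patients},
    "N": {p: 4_000_000 for p in patients},
    "PB": {p: 4_000_000 for p in patients if p != "P1"},
}

shares = {tissue: patient_shares(counts) for tissue, counts in reads.items()}

# 5T-5N: every patient contributes the same share to both merged conditions.
print(shares["T"] == shares["N"])

# 5T-5N-4PB: the merged PB pool has a different patient composition.
print(shares["PB"])

# 4T-4N-4PB: dropping P1 everywhere makes all three pools match.
paired = {
    tissue: patient_shares({p: r for p, r in counts.items() if p != "P1"})
    for tissue, counts in reads.items()
}
print(paired["T"] == paired["N"] == paired["PB"])
```

With these numbers each patient makes up one fifth of the merged T and N pools but one quarter of the merged PB pool, so any patient-specific signal is weighted differently in PB than in T or N unless P1 is dropped.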

For manually correcting footprint scores, you might look at something like limma or ComBat, or even just quantile normalization if the effect is only in the strength of the signal. This is not something I have done however, so I cannot speak for how well it works.
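
To make the quantile normalization idea concrete, here is a minimal pure-Python sketch. It assumes you have already collected one footprint score per binding site per sample from separate runs, with sites in the same order for every sample. How to extract that table is not shown. The function `quantile_normalize` and the example values are hypothetical, and this is untested on real footprint scores.

```python
def quantile_normalize(scores_by_sample):
    """Give every sample the same score distribution while keeping per-sample ranks.

    scores_by_sample: dict mapping sample name -> list of scores (same sites, same order).
    Ties are broken by position, which is a simplification.
    """
    names = list(scores_by_sample)
    n_sites = len(scores_by_sample[names[0]])

    # Reference distribution: mean of the k-th smallest score across samples.
    sorted_columns = [sorted(scores_by_sample[name]) for name in names]
    reference = [
        sum(column[k] for column in sorted_columns) / len(names)
        for k in range(n_sites)
    ]

    normalized = {}
    for name in names:
        scores = scores_by_sample[name]
        # Site indices ordered from lowest to highest score in this sample.
        order = sorted(range(n_sites), key=lambda i: scores[i])
        result = [0.0] * n_sites
        for rank, site_index in enumerate(order):
            result[site_index] = reference[rank]
        normalized[name] = result
    return normalized


# Hypothetical scores for three sites; patient2 has a stronger overall signal.
scores = {
    "patient1": [1.0, 4.0, 2.0],
    "patient2": [3.0, 10.0, 6.0],
}
normalized = quantile_normalize(scores)
print(normalized)
```

Here both patients rank the three sites identically and differ only in signal strength, so after normalization their scores coincide. Differences in the ranking of sites between samples would be preserved.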

I hope that helps you out!