loosolab / TOBIAS

Transcription factor Occupancy prediction By Investigation of ATAC-seq Signal
MIT License
193 stars 41 forks source link

question: differential TF activity/score driven by accessibility #55

Closed alexyfyf closed 3 years ago

alexyfyf commented 3 years ago

Hi team,

I've tried your method to predict footprint scores (which I think is the sum of accessibility and footprint depth). The result is hard to interpret though. For example, this GATA1 is predicted to be more active along time from the volcano plot.

But when I visualize it, you can see clearly it is driven by accessibility, and the effect of footprint essentially was masked. image

And if I don't scale for y, it looks like this, and the depth is larger in the middle state. image

Could you suggest how to interpret this properly?

Thank you!

msbentsen commented 3 years ago

Hi @alexyfyf,

Yes, I agree with your observation - because the changes in accessibility were so much bigger than the changes in footprint, the accessibility ended up driving the score. Biologically, there is more evidence for a TF correlating with accessibility than it always making a footprint, so I would probably trust the accessibility more. This is also because the tall-signals seen (e.g. in 6mL) can be an effect of protein binding, although it is not a canonical "footprint". But depending on the TF (repressor/activator?) and your hypothesis, the interpretation might be different!

My only question is: Did you run ATACorrect with the same peak-set for all three conditions? (6wW, 6wL and 6mL). Otherwise, that can lead to differences in the normalization between conditions.

I am sorry that I can't give you a more definitive answer, but there are unfortunately still a lot of things still unclear in footprinting - and interpretations like this is one them!

alexyfyf commented 3 years ago

Hi @msbentsen , thank you for developing this tool and your reply. I've checked my analysis, and I found that I used corresponding peak files for each bam file when running ATACorrect and FootprintScores, but used a consensus peak set when running BINDetect. Is that an issue? So what do you suggest?

This is the script for ATACorrect and FootprintScores

BASE=$(basename $1 .final.bam)
PEAK=$(echo $1 | sed 's/\.final\.bam/_peaks\.narrowPeak/')

echo $BASE
echo $PEAK

TOBIAS ATACorrect --bam $1 --genome ../../ref/mmu/ucsc_mm10/mm10.fa --peaks $PEAK --outdir $BASE --cores 12

TOBIAS FootprintScores --signal ${BASE}/${BASE}.final_corrected.bw --regions $PEAK --output ${BASE}/${BASE}_footprints.bw --cores 12

This is the code for BINDetect

TOBIAS BINDetect --motifs JASPAR2020_CORE_vertebrates_non-redundant_pfms_jaspar.txt --signals 6wW/6wW_footprints.bw 6wL/6wL_footprints.bw 6mL/6mL_footprints.bw --genome ../../ref/mmu/ucsc_mm10/mm10.fa --peaks ../atac_diff/output/consensus_peaks.bed --outdir BINDetect_output --time-series --cores 12 --cond-names WE preLSC LSC
alexyfyf commented 3 years ago

I've also tried to make a plot from deeptools, using the corrected bw files and {TF}_all.bed for example for this GATA1

computeMatrix scale-regions -R ../BINDetect_output/GATA1_MA0035.4/beds/GATA1_MA0035.4_all.bed -S ../6wW/6wW.final_corrected.bw ../6wL/6wL.final_corrected.bw ../6mL/6mL.final_corrected.bw  -o out.mat.gz -b 100 -a 100  -bs 2 -p 12 --skipZeros
plotProfile -m out.mat.gz -out test1.png

image

And it looks like the accessibility around the motifs is not very different. But it looks different compare to the output from plotAggregate from TOBIAS.

msbentsen commented 3 years ago

I've checked my analysis, and I found that I used corresponding peak files for each bam file when running ATACorrect and FootprintScores, but used a consensus peak set when running BINDetect. Is that an issue? So what do you suggest?

Hi, I suggest to use a merge of all condition peaks as input for the ATACorrect and FootprintScores - sorry that this was not more clear! In your case, that would be the merge of your .narrowPeak files. The reason is that you need to have values across the whole .bed-space for comparing in the BINDetect-run. Otherwise, you will be comparing with "0" for differential peaks (rather than the true value), and this can skew the normalization. This might be what has happened here, so I hope using the same .bed-file will solve the issue.

Is ../atac_diff/output/consensus_peaks.bed a merge of all or a file with common peaks? In any case, it is best if --bed/--peaks are the same across all runs.

alexyfyf commented 3 years ago

I've checked my analysis, and I found that I used corresponding peak files for each bam file when running ATACorrect and FootprintScores, but used a consensus peak set when running BINDetect. Is that an issue? So what do you suggest?

Hi, I suggest to use a merge of all condition peaks as input for the ATACorrect and FootprintScores - sorry that this was not more clear! In your case, that would be the merge of your .narrowPeak files. The reason is that you need to have values across the whole .bed-space for comparing in the BINDetect-run. Otherwise, you will be comparing with "0" for differential peaks (rather than the true value), and this can skew the normalization. This might be what has happened here, so I hope using the same .bed-file will solve the issue.

Is ../atac_diff/output/consensus_peaks.bed a merge of all or a file with common peaks? In any case, it is best if --bed/--peaks are the same across all runs.

Hi, thanks for the explanation, and the consensus_peaks.bed is kind of common peaks. So I assume you suggest to do something similar to cat *.narrowPeaks | sort -k1,1 -k2,2n | bedtools merge, which will give you cut counts in all regions covered in any condition.

msbentsen commented 3 years ago

Hi, thanks for the explanation, and the consensus_peaks.bed is kind of common peaks. So I assume you suggest to do something similar to cat *.narrowPeaks | sort -k1,1 -k2,2n | bedtools merge, which will give you cut counts in all regions covered in any condition.

Yes, this is exactly what I mean - I use the exact same command. In that way, you are sure that scores for all regions of interest (common and differential in both directions) are available in the files.

alexyfyf commented 3 years ago

Hi, thanks for the explanation, and the consensus_peaks.bed is kind of common peaks. So I assume you suggest to do something similar to cat *.narrowPeaks | sort -k1,1 -k2,2n | bedtools merge, which will give you cut counts in all regions covered in any condition.

Yes, this is exactly what I mean - I use the exact same command. In that way, you are sure that scores for all regions of interest (common and differential in both directions) are available in the files.

Sure, I'll follow your advice and rerun it. Thank you so much! I'll update if any more question if that's ok?

msbentsen commented 3 years ago

You're welcome, and yes sure! I will close this issue for now, but you can always reopen if you have any more questions.