loosolab / TOBIAS

Transcription factor Occupancy prediction By Investigation of ATAC-seq Signal
MIT License

Different locus numbers across conditions and Strategies for Multi-Tissue TF Footprint Score Comparison #239

Closed Myrtle-bio closed 3 months ago

Myrtle-bio commented 1 year ago

Hello, I appreciate your continued assistance; it has been very useful to me!

I am working with data from multiple conditions, and I want to identify which TFs are important in each condition within certain BED regions, similar to your own research.


I've observed significant variation in the locus numbers within the BINDetect results across conditions. For instance, in one condition's //TF1_overview.txt there are over 20,000 rows, whereas in another condition's //TF1_overview.txt there are over 30,000 rows.

(screenshot of the documentation) Based on this description, I initially presumed that the locus numbers would be consistent across conditions.

(screenshot) Based on this, could a TFBS with no output be explained by F[i, i+Wf] < 0? I noticed that some TFBS footprints in the output have condition_score = 0.

This leads me to the following questions:

  1. When comparing the significance of TF1 in my region of interest across conditions, the number of binding sites differs per condition. For example, the BINDetect result shows 7 sites in condition1, but the same 7 sites plus 3 more in condition2. Should I use the maximum TF_condition score or the mean TF_condition score? I lean towards the mean strategy from a biological perspective, but I am unsure whether it is fair to divide condition1, which has only 7 binding sites, by 7 when condition1 may not even be bound at the three extra sites, unlike condition2. On the other hand, if I divide by 10, the footprint scores at those three sites are not necessarily 0; as mentioned above, there are sites with condition_score = 0 in the output.
  2. After obtaining the mean TF_condition score for each condition, you mentioned that (screenshot of the documentation) additional normalization should not be needed, right? But I've noticed a clear bias in certain situations (screenshot), and biologically it seems improbable that all TFs would exhibit this pattern. Could I be overlooking something? To clarify, I use ATAC peaks from the entire genome as input, then run bedtools intersect between the BINDetect results and my regions of interest to obtain the footprint scores within those specific regions. This differs from directly using the peaks within my regions of interest as input.
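For context, the intersect-then-average workflow in question 2 can be sketched in pure Python roughly as below. The column names (`chrom`, `start`, `end`, `cond1_score`) and the ROI tuple format are hypothetical stand-ins for the actual *_overview.txt columns, not the real TOBIAS header:

```python
def overlaps_roi(chrom, start, end, rois):
    """True if the site overlaps any region of interest.

    rois is a list of (chrom, start, end) tuples; a simple linear scan
    stands in for bedtools intersect here."""
    return any(c == chrom and start < r_end and end > r_start
               for c, r_start, r_end in rois)

def mean_roi_score(sites, rois, score_col):
    """Mean footprint score over all TFBS that fall inside the ROIs.

    sites is an iterable of dicts, e.g. rows from csv.DictReader over a
    BINDetect *_overview.txt (column names here are assumed).
    Zero-score sites are deliberately kept in the average, mirroring the
    condition_score = 0 sites seen in the output."""
    scores = [float(s[score_col]) for s in sites
              if overlaps_roi(s["chrom"], int(s["start"]), int(s["end"]), rois)]
    return sum(scores) / len(scores) if scores else 0.0
```

Note that in this sketch the denominator is the number of ROI sites reported for that condition, which is exactly the 7-vs-10 ambiguity raised in question 1.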

I apologize for the barrage of questions, and I hope you have a wonderful Halloween!

Myrtle-bio commented 1 year ago

If I want to study which TF has the maximum footprint score in the regulatory region of a specific gene (i.e. which is the most important), and there are multiple sites for the same TF within that regulatory region, which approach is more suitable: taking the maximum value or the average value?

Myrtle-bio commented 1 year ago

As mentioned in your article, not all TFs will necessarily form a TF footprint. So, if in condition1 TF1 has a mean footprint score of 1.3 and TF2 has a mean footprint score of 1.5, does that necessarily imply that TF2 is more robust or important?

msbentsen commented 1 year ago

Hi @Myrtle-bio,

Thank you for your questions - I will try to summarize here:


> If I want to study which TF has the maximum footprint score in the regulatory region of a specific gene (i.e. which is the most important), and there are multiple sites for the same TF within that regulatory region, which approach is more suitable: taking the maximum value or the average value?

I think this depends on the biological question, but in most cases I would recommend taking the maximum value. We can assume that a transcription factor can have more than one possible binding site in a region, but the site with the largest footprint score is the most likely to be bound in that condition.
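The two aggregation strategies discussed here can be made concrete with a tiny helper (a hypothetical sketch, not part of TOBIAS; input is the list of per-site scores for one TF in one regulatory region):

```python
def aggregate_tf_scores(site_scores, strategy="max"):
    """Summarize per-site footprint scores for one TF in one region.

    'max' follows the recommendation above: the site with the largest
    footprint score is the most likely to be bound in that condition.
    'mean' averages over all sites, including zero-score sites, so the
    choice of denominator matters when site counts differ per condition."""
    if not site_scores:
        return 0.0
    if strategy == "max":
        return max(site_scores)
    if strategy == "mean":
        return sum(site_scores) / len(site_scores)
    raise ValueError(f"unknown strategy: {strategy}")
```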


> As mentioned in your article, not all TFs will necessarily form a TF footprint. So, if in condition1 TF1 has a mean footprint score of 1.3 and TF2 has a mean footprint score of 1.5, does that necessarily imply that TF2 is more robust or important?

If TF2 has a higher footprint score than TF1, it means that TF2 shows a more robust footprint/accessibility signal, but that does not necessarily mean it is more important. This is similar to gene expression, where the most highly expressed genes are not necessarily the most important. For this reason, we usually only compare footprint scores per TF across conditions, not between different TFs, as footprint scores are difficult to compare across TFs.
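Following that per-TF-across-conditions advice, one hypothetical way to compare the same TF between two conditions is a log2 ratio of mean scores (the pseudocount is my own assumption to keep the ratio defined, not a TOBIAS default):

```python
import math

def per_tf_condition_change(scores_cond1, scores_cond2, pseudocount=1.0):
    """log2 fold change of the mean footprint score for ONE TF between
    two conditions. A positive value means the TF footprints more
    strongly in condition2; the pseudocount avoids division by zero
    when a condition has no scored sites."""
    mean1 = sum(scores_cond1) / len(scores_cond1) if scores_cond1 else 0.0
    mean2 = sum(scores_cond2) / len(scores_cond2) if scores_cond2 else 0.0
    return math.log2((mean2 + pseudocount) / (mean1 + pseudocount))
```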

I hope these answers covered the questions!

Allischoo commented 4 months ago

Hi!

I have what I think is a related question, as it concerns the score normalisation. I am trying to compare footprinting between conditions at specific regions of the genome (ROIs) vs genome-wide. My conditions have very different genome-wide coverage (control >> treated), with more similar and higher coverage at the ROIs.

I have tried this two ways with very different results. I always input the corrected signal bigwigs in the same bindetect run:

  1. --peaks = genome-wide peaks, --output-peaks = ROI peaks
  2. --peaks = ROI peaks (run separately)

Option 1 seems to over-estimate the footprints of the treated sample vs control in the ROIs, which I assume is due to normalising the scores by the genome-wide quantiles. However, option 2 returns significantly fewer footprinted regions across TFs in the ROIs compared to option 1. I get >1000 sites in most instances. Do you think this is sufficient for the background correction? I was also wondering whether there is a sane way to compare without quantile normalisation in this case.

Thank you!

hschult commented 4 months ago

Hi @Allischoo,

Your first version is the recommended approach. BINDetect will generate a background distribution by randomly subsetting peaks. This could explain why you see fewer sites with option 2, since your ROI peaks may have higher scores than the full set of peaks. Regarding your concern about over-estimation, you can disable quantile normalization with --norm-off if you think that a global normalization is not applicable in your case. However, I strongly recommend looking into the composition of your data (e.g. the score distributions of your samples) before doing so.
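One quick way to inspect those distributions before reaching for --norm-off is to compare per-condition score quartiles; a minimal pure-Python sketch (how you collect each condition's scores is left open):

```python
import statistics

def score_summary(scores):
    """Quartiles of one condition's footprint scores.

    If the genome-wide distributions differ strongly between conditions
    while the ROI distributions do not, global quantile normalization
    may distort the ROI comparison; comparing these summaries per
    condition is a cheap first check."""
    q1, median, q3 = statistics.quantiles(scores, n=4)
    return {"q1": q1, "median": median, "q3": q3}
```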

github-actions[bot] commented 3 months ago

No activity for at least 30 days. Marking issue as stale. Stale issues are closed after one week.