loosolab / TOBIAS

Transcription factor Occupancy prediction By Investigation of ATAC-seq Signal
MIT License
188 stars, 40 forks

Clustergram --- "(Clusters below threshold are colored)" #212

Closed c2b2pss closed 1 month ago

c2b2pss commented 1 year ago

Hi,

  1. The phrase "below threshold" is a tiny bit confusing. So colored clusters are not significant?
  2. Also, in the same figure "Differential binding score" there are some circles that are solid, and some open. Any specific reason or just for distinguishing?

Many thanks for an excellent program!

image

msbentsen commented 1 year ago

Hi @c2b2pss

  1. The threshold refers to the "transcription factor distance", which is lower for TFs with high similarity (as they are close to each other). Thereby, all clusters below the threshold (I believe it is 0.5) are colored. Single clusters or clusters above threshold are black.
  2. Ah yes, I see this is not so clear. The solid circles are the TF motifs that are highlighted in the upper volcano plot, whereas the transparent circles are not significant. I find that some PDF readers cut off the top plot if the page is too long, so please try another PDF reader in case you do not see the volcano.
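The coloring rule from point 1 can be sketched roughly as follows. This is only an illustration: the distance values are made up, and the single-linkage grouping is an assumption rather than something taken from the TOBIAS source. TFs joined at a distance below the threshold form a colored cluster; a TF left on its own stays black.

```python
from itertools import combinations

# Hypothetical pairwise "TF distances" (0 = very similar, 1 = unrelated).
distances = {
    ("NFKB1", "NFKB2"): 0.2,
    ("NFKB1", "HSF1"): 0.9,
    ("NFKB2", "HSF1"): 0.8,
}


def clusters_below_threshold(names, distances, threshold=0.5):
    """Group names linked by a pairwise distance below the threshold.

    Single linkage is an assumption made for brevity here.
    """
    parent = {n: n for n in names}

    def find(n):
        # Follow parent pointers to the cluster root (with path halving).
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        d = distances.get((a, b), distances.get((b, a), 1.0))
        if d < threshold:
            parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())


def color_for(group):
    """Clusters with several members are colored; single TFs stay black."""
    return "colored" if len(group) > 1 else "black"


names = ["NFKB1", "NFKB2", "HSF1"]
result = {tuple(g): color_for(g) for g in clusters_below_threshold(names, distances)}
print(result)
```

With these made-up distances, NFKB1 and NFKB2 end up in one colored cluster and HSF1 stays black, independent of whether any of them is significant in the volcano plot.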

c2b2pss commented 1 year ago

Thanks Mette!

  1. Just to clarify: for the open/solid circles, taking the attached figure as an example, NFKB2 vs NFKB1, and then both vs TYY2, your statement would still be confusing.

  2. Is it usual for the differential binding delta to be on the order of 0.2 - 0.3?

Thanks again for your time.

msbentsen commented 1 year ago

Hi,

  1. Open/solid circles refer to the thresholds of "differential binding score" as seen in the upper volcano plot. The coloring of the TFs refers to the clusters in the dendrogram (right side), and whether these are below the distance threshold. The dendrogram coloring is independent of the open/solid circles. Hope this makes more sense.

  2. Yes, this is usual. The differential score is an estimation of a "Cohen's d effect size", where 0.2 is considered a "small" change, so I usually set 0.15-0.2 as a rough threshold (even if TFs below the threshold are shown). But this might depend on the data. I found an overview on Wikipedia of different effect sizes:

image

So in your case, the majority of TFs are not changed (which is to be expected), but for example HSF1 would be assigned to have a small change between the conditions.
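For reference, a textbook Cohen's d can be sketched as below. This is the generic formula with a pooled standard deviation, not the way BINDetect derives its score (which is described above only as an estimation of this quantity). The cutoffs are Cohen's conventional 0.2 / 0.5 / 0.8, and the "negligible" label for values under 0.2 is just a name chosen for this example.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d: difference of means divided by the pooled sample SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_b) - mean(group_a)) / sqrt(pooled_var)


def magnitude(d):
    """Label an effect size using Cohen's conventional cutoffs."""
    d = abs(d)
    if d < 0.2:
        return "negligible"
    if d < 0.5:
        return "small"
    if d < 0.8:
        return "medium"
    return "large"


# A differential score of 0.25 would fall in the "small" band.
print(magnitude(0.25))
```

Read this way, a differential score in the 0.2 - 0.3 range sits at the lower end of "small", which is why a rough threshold of 0.15-0.2 only picks out the TFs with a noticeable shift.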

sufyazi commented 1 year ago

Hi Mette,

I am not sure if I should open a new issue because this thread has clarified a few of my questions, and I think my follow-up question is related to this topic.

In the bindetectresults.txt, there is a column named 'cluster'. Most of the values are prefixed with 'C' and a TF name. It is apparent to me that this represents some form of TF clustering, but it was not clear what kind of clustering is done here.
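To see which TFs share a value in the 'cluster' column, something like the following could be used. It is a sketch under assumptions: the file is treated as tab-separated with a 'name' column next to 'cluster', and the rows shown are a made-up excerpt, not real output.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt; a real results file has many more columns.
sample = "name\tcluster\nTFA\tC_TFA\nTFB\tC_TFA\nTFC\tC_TFC\n"


def members_by_cluster(handle):
    """Map each value of the 'cluster' column to the TF names that carry it."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        groups[row["cluster"]].append(row["name"])
    return dict(groups)


clusters = members_by_cluster(io.StringIO(sample))
print(clusters)
```

For a file on disk, the same function can be called with an open file handle instead of the `io.StringIO` wrapper.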

So based on your replies in this thread, is this assumption of mine correct?

  1. The clusters in the output text file refer to TF distance plotted here in the clustergram in the output pdf. The 'similarity' here refers to the degree of similarity between different TFs' MOTIFS (not the TFs themselves) – in other words if motif A is associated with TFA, and motif B is associated with TFB, and the two motifs are kinda similar (I assume, base pair overlap?) they would have a closer 'TF distance' and be clustered together. It feels logical to me but in the documentation, it does not seem to match this understanding.

/TF_distance_matrix.txt Distance matrix used to cluster the transcription factors in the bindetect_figures-dendrograms. This is based on the overlap of individual transcription factor binding sites.

Instead, it sounds like the clustering is based on how many TF motifs overlap a genomic region – so if TFA motif and TFB motif are found overlapping to a certain degree across a similar genomic region, they would cluster together in the dendrogram. Is this right?

  2. What determines the 'TF name' that gets assigned to the 'cluster' column in the final text file? Would it be C_TFA, or C_TFB, based on my hypothetical example? I was looking at my output data and I could not see the pattern or logic of the naming assignment. Based on the documentation,

    Motif clustering based on the overlap of all identified TFBS per motif. The clusters are named according to one representative TF from each cluster.

This is not clear at all to me. Programming-wise, how is 'representative' TF chosen? Just the first TF in the cluster?
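To make the second reading concrete (clustering by overlap of binding sites rather than by motif similarity), here is one way such a distance could look. This is purely a guess at the idea behind the documentation sentence quoted above; the formula, the function names, and the coordinates are all hypothetical and are not taken from TOBIAS.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def overlap_distance(sites_a, sites_b):
    """1 minus the fraction of the smaller site set that overlaps the other.

    0.0 means every site of the smaller set overlaps a site of the other
    TF; 1.0 means no overlap at all.
    """
    if not sites_a or not sites_b:
        return 1.0
    small, large = sorted((sites_a, sites_b), key=len)
    hits = sum(any(overlaps(s, t) for t in large) for s in small)
    return 1.0 - hits / len(small)


# Made-up binding sites for two hypothetical TFs.
tfa_sites = [("chr1", 100, 110), ("chr1", 200, 210)]
tfb_sites = [("chr1", 105, 115), ("chr2", 200, 210)]
print(overlap_distance(tfa_sites, tfb_sites))
```

Under this reading, two TFs with different motifs could still cluster together if their predicted sites tend to fall on the same genomic positions.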

AnasAnsari123 commented 4 months ago

Hi! First of all, I wish to say this is a really good tool. I am currently working on ATAC sequencing, and I have a query regarding BINDetect. When using this function I got a bindetect figures PDF file, which is quite a big file that contains the volcano plot as well as the dendrogram. My problem is that I can't view the volcano plot clearly, and I need to skip the creation of the dendrogram; I only need the volcano plot. So is there any parameter to skip the creation of the dendrogram and to view the volcano plot clearly?

mohobein commented 4 months ago

Hey @AnasAnsari123,

Have you tried opening the file using a different PDF viewer yet? Sometimes, certain viewers struggle with this file, as the page containing both the volcano plot and the dendrogram is quite long. I have never had trouble viewing the plot using the browser PDF viewers of Firefox, Edge, and Chrome. Otherwise, there should also be an interactive HTML plot of the volcano plot depicting the same thing in the BINDetect output directory.

I hope this helps. If the pdf viewer was not the problem, let me know and we can look further.

Best regards, Moritz

c2b2pss commented 4 months ago

Dear Moritz,

I am having a different issue, but it looks like I am tagged on this other one.

Can you please look at my snakemake pipeline issue?

Thanks!


AnasAnsari123 commented 4 months ago

Thank you for your reply, Moritz.

github-actions[bot] commented 2 months ago

No activity for at least 30 days. Marking issue as stale. Stale issues are closed after one week.

AnasAnsari123 commented 1 month ago

Dear Moritz, I have a problem with BINDetect. In the bindetectresults.xlsx file, the names column is blank and the cluster column shows just "C". Because the names are not shown, only one row of scores appears in the motif clusters. Why is this happening? Previously it worked well, but now it shows up like this.