loosolab / TOBIAS

Transcription factor Occupancy prediction By Investigation of ATAC-seq Signal
MIT License

Queries on generation of TOBIAS paper's heatmap and BINDetect's time series function. #5

Closed rdbcasillas closed 4 years ago

rdbcasillas commented 4 years ago

Hello Mette,

Thanks again for the TOBIAS software; we are finding it very useful for our project. This time we have a question regarding the output in Fig 2 of the preprint, which clusters TFs based on co-activity. We have time-series ATAC-seq datasets that we analyzed using the time-series flag of the TOBIAS BINDetect command. We extracted the mean scores from bindetect_results.txt, which we then clustered and visualized using heatmap.2. The output figures follow.
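Concretely, our clustering input looks roughly like this (a minimal sketch; the TF names and mean-score column names are stand-ins for the ones in our bindetect_results.txt):

```python
import numpy as np
import pandas as pd

# Toy stand-in for bindetect_results.txt: one row per TF motif, one
# "<condition>_mean_score" column per time point (names are examples).
df = pd.DataFrame(
    {"E11_mean_score": [0.10, 0.50],
     "E12_mean_score": [0.12, 0.55],
     "E13_mean_score": [0.40, 0.20]},
    index=["TF_A", "TF_B"],
)

# Row z-score (per TF), which is what we feed into heatmap.2:
z = df.sub(df.mean(axis=1), axis=0).div(df.std(axis=1, ddof=0), axis=0)
```

Each row of `z` then has mean 0 and standard deviation 1 across the time points.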

[heatmap images: devseries2, E11toE14]

In the two rounds of analyses (the first is E11 to Adult, the second E11 to E14), we see overall low activity for all TFs in E11 and E12, which doesn't fully reflect the biology. This led to a couple of questions, on which any insight is appreciated:

  1. How was Figure 2 in the preprint generated? I am assuming that the heatmap simply shows the TOBIAS mean scores for each TF across the different samples (z-scored by TF).

  2. Is it a valid concern that two ages show low activity (compared to all the others) for all TFs? Or is there something we can tweak in the pipeline to better reflect differential TF activity across age?

Thank you!

Vatsal

msbentsen commented 4 years ago

Hi Vatsal,

I will try to give a few answers and some suggestions:

  1. Figure 2 was created exactly as you describe: taking the mean scores of bindetect_results.txt and plotting the row z-score normalized data.

  2. I agree that the low scores of E11-E12 look more like a technical effect than a biological one, so this is a very valid concern! Which version of TOBIAS are you using? (TOBIAS --version will tell you.) I am asking because I added normalization of the input values in TOBIAS 0.4.0. This means that BINDetect tries to normalize all conditions towards each other using quantile normalization (there is also a page in bindetect_figures.pdf showing this). If E11-E12 have some large outlier values, or are very different in chromatin structure from the other samples, the internal normalization may not have handled them well. Did you remove ENCODE-blacklisted regions from the peaks? These often lead to issues.
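To illustrate the idea, quantile normalization forces all conditions onto a common score distribution (a toy sketch of the concept, not BINDetect's actual implementation):

```python
import numpy as np
import pandas as pd

# Two conditions with the same ranking but different scales:
scores = pd.DataFrame({"E11": [0.1, 0.2, 0.9],
                       "E12": [0.3, 0.6, 1.5]})

# Rank values within each condition, then replace each rank by the
# mean of the values holding that rank across all conditions.
ranks = scores.rank(method="first").astype(int)
rank_mean = scores.stack().groupby(ranks.stack()).mean()
qnorm = ranks.apply(lambda col: col.map(rank_mean))
```

After this step both columns share the same value distribution, which is what makes the conditions comparable.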

However, I think my main question is whether the ATAC-seq libraries per condition are comparable in terms of quality (number of reads in peaks, fragment size distribution, etc.). I have seen in the past that some libraries don't perform well in footprinting, although I have yet to figure out the specifics (it might be the extent of Tn5 digestion, related to cell type, etc.; I am currently working on this).

Let me know if the quality might be an issue, otherwise I will consider tweaking the normalization or maybe adding a "--norm-off" parameter so that you have more control over the normalization outside of BINDetect.

Best Mette

rdbcasillas commented 4 years ago

Thank you Mette!

The version I am using is 0.9.0. Also, we are using the ENCODE ATAC-seq pipeline, which automatically filters the blacklisted regions from the narrowPeak files.

The libraries for the time series come from the ENCODE consortium, so all of them have passed the QC checks for number of reads, fragment size distribution, etc.

I am curious about the tweaks to the normalization that you refer to. Is this feasible? Let me know if you need any information about the input or output to test that functionality. I have attached the normalization page from bindetect_figures.pdf, though I am not sure whether it is helpful.

[normalization figure from bindetect_figures.pdf]

Thank you again Mette, we really appreciate your help thinking through another aspect of TOBIAS's results.

Regards, Vatsal

msbentsen commented 4 years ago

Hi again Vatsal,

Ah I see, that's great, then we can exclude quality as an issue. Looking at the plots, I do have an idea of where the problem arises: as you can see, there are a lot of 0-scores in each condition. This just means that there are ATAC-seq peaks in the merged peak set which are exclusively "open" in maybe 1-2 time points and "closed" (0) in the rest. It is hard to see, but E11-E12 may have fewer open chromatin regions in total, meaning that the number of "closed"/0-regions will be higher, bringing down the mean scores. This is definitely something that BINDetect should handle better!
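A toy example of how the extra zeros drag the mean down (the numbers are made up for illustration):

```python
import numpy as np

# The same footprint scores at the genuinely bound sites in two
# conditions, but E11 contributes more closed (0) peaks to the
# merged peak set:
open_sites = np.array([0.8, 0.9, 0.7])
e11 = np.concatenate([open_sites, np.zeros(7)])  # many closed regions
e14 = np.concatenate([open_sites, np.zeros(2)])  # few closed regions

# Identical signal at bound sites, yet e11.mean() < e14.mean():
e11_mean, e14_mean = e11.mean(), e14.mean()
```

So a condition with fewer open regions overall looks "less active" for every TF, even when its footprints are just as strong.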

If you could share the ENCODE accessions for the time series data, that would be really helpful - then I will try to have a look at how such 0's are handled, but I of course can't promise that it will be a quick fix.

In the meantime, you might have a look at the columns in bindetect_results.txt named "*_change". These values are normalized to the background change between each pair of conditions and should be more robust. If you plot the heatmap from "E11_E12_change", "E12_E13_change", etc., this should also give you an idea of the up/down changes for each TF.
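In case it helps, pulling those columns out could look like this (a sketch; the exact column names depend on your conditions, and the values here are invented):

```python
import pandas as pd

# Hypothetical slice of bindetect_results.txt with the "*_change"
# columns described above:
df = pd.DataFrame(
    {"output_prefix": ["TF_A", "TF_B"],
     "E11_E12_change": [0.15, -0.30],
     "E12_E13_change": [0.05, 0.20]},
).set_index("output_prefix")

# Keep only the change columns for the heatmap:
change = df.filter(regex="_change$")
```

`change` can then go straight into heatmap.2 in R or seaborn.clustermap in Python.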

Hope this is of help to you!

Best Mette

rdbcasillas commented 4 years ago

If you could share the ENCODE accessions for the time series data, that would be really helpful

Sure, they are over here.

In the meantime, you might have a look at the columns in bindetect_results.txt named "*_change".

Thank you for suggesting this, will check this out.

I do want to mention that I created the merged peak file that goes into BINDetect, by simply concatenating all the peak files together. Is this the right way to create this file?

Regards, Vatsal

msbentsen commented 4 years ago

Thanks for sharing the link to the data, Vatsal! I will have a look at it as soon as possible.

And regarding the peak file, you are completely right to concatenate the peak files together for BINDetect. However, you have to make sure that the input footprints were also calculated from the same concatenated peak file. Even though the footprint scores will be close to 0 for many regions, it is important to include all regions in the calculation.
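For illustration, building such a merged peak set boils down to concatenating the per-condition BED intervals and merging the overlaps (a minimal sketch of the idea; the coordinates are made up, and in practice a tool like bedtools merge does the same):

```python
def merge_peaks(peaks):
    """peaks: iterable of (chrom, start, end); returns merged intervals."""
    merged = []
    for chrom, start, end in sorted(peaks):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(p) for p in merged]

# Concatenate the peaks of all conditions, then merge:
e11_peaks = [("chr1", 100, 200), ("chr1", 400, 500)]
e12_peaks = [("chr1", 150, 300), ("chr2", 10, 50)]
master = merge_peaks(e11_peaks + e12_peaks)
```

The resulting master set is what both the footprint scoring and BINDetect should run on.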

msbentsen commented 4 years ago

Hi again Vatsal,

I had a look at the data, but I cannot reproduce the low scores of E11-E12. For my example, I ran BINDetect with 356 motifs from HOCOMOCO and created the heatmap from the mean scores, as seen here: [heatmap image: forebrain]

Here, there are cases of both early and late factors, as well as binding specifically for certain time points.

As I mentioned above, I think what happened in your run is that the footprinting scores were only calculated for the per-condition peaks, not the concatenated peak set. Thereby, a lot of the scores pulled from the concatenated peak set were 0 in each condition, which seems to break the normalization. This should have been clearer in the documentation, and I will try to add it in the next update.

I will close this issue, but feel free to reopen if you have any follow-up questions!

rdbcasillas commented 4 years ago

Hi Mette,

Thank you so much for investing time in this and helping out.

This is interesting. I am really curious about a few things and would appreciate any help:

1) Would it be possible for you to share the raw file that you used to create this heatmap? That would help me verify the numbers on my end. Here's the gist with the file that went into creating my heatmap, after concatenating the peak files for footprinting as you suggested.

2) Which package are you using for clustering and heatmap generation? I doubt it is the cause of the difference, but it would be good to check. I am using gplots' heatmap.2.

3) TOBIAS BINDetect also has a flag, "--peak-header". I haven't used this flag so far, and I wonder whether it can have any effect.

Thanks a lot again!

Vatsal

msbentsen commented 4 years ago

Hi Vatsal,

  1. You're welcome! Sure, the data I used for my heatmap is here (the data is before z-score normalization).

  2. I used seaborn.clustermap in Python, which does the z-score normalization and clustering internally.

  3. The --peak-header has nothing to do with the calculation of the scores; it is used to give the header of the peak .bed file. There is a good example in the test_data here. Adding a file via --peak-header header.txt will give the output files of transcription factor binding sites the correct header - otherwise, any additional columns are named "additional_1", "additional_2", etc. I always use it to carry annotation from the peaks into the individual binding-site files!
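To illustrate what the header file does (the header.txt layout and the "gene_annotation" column below are just assumed examples):

```python
from io import StringIO
import pandas as pd

# Assumed contents of header.txt: whitespace-separated column names
# matching the columns of the peak .bed file.
header_txt = "chrom\tstart\tend\tgene_annotation\n"
peaks_bed = "chr1\t100\t200\tSox2\nchr1\t400\t500\tPax6\n"

# With the header, the extra 4th column keeps its meaningful name
# instead of being labelled "additional_1":
names = header_txt.split()
peaks = pd.read_csv(StringIO(peaks_bed), sep="\t", names=names)
```

That named annotation column is then carried through into the per-TF binding-site files.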

Best regards, Mette

rdbcasillas commented 4 years ago

Hello Mette

Thank you for answering those questions, and apologies for the delayed response; I was away from work. One thing I forgot to ask last time was about the peak-calling software: which one did you use to generate the peaks? And did you simply merge the peaks from the different replicates together? I wonder if differences in peak-generation strategy may be causing changes in TOBIAS footprint detection (and hence the mean scores).

Thank you, Vatsal

msbentsen commented 4 years ago

Hi Vatsal,

Sure, I used MACS2 for peak calling (parameters "--nomodel --shift -100 --extsize 200 --broad") and merged all peaks across all replicates and time points to create a merged "master" peak file.

I think the most important thing here is that you also calculate all corrected Tn5 signals and footprinting scores from this merged peak file, not from the single-replicate peaks. That ensures a proper comparison across all conditions.

Best Mette