loosolab / TOBIAS

Transcription factor Occupancy prediction By Investigation of ATAC-seq Signal
MIT License
191 stars 41 forks source link

False Positive Footprints on Genomic DNA? #74

Closed Jome0169 closed 3 years ago

Jome0169 commented 3 years ago

Hello,

I'm interested in using TOBIAS on a recent dataset I have, and I have some questions about using genomic DNA as a potential control to identify false positive footprints.

A bit of background - I'm working on plant species, and working on root tissue. Here labeled as "crown root". Genomic DNA is simply genomic DNA from the same species treated with Tn5 which we should not be able to call footprints on,

In order to identify binding sites which could be potential false positives I called peaks on the crown root ATAC-seq library. I then fed these peaks to TOBIAS, and identified and scored footprints on both the genomic DNA, as well as the actual crown root library. Upon doing so, what I saw was somewhat intriguing, and I'm curious what to make of these results.

Firstly we're able to identify footprints in genomics DNA. The total number of bound TFs from genomic DNA is 186094 and the total from the crown root dataset is 280507. The associated values can be found below in the table:

Library Total TF Binding Sites Number of Bound TFs Average Mean Score Proportion of Total bound
Input 5555098 186094 2.24421 0.03349968
Crown Root 5555098 280507 1.26299 0.05049542

For whatever reason the mean score associated with each transcription factor is higher in the input libraries than it is the real tissue. Why might this be? Is it due to the scores simply going to the mean of the cut sites regions in the background? And if so - does that mean my blacklist isn't comprehensive enough? Or can the scores here be seen as the "worst case scenario"?

score_bound_hist

The overlap between these two footprint datasets is also quite low. Below is a histogram relating the proportion of bound TFs which overlap crown root are also present in the genomic DNA input library. Overall it's around 7% for the mean number of motifs found to be bound in both sample. Should I go ahead and remove any of the genomic input footprints I'm able to call for all future analysis?

crown_input_gDNA_histogram

Finally the aggregate footprint figures do look convincing. For example here is the aggregate output from one transcription factor

At4g38000_C2C2dof_tnt At4g38000_col_a_m1_footprint

And the same transcription factor from the genomic input.

Input At4g38000_C2C2dof_tnt At4g38000_col_a_m1_footprint

I guess I'm uncertain what to make of these results, and am curious what the authors would recommend. I guess my questions boil down to the following:

1) Why are we able to call footprints on genomic DNA in Tobias? Is this something to do with the way footprints are scored? And if so, how might I take this into consideration in future analysis? Increasing the stringency? Or the False posotive rate?

2) Should i be using genomic DNA as a control for all future experiments? For instance, should I just go forward and remove all overlapping motifs which we are able to call footprints on using input data? And should this extend to all future analysis? If a footprint can be identified in a specific motif using input, should it be masked for all future analysis and enter the "blacklist"?

Thanks for any input you might have. I really like the software, and think it's super useful.

msbentsen commented 3 years ago

Hi,

Thank you for reaching out and for the plots! Interesting project, and I really like that you did genomic DNA as well.

Were the 'crown_root' and 'genomic_input' run together or in two different BINDetect-runs? (I would probably run them together). I am just wondering what peaks you used. My impression is that you should not be able to call ATAC-seq peaks from genomic input, as there should be no proteins/nucleosomes to disturb Tn5 cutting. Therefore, the reads should be more spread across the whole genome - is this the case?

As specific answers:

  1. TOBIAS tries to normalize the two conditions against each other, and this might fail for conditions very different from each other. I suspect the 'genomic_input' is bumped artificially high in this case, because TOBIAS assumes that the reads-in-peaks ratio should be the same as for 'crown_root'. The footprints can still come out as "bound" just due to the level of accessibility, so that might be what is happening. This is unfortunately an inherent feature of TOBIAS.

  2. As you say, you might consider removing the "false positive" footprints from the data. But I would also first try to open those in a genome browser (e.g. IGV) to see what is going on. Sometimes it is very clear if the regions are repeats/mapping error/etc., and should be removed. Otherwise, I think the genomic DNA is a great control in your footprint visualization. So you might consider not using it for actual footprint calling (because you know there should not be any), but for the visualization as a "proof of footprint".

Let me know about the peaks and then we can discuss how to best integrate the genomic DNA data.

Jome0169 commented 3 years ago

Thanks for the response. This was very illuminating as to what might be happening. To answer your questions the data I'm showing is from two separate runs of BINDetect. And as for the peaks I used - I used the single set of peaks present in the crown root dataset. This is because, as you stated, I am unable to call peaks on input genomic DNA as the integration events are random.

I went ahead and repeated the analysis using BINDetect on the input and crown root dataset together. Below are the results.

Here are the output PDFs: Screen Shot 2021-06-09 at 10 52 20 AM

Screen Shot 2021-06-09 at 10 53 06 AM

Your explanation for the TOBIAS bumping up the input library would make sense to the patterns we're seeing. And it's good to see that after normalization of the scores - the number of zero scores in the fake input data set shoots up. But still - there persist a decent number of footprints with high scores.

Overall there are more bound TFs in the crown root library than the input library. But there are still a substantial number of "bound" sites in the input library. number_bound_tfs

Additionally scores in the input library are on average slightly lower than the crown root libraries as well as the proportion of each transcription factor bound being higher in the crown root library.

Mean_tf_score Prop_tfs_binding

You mention in the documentation that the change score could be illuminating. Here it is plotted below: Crown_root_change

So it looks like this negative change score generally indicates a the second condition (here Crown root) has more actual binding sites. Which does make biological sense. Is that the correct interpretation here?

Finally - I checked the intersection between the CrownRoot and the input libraries to see if any regions are called to be "bound" in both. And it appears there are only a handful for each TF.

Overlapping

Finally - I agree with your point about visualization. But rather, at this juncture I'm more concerned about controlling for erroneous footprints in my data. So - overall I'm wondering what I can say about false positives in regards to footprinting and input genomic DNA? I think it's obvious that I need to discard any footprint which is able to identified in both the crown root dataset - as well as the input libraries. But is there a way to leverage the genomic DNA in this way to increase my certainty about the footprints output from TOBIAS?

Thanks again for your help.

msbentsen commented 3 years ago

Hi,

Thank you for this update! Indeed, as you see in the normalization, BINDetect will always try to push the scores into the same ranges. So in this case, although you know that your genomic DNA should have less detectable footprints, the scores will still be pushed into the same range; thus calling footprints based on accessibility. It is therefore hard to use the numbers of bound TFs to compare the two conditions.

One way to try to control this might be to look into the "_overview.txt" files per TF (located in <bindetect_output>/<TF>/<TF>_overview.txt/xlsx). If you run 'CrownRoot' and 'Fake' together, you will have a column with "CrownRoot_Fake_log2fc", which gives you the log2 foldchange of footprint scores of each condition. You can use this to filter the individual sites for only those, which have higher binding in CrownRoot.

I think this is a bit of a "hacky" way to do it, but given the assumption that sites with high footprint values in Fake are somehow false-positives/blacklisted sites, these should show up as having equal/higher values than CrownRoot.

Jome0169 commented 3 years ago

Hello,

Thanks for all you help. After doing further analysis it would appear that you were correct in saying that a substanial issue I was missing was mappability differences between the two datasets. I forgot the input dataset consisted of 35 BP reads, so TOBIAS was identifying lack of integration events simply due to lack of data.

Additionally when looking at the footprint depth metrics of Bound/Unbound regions before and after TOBIAS correction, it appears that the genomic DNA scores are being substantially lowered. Indicating that TOBIAS is correcting for at least some of the issues caused by using input genomic DNA.

image

Thanks so much for your help. I feel confident now about what may be happening.