Interesting pattern in FP score distribution

loosolab / TOBIAS

Transcription factor Occupancy prediction By Investigation of ATAC-seq Signal

MIT License


Interesting pattern in FP score distribution #190

Closed. maxdudek closed this issue 1 year ago

maxdudek commented 1 year ago

Hi,

I have been using TOBIAS as part of my research project, and it's been a great tool.

I ran TOBIAS ATACorrect and ScoreBigwig on about 50 BAM files and plotted the footprint (FP) score distribution (at a random-ish subset of sites) for each sample.

I'm very curious about the tri-modal pattern that appears in most of these samples, and wonder if you can think of any explanation for it. The peaks occur below the binding threshold, so I'm not concerned about this affecting important results, but it would be nice to know what's happening.
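For anyone wanting to look at a similar distribution, here is a minimal sketch of subsampling per-site scores and binning them into a histogram. It assumes the FP scores have already been extracted into a plain list of floats; the function names `subsample` and `histogram` are hypothetical helpers, not part of TOBIAS.

```python
import random
from collections import Counter


def subsample(scores, n, seed=0):
    """Return up to n scores drawn without replacement (seeded for repeatability)."""
    scores = list(scores)
    if n >= len(scores):
        return scores
    return random.Random(seed).sample(scores, n)


def histogram(scores, bin_width):
    """Count scores per bin; keys are the lower edge of each bin."""
    counts = Counter(int(s // bin_width) for s in scores)
    return {b * bin_width: counts[b] for b in sorted(counts)}


# Toy values only, not real TOBIAS output.
toy_scores = [0.1, 0.4, 0.6, 1.2]
print(histogram(subsample(toy_scores, 4), bin_width=0.5))
```

Several local maxima in the bin counts would correspond to the multi-modal shape described above.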

We've tried to separate these sites by the following properties, in order to determine whether they are associated with the different distribution peaks:

Length and read depth of the ATAC-seq peaks they belong to
The footprint and background widths that TOBIAS selects when calculating FP score (w_f and w_b)

If you have any thoughts on why this happens, it would be greatly appreciated!

msbentsen commented 1 year ago

Hi,

Without knowing the biological background of the data, my initial thought is that it has something to do with the distribution of open chromatin regions in each .bam file. Did you use the snakemake pipeline? If so, the input peak file is a merge of all regions from all 50 .bam files (which is intended!). But that also means that not all regions are actually classified as "open" in every .bam file. So what you might be seeing here are low-scoring regions within peaks that are not open in the given biological condition.

These sites still receive footprint scores, since they are in the input peaks file, but they are too low to be assigned as bound. Why it is then trimodal and not bimodal... good question :-) But the first large pointy peak in each distribution is definitely "sites which were scored in this condition but are not actually open".
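One way to check this idea is to split the scored sites by whether they fall inside a peak called on that sample alone, then compare the two score distributions. Below is a rough sketch under stated assumptions: peaks are given as (chrom, start, end) tuples with half-open BED-style coordinates, sites as (chrom, pos, score) tuples, and all function names are hypothetical rather than part of TOBIAS.

```python
from bisect import bisect_right
from collections import defaultdict


def build_peak_index(peaks):
    """Merge overlapping (chrom, start, end) peaks and index them per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = []
        for start, end in intervals:
            if merged and start <= merged[-1][1]:
                # Overlaps or touches the previous interval: extend it.
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def is_open(index, chrom, pos):
    """True if pos lies inside a half-open [start, end) peak on chrom."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    i = bisect_right(starts, pos) - 1
    return i >= 0 and pos < ends[i]


def split_scores(sites, index):
    """Split (chrom, pos, score) sites into scores inside and outside sample peaks."""
    inside, outside = [], []
    for chrom, pos, score in sites:
        (inside if is_open(index, chrom, pos) else outside).append(score)
    return inside, outside


# Toy data only: peaks called on one sample, sites scored from the merged peak set.
sample_peaks = [("chr1", 100, 200), ("chr1", 150, 300)]
sites = [("chr1", 120, 0.9), ("chr1", 500, 0.05), ("chr2", 10, 0.02)]
inside, outside = split_scores(sites, build_peak_index(sample_peaks))
print(inside, outside)
```

If the explanation holds, the scores in `outside` should concentrate in the lowest mode of the distribution.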

maxdudek commented 1 year ago

Ahh, thank you, that's very helpful! I did not use the snakemake pipeline, but I have been using the same peak file for all samples, which is a merge of the peaks from all of the bams. Makes sense! I may want to rerun with sample-specific input peaks and see what happens then.

msbentsen commented 1 year ago

In general, you want to use .bed files containing the peaks from all conditions, as this enables finding the "differential" peaks. If you only have peaks from one condition, the final volcano plots might be strongly skewed in one direction. But for testing, yes, it might be interesting to try it with the sample-specific input!