Open j-andrews7 opened 1 year ago
Super basic starting point for self-implementation of this:
import pysam
import numpy as np
from scipy.signal import argrelextrema
def find_local_minima(bam_file, chromosome, start, end):
"""
Find local minima (valleys) between high signal regions for a specified range in a BAM file.
:param bam_file: path to the input BAM file
:type bam_file: str
:param chromosome: chromosome identifier (e.g., 'chr1')
:type chromosome: str
:param start: start position of the specified range (0-based)
:type start: int
:param end: end position of the specified range (0-based)
:type end: int
:return: list of local minima positions in the specified range
:rtype: list of int
"""
# Read the BAM file
alignment_file = pysam.AlignmentFile(bam_file, "rb")
# Calculate read coverage
coverage = np.zeros(end - start + 1)
for read in alignment_file.fetch(chromosome, start, end):
read_start = max(read.reference_start, start)
read_end = min(read.reference_end, end)
coverage[read_start - start : read_end - start] += 1
# Find local minima (valleys)
local_minima = argrelextrema(coverage, np.less)
# Convert local minima positions to the original coordinate system
minima_positions = [start + minima for minima in local_minima[0]]
return minima_positions
bam_file = "path/to/your/bamfile.bam"
chromosome = "chr1"
start = 1000
end = 2000
local_minima = find_local_minima(bam_file, chromosome, start, end)
print("Local minima (valleys) positions:")
for minima in local_minima:
print(minima)
The test would be to see how well these overlap with ATAC peaks when calculated based only on the H3K27ac data in SE regions. May need to add extra bases (+/- 100 bp) on each side of designated points.
The idea is that nucleosome free regions are the ones that actually get bound by TFs, so identifying them and limiting to them for motif scanning improves accuracy. See this paper or this one for references that this method makes sense and improves things. Our initial testing just limited to ATAC peaks, which is a more straightforward way to get this info. However, many datasets will not have that modality, so an alternative approach using just the H3K27ac is useful.
Generally, these will be easier to find in MNase-treated samples (CUT&RUN, etc), but you can also do so on typical ChIP samples as well. This would be a replacement for the
bamliquidator
dependency and internal valley scoring/calling done in the CRC tool.Implementation options:
nfr
flag in HOMER - this is also a simple approach we could try to implement ourselves on the BAM file(s) in another CRCminer command, likely using pysam to manipulate the BAM and get depth at peak regions, then try to find local minima in each region.