stjude-biohackathon / CRCminer

MIT License
2 stars 1 forks source link

Consider methods for valley/subpeak finding from H3K27ac signal data. #18

Open j-andrews7 opened 1 year ago

j-andrews7 commented 1 year ago

The idea is that nucleosome free regions are the ones that actually get bound by TFs, so identifying them and limiting to them for motif scanning improves accuracy. See this paper or this one for references that this method makes sense and improves things. Our initial testing just limited to ATAC peaks, which is a more straightforward way to get this info. However, many datasets will not have that modality, so an alternative approach using just the H3K27ac is useful.

Generally, these will be easier to find in MNase-treated samples (CUT&RUN, etc), but you can also do so on typical ChIP samples as well. This would be a replacement for the bamliquidator dependency and internal valley scoring/calling done in the CRC tool.

Implementation options:

j-andrews7 commented 1 year ago

Super basic starting point for self-implementation of this:

import pysam
import numpy as np
from scipy.signal import argrelextrema

def find_local_minima(bam_file, chromosome, start, end):
    """
    Find local minima (valleys) between high signal regions for a specified range in a BAM file.

    :param bam_file: path to the input BAM file
    :type bam_file: str
    :param chromosome: chromosome identifier (e.g., 'chr1')
    :type chromosome: str
    :param start: start position of the specified range (0-based)
    :type start: int
    :param end: end position of the specified range (0-based)
    :type end: int
    :return: list of local minima positions in the specified range
    :rtype: list of int
    """
    # Read the BAM file
    alignment_file = pysam.AlignmentFile(bam_file, "rb")

    # Calculate read coverage
    coverage = np.zeros(end - start + 1)
    for read in alignment_file.fetch(chromosome, start, end):
        read_start = max(read.reference_start, start)
        read_end = min(read.reference_end, end)
        coverage[read_start - start : read_end - start] += 1

    # Find local minima (valleys)
    local_minima = argrelextrema(coverage, np.less)

    # Convert local minima positions to the original coordinate system
    minima_positions = [start + minima for minima in local_minima[0]]

    return minima_positions

bam_file = "path/to/your/bamfile.bam"
chromosome = "chr1"
start = 1000
end = 2000

local_minima = find_local_minima(bam_file, chromosome, start, end)

print("Local minima (valleys) positions:")
for minima in local_minima:
    print(minima)

The test would be to see how well these overlap with ATAC peaks when calculated based only on the H3K27ac data in SE regions. May need to add extra bases (+/- 100 bp) on each side of designated points.