open2c / cooltools

The tools for your .cool's
MIT License
132 stars 50 forks source link

Issue with scoring_and_histogramming_step: ValueError: There are la_exp.donut.value in (8192.0, inf], please check the histogram #511

Open satvikgunjala opened 5 months ago

satvikgunjala commented 5 months ago

Hello, I am relatively new to Hi-C analysis. I am currently attempting to use the dots() function as utilized in https://github.com/open2c/open2c_examples/blob/master/dots.ipynb to identify enriched points in the Hi-C contact matrix. For most of the .mcool contact matricies I am using, I am able to obtain the dataframe. However, for certain .mcool files, I encountered this error:

Screenshot 2024-03-28 at 4 07 52 AM

Could someone explain what this error means and how do I fix it or get around it?

gfudenberg commented 5 months ago

it looks like there might be an issue with the dataframe of expected passed to dots() (e.g. 0s at some distances due to an overly sparse dataset). If you share the function call that may provide some more insight

satvikgunjala commented 5 months ago

This is the code that I used based on the example from https://github.com/open2c/open2c_examples/blob/master/dots.ipynb:


import pandas as pd
import numpy as np
from itertools import chain
import os

# Hi-C utilities imports: 
import cooler
import bioframe
import cooltools
from cooltools.lib.numutils import fill_diag
from packaging import version
if version.parse(cooltools.__version__) < version.parse('0.5.2'):
    raise AssertionError("tutorials rely on cooltools version 0.5.2 or higher,"+
                         "please check your cooltools version and update to the latest")

# Visualization imports: 
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
import matplotlib.patches as patches
from matplotlib.ticker import EngFormatter

# helper functions for plotting
bp_formatter = EngFormatter('b')
def format_ticks(ax, x=True, y=True, rotate=True):
    """format ticks with genomic coordinates as human readable"""
    if y:
        ax.yaxis.set_major_formatter(bp_formatter)
    if x:
        ax.xaxis.set_major_formatter(bp_formatter)
        ax.xaxis.tick_bottom()
    if rotate:
        ax.tick_params(axis='x',rotation=45)

# define genomic view that will be used to call dots and pre-compute expected

# Use bioframe to fetch the genomic features from the UCSC.
hg38_chromsizes = bioframe.fetch_chromsizes('hg38')
hg38_cens = bioframe.fetch_centromeres('hg38')
hg38_arms = bioframe.make_chromarms(hg38_chromsizes, hg38_cens)

# Set directory where matricies are located
data_dir = './data/Matricies'

matricies = os.listdir(data_dir)

mcools = []
for x in matricies:
    if x.endswith(".mcool"):
        mcools.append(x)

# 10 kb is a resolution at which one can clearly see "dots":
binsize = 10_000

for y in mcools:

    # Open cool file with Micro-C data:
    clr = cooler.Cooler(f'{data_dir}/{y}::/resolutions/{binsize}')

    # Select only chromosomes that are present in the cooler. 
    hg38_arms = hg38_arms.set_index("chrom").loc[clr.chromnames].reset_index()

    # intra-arm expected
    expected = cooltools.expected_cis(
        clr,
        view_df=hg38_arms,
        nproc=4,
    )

    print(f'Processing: {y}')

    dots_df = cooltools.dots(
        clr,
        expected=expected,
        view_df=hg38_arms,
        # how far from the main diagonal to call dots:
        max_loci_separation=10_000_000,
        nproc=4,
    )

    y_name = y.replace(".mcool","")
    dots_df.to_csv(r'./data/Loops/'+f'{y_name}.csv')
gfudenberg commented 5 months ago

@satvikgunjala if you run cooltools.expected_cis with kwarg clr_weight_name=None can you send the resulting plot? It would be useful to know at what distance does the average number of counts drop below 1.

(for reference about contacts vs distance you can check out: https://cooltools.readthedocs.io/en/latest/notebooks/contacts_vs_distance.html#Impact-of-sequencing-depth-on-computed-P(s))