open2c / cooltools

The tools for your .cool's
MIT License
135 stars 51 forks source link

Issue with cooltools.api.snipping.pileup() compatibility with bioframe/core/construction.py", line 150 #462

Open haleybianchi opened 1 year ago

haleybianchi commented 1 year ago

Hi There,

I am attempting to call cooltools.api.snipping.pileup(clr,features_df=peak_df , flank=300_000), where :

peak_df = pd.read_csv(f"ctcf.enhancer.promoters/{peak}.bed", sep="\t", names=['chrom', "start", "end"]), as per the cooltools feature_df requirements. I am not passing in a view_df, but when the API attempts to construct one, I get an error : File "conda/envs/matrix/lib/python3.8/site-packages/bioframe/core/construction.py", line 229, in make_viewframe view_df = from_any(regions, name_col=view_name_col, cols=cols) File "conda/envs/matrix/lib/python3.8/site-packages/bioframe/core/construction.py", line 150, in from_any raise ValueError(f"Unknown input format: {type(regions)}") ValueError: Unknown input format: <class 'NoneType'>

gfudenberg commented 1 year ago

not quite clear what is going on here-- can you: 1) print peak_df 2) print clr.info 3) see if the snipping tutorial notebook works with your environment

thanks!

haleybianchi commented 1 year ago

Hi, I was able to get the pileup to work after making a view_df with cooltools on genomic elements that are not CTCF, but when I do use CTCF, I always get the error

IndexError                                Traceback (most recent call last)
/tmp/ipykernel_3875287/3943304134.py in <module>
      2 dnase_overlap= overlap_dnase(ctcf)
      3 sides = 300_000
----> 4 stack= cooltools.pileup(clr,ctcf,view_df=view, flank=sides) #Expected output in stack should follow [xPos, yPox, nSites], while pileup() returns [nSites, xPos, yPox].
      5 # Aggregate. Note that some pixels might be converted to NaNs after IC, thus we aggregate by nanmean:
      6 mtx = np.nanmean(stack, axis=2)

/data/bianchiah/mymamba/envs/mambamatrix/lib/python3.7/site-packages/cooltools/api/snipping.py in pileup(clr, features_df, view_df, expected_df, expected_value_col, flank, min_diag, clr_weight_name, nproc)
    819     else:
    820         mymap = map
--> 821     stack = pileup_legacy(features_df, snipper.select, snipper.snip, map=mymap)
    822     if feature_type == "bed":
    823         stack = np.nansum([stack, np.transpose(stack, axes=(1, 0, 2))], axis=0)

/data/bianchiah/mymamba/envs/mambamatrix/lib/python3.7/site-packages/cooltools/api/snipping.py in pileup_legacy(features, data_select, data_snip, map)
    243             partial(_pileup, data_select, data_snip),
    244             # Note that unannotated regions will form a separate group
--> 245             features.groupby("region", sort=False),
    246         )
    247     )

/data/bianchiah/mymamba/envs/mambamatrix/lib/python3.7/site-packages/cooltools/api/snipping.py in _pileup(data_select, data_snip, arg)
    199 
    200     data = data_select(region1, region2)
--> 201     stack = list(map(partial(data_snip, data, region1, region2), zip(s1, e1, s2, e2)))
    202 
    203     return np.dstack(stack), feature_group["_rank"].values

/data/bianchiah/mymamba/envs/mambamatrix/lib/python3.7/site-packages/cooltools/api/snipping.py in snip(self, matrix, region1, region2, tup)
    413         if self.min_diag is not None:
    414             D = self.diag_indicators[region1][lo1:hi1, lo2:hi2] < self.min_diag
--> 415             snippet[D] = np.nan
    416         return snippet
    417 

IndexError: boolean index did not match indexed array along dimension 0; dimension is 61 but corresponding boolean dimension is 2

I tried clustering the CTCF sites by the resolution, but the error persisted

gfudenberg commented 1 year ago

views are generally used for specifying things like chromosomes or chromosome arms-- see cell [7] of the tutorial: https://cooltools.readthedocs.io/en/latest/notebooks/pileup_CTCF.html

its not clear what your ctcf data frame is: does it have chrom, start, end, as in the tutorial?

haleybianchi commented 12 months ago

Yes, it does. for the view: view= cooltools.lib.common.make_cooler_view(clr, ucsc_names=False) #Generate a full chromosome viewframe using cooler’s chromsizes for the clr.info: {'bin-size': 5000, 'bin-type': 'fixed', 'creation-date': '2020-08-16T14:48:11.708578', 'format': 'HDF5::Cooler', 'format-url': 'https://github.com/mirnylab/cooler', 'format-version': 3, 'generated-by': 'cooler-0.8.5', 'genome-assembly': 'unknown', 'metadata': {}, 'nbins': 545118, 'nchroms': 22, 'nnz': 156203230, 'storage-mode': 'symmetric-upper', 'sum': 572517229} I tried using another bedfile I generated from deeptools multibigwig summary using 5kb bins, and I am getting a similar error: `--------------------------------------------------------------------------- IndexError Traceback (most recent call last) c:\Users\BIANCH~1\AppData\Local\Temp\scp14986\vf\users\SenLab\shared\Haley\EPinteractions\scripts\transcription.length.ipynb Cell 33 line 1 13 top[["chrom", "start", "end"]] 15 sides = 300_000 ---> 16 stack= cooltools.pileup(clr=clr,features_df=top,view_df=view, flank=sides) 17 # make_matrix(top, "Top 5% Regions with Bidirectional Nascent RNA") 18 # make_matrix(bottom, "Bottom 5% Regions Bidirectional Nascent RNA") 19 # pos_only = binned_rna[binned_rna.strand_neg == 0] 20 # make_matrix(pos_only, "Regions with only Positive Strand Nascent RNA")

File /data/bianchiah/mymamba/envs/mambamatrix2/lib/python3.8/site-packages/cooltools/api/snipping.py:986, in pileup(clr, features_df, view_df, expected_df, expected_value_col, flank, min_diag, clr_weight_name, nproc) 984 else: 985 mymap = map --> 986 stack = _pileup(features_df, snipper.select, snipper.snip, map=mymap) 987 if feature_type == "bed": 988 stack = np.nansum([stack, np.transpose(stack, axes=(1, 0, 2))], axis=0)

File /data/bianchiah/mymamba/envs/mambamatrix2/lib/python3.8/site-packages/cooltools/api/snipping.py:275, in _pileup(features, data_select, data_snip, map) 271 features["_rank"] = range(len(features)) 273 # cumul_stack = [] 274 # orig_rank = [] --> 275 cumul_stack, orig_rank = zip( 276 *map( 277 partial(_extract_stack, data_select, data_snip), ... 463 D = self.diag_indicators[region1][lo1:hi1, lo2:hi2] < self.min_diag --> 464 snippet[D] = np.nan 465 return snippet

IndexError: boolean index did not match indexed array along dimension 0; dimension is 121 but corresponding boolean dimension is 4 Output is truncated. View as a scrollable element or open in a text editor. Adjust cell output settings...`

gfudenberg commented 8 months ago

I wonder if some of your snips are going out of bounds-- perhaps you can check if all of your sites are greater than the flank distance from chromosome starts/ends?