open2c / cooltools

The tools for your .cool's
MIT License

Feature suggestion: restoring NaNs to sparse matrix-derived array #370

Open efriman opened 2 years ago

efriman commented 2 years ago

Thanks for your great (and cool) tools!

A convenience feature would be a function that restores NaNs to an array derived from a sparse matrix, using the balance weights of the cooler, similar to what happens in snipping.CoolerSnipper.select.
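To make the request concrete, here is a minimal sketch of what such a helper could look like. The function name `restore_nans` and the toy data are hypothetical, not part of cooltools or cooler. It assumes the balancing weights are available as 1D arrays aligned with the rows and columns of the array, with NaN marking bad bins (with a real cooler they would presumably come from the `weight` column of the bin table).

```python
import numpy as np
from scipy import sparse


def restore_nans(dense, row_weights, col_weights):
    """Hypothetical helper: mark rows/columns of bad bins as NaN.

    A bin is treated as "bad" when its balancing weight is NaN.
    """
    out = np.array(dense, dtype=float)  # float copy, input stays untouched
    out[np.isnan(row_weights), :] = np.nan
    out[:, np.isnan(col_weights)] = np.nan
    return out


# Toy 4x4 example: bin 1 has no balancing weight (a "bad" bin).
weights = np.array([1.0, np.nan, 0.5, 2.0])
coo = sparse.coo_matrix(
    ([4.0, 2.0, 1.0], ([0, 2, 3], [0, 3, 2])), shape=(4, 4)
)

dense = coo.toarray()  # zeros everywhere the sparse matrix has no entry
restored = restore_nans(dense, weights, weights)
```

After this, row 1 and column 1 of `restored` are NaN, while the remaining zeros stay genuine zeros.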

Elias

sergpolly commented 7 months ago

@efriman could you please describe your use case? How does this NaN restoration problem arise for you?

I also realized, while working on the obs/exp fetcher (#486) and looking into cooler's data-query code, that this NaN restoration has to be done as an extra step.

efriman commented 7 months ago

Hi @sergpolly. It's been so long I can hardly remember why I asked for this! But I think it was related to what's happening here: https://github.com/open2c/coolpuppy/blob/c27f509d2474a9fa92217f28b27312ce5ae47c96/coolpuppy/coolpup.py#L1084 and then used in line 1125.

Basically, we're getting the sparse matrix for the region (whole chromosome(s)) and picking out a section of interest, but we lose the NaNs along the way, so we need this kind of convoluted operation to restore them. Hope that makes sense.

sergpolly commented 7 months ago

I see, understood. So it is exactly the same "sparse -> dense data fetching" issue. So far this exact pattern happens in (afaik):

To me this sounds like a special subclass of a sparse matrix with an additional property, "balancing weights" or a "bad_bins mask", something like that, whose to_dense() or to_array() methods are modified to apply these masks/weights on the fly. Not sure it is worth the effort to generalize it this way.
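A rough sketch of that idea, for illustration only. The class name `MaskedSparse` and its `to_dense()` method are hypothetical, and it wraps a scipy COO matrix by composition rather than subclassing it, which is a simplification of what is described above.

```python
import numpy as np
from scipy import sparse


class MaskedSparse:
    """Hypothetical wrapper: a sparse matrix plus boolean bad-bin masks."""

    def __init__(self, matrix, bad_rows, bad_cols=None):
        self.matrix = sparse.coo_matrix(matrix)
        self.bad_rows = np.asarray(bad_rows, dtype=bool)
        # Default to a symmetric mask, as for a square Hi-C matrix.
        self.bad_cols = (
            self.bad_rows if bad_cols is None
            else np.asarray(bad_cols, dtype=bool)
        )
        if self.matrix.shape != (self.bad_rows.size, self.bad_cols.size):
            raise ValueError("mask lengths must match the matrix shape")

    def to_dense(self):
        # The masks are applied only at densification time, so the
        # stored sparse data never contains explicit NaNs.
        out = self.matrix.toarray().astype(float)
        out[self.bad_rows, :] = np.nan
        out[:, self.bad_cols] = np.nan
        return out


masked = MaskedSparse(
    sparse.coo_matrix(([3.0, 5.0], ([0, 2], [2, 0])), shape=(3, 3)),
    bad_rows=[False, True, False],
)
dense_with_nans = masked.to_dense()
```

The stored matrix still holds only two entries; the NaNs for bin 1 appear only in the dense output.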

There is one idea I think we can reject right away: "shoving" those NaNs directly into the sparse matrix (not sure if scipy coo even supports NaNs). It sounds counter-productive: bad bins occupy entire rows/columns of Hi-C derived matrices, and filling them explicitly would "kill" sparsity and efficiency.
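A small toy check of the sparsity concern, with made-up sizes and bad-bin positions. It relies on scipy COO matrices storing floating-point values as given, NaN included, and shows how the number of stored entries grows once whole rows and columns are filled explicitly.

```python
import numpy as np
from scipy import sparse

n = 1000
# Toy "contact map": a single off-diagonal band, n - 1 stored entries.
rows = np.arange(n - 1)
cols = rows + 1
m = sparse.coo_matrix((np.ones(n - 1), (rows, cols)), shape=(n, n))

bad_bins = [10, 500]  # hypothetical bad bins

# Explicit NaN entries covering the full row and full column of each bad bin.
idx = np.arange(n)
extra_rows = np.concatenate(
    [np.concatenate([np.full(n, b), idx]) for b in bad_bins]
)
extra_cols = np.concatenate(
    [np.concatenate([idx, np.full(n, b)]) for b in bad_bins]
)
extra_vals = np.full(extra_rows.size, np.nan)

filled = sparse.coo_matrix(
    (
        np.concatenate([m.data, extra_vals]),
        (np.concatenate([m.row, extra_rows]),
         np.concatenate([m.col, extra_cols])),
    ),
    shape=(n, n),
)

print(m.nnz, filled.nnz)  # stored entries before and after adding NaNs
```

With only two bad bins out of 1000, the stored entries already grow several-fold, and each further bad bin adds roughly 2n more.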

Overall, to me this sounds more like a cooler "problem", or a "problem" of whatever is doing the data fetching. I think this would require broader discussion to see if it is worth generalizing. Not sure what others think.