paulsengroup / hictkpy

Python bindings for hictk: read and write .cool and .hic files directly from Python
MIT License
9 stars 0 forks source link

`htk.fetch().to_numpy()` sometimes gives incorrect result when two regions don't overlap #73

Open i-pletenev opened 3 weeks ago

i-pletenev commented 3 weeks ago

Hi! In order to check consistency between cooler.Cooler().matrix().fetch() and htk.fetch().to_numpy() I sampled a few random regions from cooler_test_file.mcool and looked whether the result is identical. I found that in minority of cases the output differs even when two queried regions don't overlap. Here's an example:

In [1]: import cooler

In [2]: import hictkpy as htk

In [3]: htk.__version__ # same thing for version from github
Out[3]: '0.0.5'

In [4]: clr_path = "cooler_test_file.mcool"

In [5]: resolution = 100_000

In [6]: reg1, reg2 = 'chr2R:2200000-2700000', 'chr2R:23300000-23800000'

In [7]: clr_out = cooler.Cooler(clr_path + f"::resolutions/{resolution}").matrix(balance=False).fetch(reg1, reg2)

In [8]: clr_out
Out[8]: 
array([[ 8,  1,  5,  7,  3],
       [ 9,  6,  1,  5,  2],
       [ 1,  0,  2,  0,  0],
       [ 9,  4,  9, 10, 11],
       [ 5,  2,  9, 10,  8]], dtype=int32)

In [9]: htk_out = htk.File(clr_path, resolution).fetch(reg1, reg2, normalization='NONE').to_numpy()

In [10]: htk_out
Out[10]: 
array([[ 8,  1,  5,  7,  3],
       [ 9,  6,  1,  5,  2],
       [ 1,  1,  2,  0,  0],
       [ 9,  4,  9, 10, 11],
       [ 5,  2,  9, 10,  8]], dtype=int32)

In [11]: clr_out == htk_out
Out[11]: 
array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True, False,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]])

I decided to create a second issue, because there could be a different bug here than in #72

Once again, I checked it in both 0.0.5 version and 0.0.6.dev50+g6eca918 version from github.

robomics commented 3 weeks ago

Thanks for opening the issues!

I think the bug you are describing is the same I ran into yesterday while fuzzying hictk with the new fuzzer (see https://github.com/paulsengroup/hictk/pull/226 https://github.com/paulsengroup/hictk/pull/227 if you are curious).

If it is the same bug, then the bug should only occur for cis queries where the end coordinate of range2 is greater than the end coordinate of range1.

The bug is due to an error in the code that mirrors interactions from the upper-triangle into the lower triangle.

I will try to write a bugfix by the end of the week. Will ping you here once it is ready :)

Thanks again for opening the issues!

i-pletenev commented 3 weeks ago

Great, thank you for quick response!