open2c / cooltools

The tools for your .cool's
MIT License
138 stars 51 forks source link

Downsampling and cis counts #458

Open ddepierre opened 1 year ago

ddepierre commented 1 year ago

Hi, A part of this bug has already been reported here, I have the same issue on cooltools 0.5.4 version.

Also while writing this issue, I realized I had maybe too many questions for one post, sorry about that.

I have some cool files I want to downsample to a given cis contact count for all my replicates and different conditions to have the same number of cis contacts, so I can compare them (first of all, am I right on the usage of downsampling as a way to normalize sample coverage between conditions?)

1/ get the cis contact count

With cooltools.coverage() or with expected_cis()


cooltools.coverage(some_cool_file, store=True, ignore_diags=0)
some_cool_file.info['cis']
# >>> 309426839

cooltools.coverage(some_cool_file, store=True)
some_cool_file.info['cis']
# >>> 210709980 # because --ignore-diags = 2 in cooler balance 

np.sum(some_cool_file.bins()[:]['cov_cis_raw'])
# >>> 618853678
np.sum(some_cool_file.bins()[:]['cov_cis_raw'])/2
# >>> 309426839.0

expected_temp = cooltools.expected_cis(some_cool_file, clr_weight_name = None, ignore_diags=0, view_df=gnm_arms, chunksize=1000000, nproc = nproc, smooth=False, aggregate_smoothed=False)
np.sum(expected_temp['count.sum'])
# >>> 309426839.0

Counts doubled in cov_cis_raw as already reported here

2 / Downsampling

I have unbalanced cool file as


some_cool_file.bins()[:]
       chrom     start       end  cov_cis_raw  cov_tot_raw
0       chr1         0     10000            0            0
1       chr1     10000     20000            0            0
2       chr1     20000     30000            0            0
3       chr1     30000     40000            0            0
4       chr1     40000     50000            0            0
...      ...       ...       ...          ...          ...
272561  chrY  91720000  91730000            0            0
272562  chrY  91730000  91740000            0            0
272563  chrY  91740000  91744698            0            0
272564  chrM         0     10000          258         3197
272565  chrM     10000     16299          270         3362

[272566 rows x 5 columns]

That I want to downsample:

cooltools.sample(clr = some_cool_file, out_clr_path = dnsmpl_cool_file, cis_count = 250000000)

Traceback (most recent call last):
  File "/fs/gpfs06/lv05/fileset02/pool/pool-totipotency-software/hpcl67/cooltools/cooltools_0.5.4_venv/lib/python3.9/site-packages/cooltools/api/coverage.py", line 117, in coverage
    else clr._load_attrs(clr.root.rstrip("/") + "/bins/weight")["ignore_diags"]
  File "/fs/gpfs06/lv05/fileset02/pool/pool-totipotency-software/hpcl67/cooltools/cooltools_0.5.4_venv/lib/python3.9/site-packages/cooler/api.py", line 116, in _load_attrs
    return dict(grp[path].attrs)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/fs/gpfs06/lv05/fileset02/pool/pool-totipotency-software/hpcl67/cooltools/cooltools_0.5.4_venv/lib/python3.9/site-packages/h5py/_hl/group.py", line 357, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: "Unable to open object (object 'weight' doesn't exist)"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/fs/gpfs06/lv05/fileset02/pool/pool-totipotency-software/hpcl67/cooltools/cooltools_0.5.4_venv/lib/python3.9/site-packages/cooltools/api/sample.py", line 102, in sample
    cis_total = clr.info.get("cis", np.sum(coverage(clr)[0] // 2, dtype=int))
  File "/fs/gpfs06/lv05/fileset02/pool/pool-totipotency-software/hpcl67/cooltools/cooltools_0.5.4_venv/lib/python3.9/site-packages/cooltools/api/coverage.py", line 120, in coverage
    raise ValueError(
ValueError: Please, specify ignore_diags and/or IC balance this cooler! Cannot access the value used in IC balancing. 

- Is it possible to sample contact on raw matrices? What is the point of doing downsampling after balancing?

Also

ValueError: Please, specify ignore_diags and/or IC balance this cooler! Cannot access the value used in IC balancing. '

But ignore_diags is not an argument from sample()

 cooltools.sample(clr = cooler_list[ndx], out_clr_path = dnsmpl_cool_file, cis_count = cis_contact_trgt, ignore_diags=0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sample() got an unexpected keyword argument 'ignore_diags'

- To bypass the balancing in sample(), can I duplicate 'cov_cis_raw' columns in .bins() and call it 'weight' so it fakes balance cool file and downsample on raw cis contacts? And then I'll need to balance it anyway after the downsampling.

- Usually, is downsampling done on full matrix or do you ignore first(s) diagonal for some reason?

- additional question:

store (bool, optional) – If True, store the results in the input cooler file when finished. Does it mean that the result is stored in the python variable or directly modified in the cool file? I am not sure to understand whether the matrices are loaded and only loaded version is modified or if the original file it-self is modified.

Thanks for the support, David


# session_info.show()
# -----
# bbi                 0.3.5
# bioframe            0.4.1
# click               8.1.4
# cooler              0.9.2
# coolpuppy           NA
# cooltools           0.5.4
# cytoolz             0.12.1
# ipywidgets          8.0.7
# matplotlib          3.7.2
# mpl_toolkits        NA
# numpy               1.24.4
# pandas              1.5.3
# pysam               0.21.0
# scipy               1.11.1
# seaborn             0.12.2
# session_info        1.0.0
# -----
# Python 3.9.5 (default, Jun  4 2021, 12:28:51) [GCC 7.5.0]
# Linux-5.3.18-150300.59.60-default-x86_64-with-glibc2.31
# -----
# Session information updated at 2023-07-28 12:37
Phlya commented 1 year ago

These are good questions, although there is a lot to unpack here. I think the most important idea I can give you is to calculate coverage with ignore_diags=0, I think it should solve all your problems...

The values are indeed stored in the file, and .sample() uses the cis count when it is available in the file. When not, it calculates the coverage with default arguments (! perhaps not ideal and should be changed to ignore_diags=0 !) and uses that.

gfudenberg commented 1 year ago

discussion 2023/10/9: perhaps we should deprecate the usage of stored cis counts in sample(), as it is not obvious how to link this stored value with the number of diagonals ignored in previous coverage() calculation. https://github.com/open2c/cooltools/blob/0a0d8417099f182e1e24b3897fdf41de6b08844a/cooltools/api/sample.py#L100