Hi Vittorio, thanks for your feedback. It seems that your cool file contains multiple resolutions. To use just one resolution with chromosight, you can do something like this:

```
chromosight detect --pattern=loops_small \
    --threads=10 \
    --min-dist=15000 \
    --max-dist=2000000 \
    4DNFI81RQ431.mcool::/resolutions/10000 \
    out_4DNFI81RQ431_10kb_loops_small
```

where `::/resolutions/10000` selects the resolution you want to use.
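If you are not sure which resolutions the file contains, you can list them with cooler's Python API (a short sketch using the same file name as above):

```python
import cooler

# List every cooler group stored in the multires file,
# e.g. ['/resolutions/1000', '/resolutions/10000', ...]
print(cooler.fileops.list_coolers("4DNFI81RQ431.mcool"))
```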
Hi @Mestizia,
Thanks for reporting this issue, and for your thorough investigation!
To answer your question, `max_dist` represents the longest interaction distance that will be retained throughout the analysis (this is done to reduce compute time and memory usage).
We indeed rely on the bin-size information to handle the conversion between base pairs and numbers of diagonals; therefore, chromosight is not expected to work with cool files that have a variable bin size.
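For intuition, the conversion is essentially an integer division by the bin size, as the traceback further down also shows (a simplified sketch, not chromosight's actual code):

```python
# With a fixed bin size, a genomic distance in base pairs maps to a
# whole number of matrix diagonals via integer division.
def bp_to_diagonals(dist_bp: int, binsize: int) -> int:
    return dist_bp // binsize

print(bp_to_diagonals(2_000_000, 10_000))  # 200 diagonals
# With a variable-bin cooler, clr.binsize is None and this division
# raises a TypeError instead.
```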
We should either:

- add an explicit check that fails early with an informative error when the bin size is variable, or
- add proper support for variable bin sizes.
If you feel like it, you are very welcome to propose a pull request for this and we will happily review it! Otherwise, we can handle this when time allows.
Thanks to both of you for the prompt replies.

@axelcournac Unfortunately, I spoke too soon. The tool crashed anyway at a later step where the binsize comes back into play (see further down).
While I can generate fixed bin sizes with `cooler zoomify` and specify the resolution as described in your reply, the file still shows `bin-type: "variable"` and `bin-size: null`, even though I am running `cooler info` on the specific resolution subgroup.

I am assuming that zoomify generates multiple fixed bin-size resolutions as described in the manual, but this is not reflected in the `cooler info` output:
https://cooler.readthedocs.io/en/latest/schema.html#multi-resolution
> A multi-resolution cooler file that contains multiple “coarsened” resolutions or “zoom-levels” derived from the same dataset. Multires cooler files should store each data collection underneath a group called /resolutions within a sub-group whose name is the bin size (e.g, XYZ.1000.mcool::resolutions/10000). **If the base cooler has variable-length bins, then use 1 to designate the base resolution, and the use coarsening multiplier (e.g. 2, 4, 8, etc.) to name the lower resolutions**. Conventional file extension: .mcool.
It may be wishful thinking, but the wording seems to indicate that "non-base-resolution" files have a fixed binsize.
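For reference, this is how the metadata can be checked from Python (a minimal sketch; the URI is a placeholder, with `/resolutions/2` following the variable-bin naming convention quoted above):

```python
import cooler

# Placeholder URI: one zoom level of a multires file built from
# variable-length (fragment) bins, named by its coarsening multiplier.
clr = cooler.Cooler("matrix.mcool::/resolutions/2")

print(clr.info["bin-type"])  # 'variable' for fragment-based bins
print(clr.binsize)           # None when bins are variable-length
```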
This is the crash mentioned above:

```
Traceback (most recent call last):
  File "/home/vtracann/miniconda3/envs/chromosight/bin/chromosight", line 10, in <module>
    sys.exit(main())
  File "/home/vtracann/miniconda3/envs/chromosight/lib/python3.10/site-packages/chromosight/cli/chromosight.py", line 1000, in main
    cmd_detect(args)
  File "/home/vtracann/miniconda3/envs/chromosight/lib/python3.10/site-packages/chromosight/cli/chromosight.py", line 807, in cmd_detect
    separation_bins = int(cfg["min_separation"] // hic_genome.clr.binsize)
TypeError: unsupported operand type(s) for //: 'int' and 'NoneType'
```
Since I can determine the binsize from my previous work, my temporary solution would be to hardcode the binsize to that specific value. From what I can tell, the bin size is used either as a denominator, in which case the resulting value ends up small enough that the tool falls back to its default of 1, or as a numerator when computing the "scanning distance", in which case the number ends up large. Computationally this is not a problem on my end, and I don't need heuristics to speed things up (i.e. a smaller scanning distance, if I am interpreting it correctly).
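To make the idea concrete, here is a purely hypothetical sketch of that hack through cooler's Python API (`KNOWN_BINSIZE` and the URI are made up, and the patch only affects code running in the same Python process):

```python
import cooler

# Made-up values: a known, roughly constant fragment length in bp,
# and a placeholder URI for one zoom level of the file.
KNOWN_BINSIZE = 5000
URI = "matrix.mcool::/resolutions/2"

# Illustrative hack: override cooler's read-only binsize property so
# downstream code sees a fixed value instead of None.
cooler.Cooler.binsize = property(lambda self: KNOWN_BINSIZE)

clr = cooler.Cooler(URI)
print(clr.binsize)  # 5000 instead of None
```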
Do you think this temporary solution would introduce artifacts in the data?
Once this is clarified, I am up for making a pull request (with slightly cleaner solutions). Generally, I would commend you and the team for the clean code; it was quite easy to navigate and debug on my end! I could also see this working better if cooler itself were able to detect when a variable-binsize cool file actually has a fixed binsize. I briefly looked in their repository, but I couldn't find an immediate solution.
Cheers, Vittorio
If you use `cooler zoomify` on a file with variable bin size, the resulting bin sizes are still variable; zoomify only pools bins together by factors of 2. What we mean by "fixed" is that each bin represents a segment of the same length on the chromosome.
If your cool file has variable bin size, e.g. 1 bin = 1 restriction fragment, then even if you zoomify it (1 bin = 2, 4, 8, ... restriction fragments), there would still be no constant mapping between the number of bins and the number of basepairs. Thus many of our assumptions would fail and the calls would be unreliable. This is why we do not support variable bin sizes.
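A toy illustration of the problem, with made-up fragment boundaries:

```python
# Made-up restriction fragment boundaries along a chromosome (bp).
bin_edges = [0, 800, 5000, 5600, 12000]

# Genomic distance spanned by one diagonal step at each position:
steps = [bin_edges[i + 1] - bin_edges[i] for i in range(len(bin_edges) - 1)]
print(steps)  # [800, 4200, 600, 6400] -> no constant bp-per-bin
```

Pooling pairs of fragments (what zoomify does) just produces larger, still unequal, bins.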
You can of course use this hack if you know your matrix has a fixed resolution, but the clean solution is to build your matrix on a fixed bin size from the start, so that it is reflected in the metadata of the cool file.
@cmdoret Thank you for the clarification. I can look at the distribution of my bin lengths to verify that the spread is not significant. I would suggest formally including a step in chromosight that checks for variable bin size and explains the reasoning. Are there plans to support variable bin size in the future, or is it a limitation of the algorithms you use? As far as I can tell, a fixed bin size is not a characteristic of the data (due to the semi-random occurrence of restriction sites).
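Concretely, this is how I plan to check the spread (a minimal sketch; the URI is a placeholder):

```python
import cooler

# Placeholder URI pointing at one resolution of my file.
clr = cooler.Cooler("matrix.mcool::/resolutions/2")

# The bins table holds chrom/start/end for every bin; the spread of
# end - start shows how variable the bin lengths really are.
bins = clr.bins()[:]
lengths = bins["end"] - bins["start"]
print(lengths.describe())
```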
With that being said, you can close this issue. Many thanks to both of you! On the side, I will also try running chromosight with a fixed-binsize matrix, and I will write a report on how similar or different the results end up being for the genome I am working with.
Thanks for the feedback and suggestions @Mestizia! :) I've opened https://github.com/koszullab/chromosight/issues/67 to keep track of this.
It would in theory be possible to support variable bin size, but I might be forgetting something.
Hi, I am trying to compare the results I got from HiCExplorer TAD prediction to chromosight. However, I am having trouble using my matrix. I exported the matrix to .cool as described in the instructions, but I get an error (detailed below).

I downloaded the example.cool file and ran it in parallel for debugging.
I did some debugging on my end and got to the `keep_distance` function within `contact_map.py`. Here, `self.max_dist` is `None`, so we get

```python
mat_max_dist = self.matrix.shape[0] + self.largest_kernel
```

(10607 + 3 in my case). This value gets passed around (it goes as `max_dist` into `preprocess.detrend`), but at no point is `self.max_dist` set to a different value. The error message at the end appears because `matrix.max_dist` is never changed from `None` and gets fed to `numpy.tril`, which throws the error.

I checked my cool file with `cooler info` and the main difference compared to example.cool seems to be `bin-type: "variable"` with `bin-size: null` for my file, versus a fixed bin type for the example.cool file. I am not 100% sure that is the origin, but that is my best guess atm.
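For completeness, a minimal reproduction of that failure mode outside chromosight (toy matrix):

```python
import numpy as np

# numpy.tril expects an integer diagonal offset k; passing the
# unset max_dist (None) raises a TypeError, as happens in my run.
mat = np.ones((5, 5))
max_dist = None
np.tril(mat, k=max_dist)  # raises TypeError because k is None
```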
In order to make the tool work, I changed

```python
self.max_dist = self.keep_distance
```

in `contact_map.detrend`, and it seems to work now. However, I am not sure what the effect of this change is on the overall results, as I am not familiar with the role of `max_dist` in the overall tool. Do you think this workaround is likely to affect the final results in a major way?

Cheers, Vittorio