Hi Vittorio, thanks for your feedback. It seems that your cool file contains multiple resolutions. To use just one resolution with chromosight, you can do something like this:

```
chromosight detect --pattern=loops_small \
    --threads=10 \
    --min-dist=15000 \
    --max-dist=2000000 \
    4DNFI81RQ431.mcool::/resolutions/10000 \
    out_4DNFI81RQ431_10kb_loops_small
```

where `::/resolutions/10000` selects the resolution you want to use.
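If you are not sure which resolutions the file contains, you can list them with cooler's Python API (a short sketch using the same file name as above):

```python
import cooler

# List every cooler group stored in the multires file,
# e.g. ['/resolutions/1000', '/resolutions/10000', ...]
print(cooler.fileops.list_coolers("4DNFI81RQ431.mcool"))
```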
Hi @Mestizia,
Thanks for reporting this issue, and for your thorough investigation!
To answer your question, `max_dist` represents the longest interaction distance that will be retained throughout the analysis (this is done to reduce compute time and memory usage).
We indeed rely on the bin-size information to handle the conversion between base pairs and numbers of diagonals; therefore, chromosight is not expected to work with cool files that have a variable bin size.
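For intuition, the conversion is essentially an integer division by the bin size, as the traceback further down also shows (a simplified sketch, not chromosight's actual code):

```python
# With a fixed bin size, a genomic distance in base pairs maps to a
# whole number of matrix diagonals via integer division.
def bp_to_diagonals(dist_bp: int, binsize: int) -> int:
    return dist_bp // binsize

print(bp_to_diagonals(2_000_000, 10_000))  # 200 diagonals
# With a variable-bin cooler, clr.binsize is None and this division
# raises a TypeError instead.
```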
We should either:

- add an explicit check that fails early with an informative error when the bin size is variable, or
- add proper support for variable bin sizes.
If you feel like it, you are very welcome to propose a pull request for this and we will happily review it! Otherwise, we can handle this when time allows.
Thanks to both of you for the prompt replies.

@axelcournac Unfortunately, I spoke too soon. The tool crashed anyway at a later step where the binsize comes back into play (see further down).
While I can generate fixed bin sizes with `cooler zoomify` and specify the resolution as described in your reply, the file still shows `bin-type: "variable"` and `bin-size: null`, even though I am running `cooler info` on the specific resolution subgroup.

I am assuming that zoomify generates multiple fixed bin-size resolutions as described in the manual, but this is not reflected in the `cooler info` output:
https://cooler.readthedocs.io/en/latest/schema.html#multi-resolution
> A multi-resolution cooler file that contains multiple “coarsened” resolutions or “zoom-levels” derived from the same dataset. Multires cooler files should store each data collection underneath a group called /resolutions within a sub-group whose name is the bin size (e.g, XYZ.1000.mcool::resolutions/10000). **If the base cooler has variable-length bins, then use 1 to designate the base resolution, and the use coarsening multiplier (e.g. 2, 4, 8, etc.) to name the lower resolutions**. Conventional file extension: .mcool.
It may be wishful thinking, but the wording seems to indicate that "non-base-resolution" files have a fixed binsize.
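For reference, this is how the metadata can be checked from Python (a minimal sketch; the URI is a placeholder, with `/resolutions/2` following the variable-bin naming convention quoted above):

```python
import cooler

# Placeholder URI: one zoom level of a multires file built from
# variable-length (fragment) bins, named by its coarsening multiplier.
clr = cooler.Cooler("matrix.mcool::/resolutions/2")

print(clr.info["bin-type"])  # 'variable' for fragment-based bins
print(clr.binsize)           # None when bins are variable-length
```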
This is the crash mentioned above:

```
Traceback (most recent call last):
  File "/home/vtracann/miniconda3/envs/chromosight/bin/chromosight", line 10, in <module>
    sys.exit(main())
  File "/home/vtracann/miniconda3/envs/chromosight/lib/python3.10/site-packages/chromosight/cli/chromosight.py", line 1000, in main
    cmd_detect(args)
  File "/home/vtracann/miniconda3/envs/chromosight/lib/python3.10/site-packages/chromosight/cli/chromosight.py", line 807, in cmd_detect
    separation_bins = int(cfg["min_separation"] // hic_genome.clr.binsize)
TypeError: unsupported operand type(s) for //: 'int' and 'NoneType'
```
Since I can determine the binsize from my previous work, my temporary solution would be to hardcode the binsize to that specific value. From what I can tell, the bin size is used either as a denominator, in which case the resulting value ends up small enough that the tool falls back to its default of 1, or as a numerator when computing the "scanning distance", in which case the number ends up large. Computationally this is not a problem on my end, and I don't need heuristics to speed things up (i.e. a smaller scanning distance, if I am interpreting it correctly).
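To make the idea concrete, here is a purely hypothetical sketch of that hack through cooler's Python API (`KNOWN_BINSIZE` and the URI are made up, and the patch only affects code running in the same Python process):

```python
import cooler

# Made-up values: a known, roughly constant fragment length in bp,
# and a placeholder URI for one zoom level of the file.
KNOWN_BINSIZE = 5000
URI = "matrix.mcool::/resolutions/2"

# Illustrative hack: override cooler's read-only binsize property so
# downstream code sees a fixed value instead of None.
cooler.Cooler.binsize = property(lambda self: KNOWN_BINSIZE)

clr = cooler.Cooler(URI)
print(clr.binsize)  # 5000 instead of None
```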
Do you think this temporary solution would introduce artifacts in the data?
Once this is clarified, I am up for making a pull request (with slightly cleaner solutions). Generally, I would commend you and the team for the clean code; it was quite easy to navigate and debug on my end! I could also see this working better if cooler itself were able to detect when a variable-binsize cool file actually has a fixed binsize. I briefly looked in their repository, but I couldn't find an immediate solution.
Cheers, Vittorio
If you use `cooler zoomify` on a file with variable bin size, the resulting bin sizes are still variable; zoomify only pools bins together by factors of 2. What we mean by "fixed" is that each bin represents a segment of the same length on the chromosome.
If your cool file has variable bin size, e.g. 1 bin = 1 restriction fragment, then even if you zoomify it (1 bin = 2, 4, 8, ... restriction fragments), there would still be no constant mapping between the number of bins and the number of basepairs. Thus many of our assumptions would fail and the calls would be unreliable. This is why we do not support variable bin sizes.
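A toy illustration of the problem, with made-up fragment boundaries:

```python
# Made-up restriction fragment boundaries along a chromosome (bp).
bin_edges = [0, 800, 5000, 5600, 12000]

# Genomic distance spanned by one diagonal step at each position:
steps = [bin_edges[i + 1] - bin_edges[i] for i in range(len(bin_edges) - 1)]
print(steps)  # [800, 4200, 600, 6400] -> no constant bp-per-bin
```

Pooling pairs of fragments (what zoomify does) just produces larger, still unequal, bins.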
You can of course use this hack if you know your matrix has a fixed resolution, but the clean solution is to build your matrix on a fixed bin size from the start, so that it is reflected in the metadata of the cool file.
@cmdoret Thank you for the clarification. I can look at the distribution of my bin lengths to verify that the spread is not significant. I would suggest formally including a step in chromosight that checks for variable bin size and explains the reasoning. Are there plans to support variable bin size in the future, or is it a limitation of the algorithms you use? As far as I can tell, a fixed bin size is not a characteristic of the data (due to the semi-random occurrence of restriction sites).
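Concretely, this is how I plan to check the spread (a minimal sketch; the URI is a placeholder):

```python
import cooler

# Placeholder URI pointing at one resolution of my file.
clr = cooler.Cooler("matrix.mcool::/resolutions/2")

# The bins table holds chrom/start/end for every bin; the spread of
# end - start shows how variable the bin lengths really are.
bins = clr.bins()[:]
lengths = bins["end"] - bins["start"]
print(lengths.describe())
```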
With that being said, you can close this issue. Many thanks to both of you! On the side, I will also try running chromosight with a fixed-binsize matrix, and I will write a report on how similar or different the results end up being for the genome I am working with.
Thanks for the feedback and suggestions @Mestizia! :) I've opened https://github.com/koszullab/chromosight/issues/67 to keep track of this.
It would in theory be possible to support variable bin size, but I might be forgetting something.
Hi, I am trying to compare the results I got from HiCExplorer TAD prediction to chromosight. However, I am having trouble using my matrix. I exported the matrix to .cool as described in the instructions, but I get an error (detailed below).

I downloaded the example.cool file and ran it in parallel for debugging.
I did some debugging on my end and got to the `keep_distance` function within `contact_map.py`. Here, `self.max_dist` is `None`, so we get

```python
mat_max_dist = self.matrix.shape[0] + self.largest_kernel
```

(10607 + 3 in my case). This value gets passed around (it goes as `max_dist` into `preprocess.detrend`), but at no point is `self.max_dist` set to a different value. The error message at the end appears because `matrix.max_dist` is never changed from `None` and gets fed to `numpy.tril`, which throws the error.

I checked my cool file with `cooler info` and the main difference compared to example.cool seems to be `bin-type: "variable"` with `bin-size: null` for my file, versus a fixed bin type for the example.cool file. I am not 100% sure that is the origin, but that is my best guess atm.
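For completeness, a minimal reproduction of that failure mode outside chromosight (toy matrix):

```python
import numpy as np

# numpy.tril expects an integer diagonal offset k; passing the
# unset max_dist (None) raises a TypeError, as happens in my run.
mat = np.ones((5, 5))
max_dist = None
np.tril(mat, k=max_dist)  # raises TypeError because k is None
```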
In order to make the tool work, I changed

```python
self.max_dist = self.keep_distance
```

in `contact_map.detrend`, and it seems to work now. However, I am not sure what the effect of this change is on the overall results, as I am not familiar with the role of `max_dist` in the overall tool. Do you think this workaround is likely to affect the final results in a major way?

Cheers, Vittorio