An entire Snakefile that I'd use to call dots at 5 and 10 kb and combine the results for a bunch of samples/coolers. Broken down and commented:
```python
# you'd typically need ~10kb and 5kb coolers to perform GM-level dot-calling
# also, describe your samples here - i.e. ./path(1,2)/your-sample(1,2).cool must exist
resolutions = [5000, 10000]
names = ["your-sample1", "your-sample2"]
paths = ["path1", "path2"]

# the snake-thing about wildcards ...
wildcard_constraints:
    res="\d+"

# get the combined bedpe lists with the dots
rule all:
    input:
        expand("{path}/combineddots/final_cloops_{hic_name}.combined.bedpe", zip, path=paths, hic_name=names)

# force rebalance - in case you'd want cis only, modify tolerance, diags, whatever ...
rule balance:
    input:
        "{path}/{hic_name}.{res}.cool"
    output:
        touch("{path}/touchtmp/{hic_name}.{res}.touch")
    threads: 9
    shell:
        "cooler balance -p {threads} --ignore-diags 1 --force {input}"

# expected ...
rule compute_expected:
    input:
        "{path}/{hic_name}.{res}.cool",
        "{path}/touchtmp/{hic_name}.{res}.touch"
    output:
        "{path}/expected/{hic_name}.{res}.cool.cis.expected.ALL"
    threads: 9
    shell:
        "cooltools compute_expected -p {threads} --drop-diags 1 {input[0]} > {output}"

# chrM and sometimes chrY screw up dot-calling - remove 'em from the expected
# so that the dot-caller won't use them
rule clean_expected:
    input:
        "{path}/expected/{hic_name}.{res}.cool.cis.expected.ALL"
    output:
        "{path}/expected/{hic_name}.{res}.cool.cis.expected"
    shell:
        "grep -v -e \"^chrM\" {input} | grep -v -e \"^chrY\" > {output}"

# dot-calling for each sample/resolution ...
rule call_dots:
    input:
        cooler = "{path}/{hic_name}.{res}.cool",
        expected = "{path}/expected/{hic_name}.{res}.cool.cis.expected"
    params:
        fdr = 0.1,
        diag_width = 10000000,
        tile_size = 10000000
    output:
        signif_dots = "{path}/dots/cloops_{hic_name}.{res}.bedpe",
        filtered_dots = "{path}/dots/final_cloops_{hic_name}.{res}.bedpe"
    threads: 17
    shell:
        "cooltools call_dots -n {threads} "
        " -o {output.signif_dots} -v --fdr {params.fdr} "
        " --max-nans-tolerated 7 "
        " --max-loci-separation {params.diag_width} "
        " --dots-clustering-radius 21000 "
        " --tile-size {params.tile_size} "
        " {input.cooler} {input.expected}"

# merge dots using a little unfinished script that should probably end up
# in cooltools at some point: https://github.com/sergpolly/random_tools
rule merge_dots_res:
    input:
        expand("{{path}}/dots/final_cloops_{{hic_name}}.{res}.bedpe", res=resolutions)
    params:
        radius = 10000
    output:
        "{path}/combineddots/final_cloops_{hic_name}.combined.bedpe"
    shell:
        "python ./random_tools/merge_dot_lists.py "
        " --radius {params.radius} -v "
        " --output {output} {input}"

# in case anyone needs this - to prepare a multires bedpe for higlass ingestion ...
# rule prepare_multires:
#     input:
#         "final_cloops_{hic_name}.combined.bedpe"
#     output:
#         "final_cloops_{hic_name}.combined.bedpe.multires"
#     params:
#         genome = "hg19"
#     shell:
#         "clodius aggregate bedpe "
#         " -a {params.genome} --has-header "
#         " --chr1-col 1 --from1-col 2 --to1-col 3 "
#         " --chr2-col 4 --from2-col 5 --to2-col 6 "
#         " --output-file {output} {input} "
```
FYI the actual command that works for me is `cooltools call-dots`, i.e. not underscore, but dash... Not sure why this happens. Seems like with the new installation all underscores were converted to dashes, like in `compute-expected` etc. as well.
Feature request: an option to specify which chromosome(s) or regions to use - it would be convenient for quickly testing on just a small subset.
Also, `-s, --output-scores` is a required argument, but it is not used in the examples here (maybe it just doesn't need to be required and it's a bug?).
Also, I'd be curious to see rough timings for different resolutions/datasets/numbers of cores, and how much memory is required.
And I can't get it to work... I call it like this:
```sh
#!/bin/sh
# Grid Engine options (lines prefixed with #$)
#$ -cwd
#$ -N call_dots
#$ -l h_rt=12:00:00
#$ -l h_vmem=32G
#$ -pe sharedmem 4
#$ -j yes
#$ -V

cooltools call-dots -n 4 \
    -s ../output/dots/$(basename $1)_10kb_scores.hdf5 \
    -o ../output/dots/$(basename $1)_10kb_dots.bedpe \
    -v --fdr 0.1 \
    --max-nans-tolerated 7 \
    --max-loci-separation 10000000 \
    --dots-clustering-radius 21000 \
    --tile-size 5000000 \
    $1 $1_expected.tsv
```
Output:
```
Tokenization took: 128.85 ms
Type conversion took: 48.86 ms
Parser memory cleanup took: 0.01 ms
Tokenization took: 105.19 ms
Type conversion took: 33.80 ms
Parser memory cleanup took: 0.01 ms
Tokenization took: 0.72 ms
Type conversion took: 0.78 ms
Parser memory cleanup took: 0.00 ms
../coolers/2i.10000.cool and ../coolers/2i.10000.cool_expected.tsv passed cross-compatibility checks.
Kernels parameters are set as w,p=5,2 for the cooler with 10000 bp resolution.
Preparing to convolve 1566 tiles:
creating a Pool of 4 workers to tackle 1566 tiles
/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/cooltools/dotfinder.py:379: RuntimeWarning: invalid value encountered in true_divide
  Ek_raw = np.multiply(E_raw, np.divide(KO, KE))
/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/cooltools/dotfinder.py:379: RuntimeWarning: invalid value encountered in true_divide
  Ek_raw = np.multiply(E_raw, np.divide(KO, KE))
/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/cooltools/dotfinder.py:379: RuntimeWarning: invalid value encountered in true_divide
  Ek_raw = np.multiply(E_raw, np.divide(KO, KE))
Traceback (most recent call last):
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/bin/cooltools", line 11, in <module>
    load_entry_point('cooltools==0.1.0', 'console_scripts', 'cooltools')()
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/cooltools/cli/call_dots.py", line 537, in call_dots
    nproc, verbose)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/cooltools/cli/call_dots.py", line 138, in scoring_step
    for chunk in chunks:
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/multiprocess/pool.py", line 320, in <genexpr>
    return (item for chunk in result for item in chunk)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/multiprocess/pool.py", line 735, in next
    raise value
multiprocess.pool.MaybeEncodingError: Error sending result:
```
And this is followed by a long list of what looks like dataframes with information about, perhaps, chunks? The full output is attached: call_dots.o4801378.txt
Similar errors have been reported when the objects passed in multiprocessing are too big, so that's why I tried reducing the chunksize and max separation, but no luck. I can try to make them even smaller, but presumably you tried it with similar values and it worked, so maybe this is not the issue anyway?
Could you try:
```sh
#!/bin/sh
# Grid Engine options (lines prefixed with #$)
#$ -cwd
#$ -N call_dots
#$ -l h_rt=12:00:00
#$ -l h_vmem=64G
#$ -pe sharedmem 8
#$ -j yes
#$ -V

export MKL_NUM_THREADS=1

cooltools call-dots -n 8 \
    -s ../output/dots/$(basename $1)_10kb_scores.hdf5 \
    -o ../output/dots/$(basename $1)_10kb_dots.bedpe \
    -v --fdr 0.1 \
    --max-nans-tolerated 7 \
    --max-loci-separation 2000000 \
    --dots-clustering-radius 21000 \
    --tile-size 2000000 \
    $1 $1_expected.tsv
```
OK, it worked now, thank you! But it didn't actually save a bedpe file, only an hdf5 dump... Maybe when I was struggling with the environment I accidentally reinstalled from the main branch or smth? I'll try to reinstall and retry.
Also, do you really think it requires so much memory? This is 64Gb per core, so 64*8=512Gb...
Ah actually, I checked, and it only used 33Gb total memory.
Weird - I reinstalled it from `shrink-donut-dotfinder`, re-ran with exactly the same command (except with reduced resources, so it starts running faster), and now I get an error:
```
Tokenization took: 69.72 ms
Type conversion took: 26.42 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 102.02 ms
Type conversion took: 27.02 ms
Parser memory cleanup took: 0.01 ms
Tokenization took: 0.56 ms
Type conversion took: 0.61 ms
Parser memory cleanup took: 0.00 ms
../coolers/WT.10000.cool and ../coolers/WT.10000.cool_expected.tsv passed cross-compatibility checks.
Kernels parameters are set as w,p=5,2 for the cooler with 10000 bp resolution.
preparing big tiles ...
preparing small diagonal tiles ...
scoring and histogramming (DEBUG WAY) ...
Preparing to convolve and build histogramms ...
Creating a Pool of 8 workers
Pool of 8 workers is going to tackle 1327 tiles
Pool of 8 workers is going to tackle 2651 tiles
reduce hchunks and diag_hchunks ...
Traceback (most recent call last):
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/bin/cooltools", line 11, in <module>
    sys.exit(cli())
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/cooltools/cli/call_dots.py", line 1354, in call_dots
    nproc, verbose)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/cooltools/cli/call_dots.py", line 651, in scoring_and_histogramming_step
    .format(k,last_la_exp_bin)
AssertionError: There are la_exp.donut.value in (7486.107, inf], please check the histogram
```
Is it normal that the pool of 8 workers is going to tackle some number of tiles twice? And does it have anything to do with the error?
`shrink-donut-dotfinder` is indeed going to do the convolution twice! Once in a wider region (`max-loci-separation`) around the diagonal, and one more time in a fixed small region right around the diagonal, where your donuts start to shrink - that's the price we're paying to implement it. The histograms of locally adjusted expected (`la_exp`) are binned geometrically, with the last finite bin edge at `2^(HiCCUPS_W1_MAX_INDX/3)`, so whenever something gets into the bin `[2^(HiCCUPS_W1_MAX_INDX/3), +inf)` the program stops and yells at you ...
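For intuition, a minimal sketch of what that geometric bin ladder looks like (the cap value of 40 here is purely illustrative, not the shipped default):

```python
import numpy as np

# lambda-chunking histograms bin locally adjusted expected geometrically:
# each bin edge is 2^(1/3) times the previous one, capped at
# 2^(HiCCUPS_W1_MAX_INDX/3); anything larger falls into the open-ended bin
HiCCUPS_W1_MAX_INDX = 40  # illustrative value only
edges = 2.0 ** (np.arange(HiCCUPS_W1_MAX_INDX + 1) / 3.0)
print(edges[-1])  # ~10321.3 - raising HiCCUPS_W1_MAX_INDX raises this cap
```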
I see, thanks!
Yeah, it ran now with 8*8 Gb (when it gave that error).
OK, maybe it's just because I have fewer contacts than in their data? Anyway, is there some quick fix for this I could implement?
So @Phlya, temporary solution: let's make this `HiCCUPS_W1_MAX_INDX` a command-line parameter for now, so that you can adjust it as needed.
As a justification, let me share some slides with you: https://docs.google.com/presentation/d/1-hwAS4G5G4LbmmelonE4uubmfd6-V1XzpDI3z72RdZA/edit?usp=sharing - slides 24-30 are all about the improvements that `shrinking_donuts` brings: look at the short-range "dots/loops" in the histograms (slides 24-25); slides 26-27 show what the shrinking donuts actually look like.
Does it automatically exclude the first two diagonals, btw? If not, this could cause very high read counts, I guess...
The `call_dots` script auto-ignores only the main diagonal (true-HiCCUPS style). But given an `expected` with 2 diagonals "missing", it would ignore 2. Again, all of that stems from the fact that I was trying to make `call_dots` as close to HiCCUPS as possible, so I was literally working on one example, i.e. their GM data, using their binned contacts (dumped from `.hic` into `.cool`) and even their balancing.
I was meant to start comparing entire pipelines, but never got to that point - i.e. let's take their raw fastqs, map them with distiller, bin the data, balance (an important question here is cis vs global balancing - they do cis-only, because they can't do global ...), call dots and compare again - this stuff is still pending.
OK, let me do the fix, so you could run the thing at least.
OK, very interesting presentation, thank you! (A LOT of reads you have there, but 51% trans??? I had to ask, sorry)
Yeah, my expected has two diagonals nan'd, so it won't be that - saw you had that problem in the presentation, so checked already.
Thank you, let me know when/how it works!
OK, I pushed this tiny thing: https://github.com/dekkerlab/cooltools/commit/29c70f250844ddf6071020a410f4221a18fd564f - it should help you if you set this parameter to, say, 43 ...
There are more gotchas from HiCCUPS in the `shrinking_donuts` branch - all of them stem from trying to get closer to HiCCUPS, which implies including their "bugs/weirdness":
like this: https://github.com/dekkerlab/cooltools/blob/29c70f250844ddf6071020a410f4221a18fd564f/cooltools/cli/call_dots.py#L1286
and this: https://github.com/dekkerlab/cooltools/blob/29c70f250844ddf6071020a410f4221a18fd564f/cooltools/cli/call_dots.py#L1295
It's the same operation - creating the log-bins for the locally adjusted expected (i.e., your donuts, shmonuts etc. - slide 22 from that keynote) - and I do it twice: once with the factor 2^(1/3) and the other time with the factor 2^0.33. That's what happens in HiCCUPS - they do it in 2 places differently! (I think I'll report it as an issue to them.) It's not a big deal, but it causes a >1% discrepancy in the final dot-calls ...
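For a sense of scale, a quick comparison of the two ladders (pure illustration):

```python
import numpy as np

# the two geometric ladders from the linked lines: factor 2^(1/3) in one
# place, factor 2^0.33 in the other - they drift apart as the bin index grows
k = np.arange(40)
edges_exact = 2.0 ** (k / 3.0)     # 2^(1/3) per step
edges_approx = 2.0 ** (0.33 * k)   # 2^0.33 per step
print((edges_exact / edges_approx)[-1])  # ~1.09 by bin 39
# borderline pixels land in different lambda-bins, hence the ~1% discrepancy
```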
So, just beware: the `shrinking_donuts` branch is me trying to get closer to HiCCUPS, to reproduce it - not necessarily improve upon it. All of that will be rewritten in a sane manner ... at some point.
Oh, 51% trans - that's for H1-ESC cells. I don't know much biology, but that's just how heatmaps look for stem cells, you know ... they give fuzzier heatmaps; I don't really know what happens with the chromosomes.
Thank you! I started it with 43, let's see.
Yeah, I realize you are just trying to reproduce their algorithm as closely as possible for now!
Yeah, must be very different from mouse ES cells then, there it's more like 10-15% trans...
I feel really stupid here, but I get this:
```
Traceback (most recent call last):
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/bin/cooltools", line 11, in <module>
    sys.exit(cli())
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
TypeError: call_dots() got an unexpected keyword argument 'hist_max_bins_expected'
```
I added `cooltools call-dots -h` before the command, and the new argument is there, so I did install the new version...
my bad ... wait
No worries, thank you! Just need to find the value that works now, 43 still failed...
Set that argument from above to 50, and got a new problem:
... A few more clustering infos above ...

```
Clustering is completed:
21 clusters detected
1.62+/-0.84 mean size
Traceback (most recent call last):
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/bin/cooltools", line 11, in <module>
    sys.exit(cli())
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/cooltools/cli/call_dots.py", line 1511, in call_dots
    dots_clustering_radius, verbose)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/cooltools/cli/call_dots.py", line 839, in clustering_step_local
    verbose=verbose)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/cooltools/dotfinder.py", line 120, in clust_2D_pixels
    brc.fit(pixels)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/sklearn/cluster/birch.py", line 449, in fit
    return self._fit(X)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/sklearn/cluster/birch.py", line 452, in _fit
    X = check_array(X, accept_sparse='csr', copy=self.copy)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/sklearn/utils/validation.py", line 462, in check_array
    context))
ValueError: Found array with 0 sample(s) (shape=(0, 2)) while a minimum of 1 is required.
```
Any chance you have some weird (almost empty, no Hi-C signal) chromosomes in the dataset? Ones that wouldn't have any "dots"? I had it in human data with the `chrY` chromosome (`chrM` as well), even in male cells ...
In case this is the reason, the way to exclude the weirdos would be to remove them from the `expected` - as described above: https://github.com/mirnylab/cooltools/issues/39#issuecomment-419562784 - this paragraph:
> chrM and sometimes chrY screw up dot calling - remove em from expected so that dot-caller won't use them
Awesome, it's working!
For future reference, here is the command I used to remove chrY and chrM:

```sh
grep -v chrM coolers/WT.10000.cool_expected.tsv | grep -v chrY > coolers/WT.10000.cool_expected_nochrYM.tsv
```
Why do I get this error with some coolers?..
```
Ready to concatenate chunks if filtered pixels ...
Traceback (most recent call last):
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/bin/cooltools", line 11, in <module>
    sys.exit(cli())
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/cooltools/cli/call_dots.py", line 1468, in call_dots
    nproc, verbose)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/cooltools/cli/call_dots.py", line 775, in scoring_and_extraction_step
    assert ~significant_pixels.duplicated().any()
AssertionError
```
Another side effect of rapid feature-driven development ...
Because we are convolving the heatmap with the kernels (donut, lowleft, etc.) twice - one time for the whole region of `max-loci-separation`, and the other time for a narrow band around the diagonal for the shrinking donuts - and then we're merging the results together.
I tried to make it so that there was no overlap between the two - there must be only ONE entry for each pixel in the final dataframe of locally-adjusted expecteds, i.e. loc.-adj. expected calculated using shrunken kernels must overwrite the ones calculated using full-sized kernels.
The key phrase here is "I tried" ...
And the thing is, I had this issue as well, and thought that it got fixed, but apparently not so much ...
I left these assertions throughout the code in order to catch any issues ...
What I can do, for now, is introduce a flag `--production` that would ignore these assertions. That overlap is not that big of a deal actually - it's more of a sign to me that there are imperfections throughout the code ... @Phlya what do you think - do you want to go ahead with a little bug inside, for now, or wait for a proper fix?
Well, sounds like the "bug" actually doesn't affect any of the results, right? I'd be fine with a quick fix then. Except I don't quite understand why this happens with some files and not others?
Minor issue: in the non-final files there is a `weight1` column between the coordinates of the two ends, which complicates plotting based on just the first 6 columns... Plotting is easiest to implement for general bedpe files when any columns beyond the minimal 6 come after them.
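In case it helps anyone in the meantime, a hypothetical pandas workaround - the column names here are guesses, check your actual header:

```python
import pandas as pd

# hypothetical cleanup: move any non-standard columns (e.g. weight1)
# past the first 6, yielding a plain BEDPE up front
bedpe6 = ["chrom1", "start1", "end1", "chrom2", "start2", "end2"]
df = pd.read_table("cloops_WT.10000.bedpe")  # illustrative file name
extra = [c for c in df.columns if c not in bedpe6]
df[bedpe6 + extra].to_csv("cloops_WT.10000.clean.bedpe", sep="\t", index=False)
```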
It would be great to have an option to pass an adjustment factor to `thresholding_step()` that allows modifying the threshold required for the enrichment factors, e.g. `enrichment_factor_1 = (enrichment_factor_1 - 1) * adjustment_factor + 1`. See the sketch below.
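To make the request concrete, a tiny sketch of the proposed rescaling, where `adjustment_factor` is the hypothetical new knob (1.0 reproduces the current behavior):

```python
# hypothetical: rescale an enrichment threshold around 1.0
def adjust(enrichment_factor, adjustment_factor):
    return (enrichment_factor - 1.0) * adjustment_factor + 1.0

print(adjust(1.75, 1.0))  # 1.75 - unchanged
print(adjust(1.75, 0.5))  # 1.375 - more permissive
```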
It would also be great to report `(cluster_start1, cluster_stop1)`, `(cluster_start2, cluster_stop2)` in the post-processed calls for each cluster (i.e. the bounding rectangle).
Is there a way to adjust the resolution/bin size? I got this working but would like to try a smaller resolution, as I am using Micro-C and I see that call-dots missed lots of loops at smaller resolutions.
Hey @xinyangbing - you are absolutely right in the sense that the default `call_dots` CLI interface does not expose a lot of parameters (yet) - currently its parameters are an exact copy of those from HiCCUPS that were applied to the very deep GM12... dataset in Rao et al 2014, and they are extremely strict and optimized for 5/10kb ...
At the moment, the best way to adjust any internal parameter (and maybe to skip some filtering steps) is to use the jupyter notebook: https://cooltools.readthedocs.io/en/latest/notebooks/08_dot-calling-internals.html https://github.com/mirnylab/cooltools/blob/master/docs/notebooks/08_dot-calling-internals.ipynb
As far as higher resolutions go, the most important parameters to adjust would be `w` and `p` -> the half-width of the "donut" (the convolution kernel, as we call it here) and the half-width of the internal part of the "donut" (target, dot-footprint, etc.) - https://cooltools.readthedocs.io/en/latest/notebooks/08_dot-calling-internals.html#Kernels-used-for-calculating-local-enrichment-of-a-pixel
Here are the sizes that they use in Rao et al 2014 (HiCCUPS):

- 25kb: `w=3, p=1` ---> donut width `(2*w+1)*binsize` ~175kb; target width `(2*p+1)*binsize` ~75kb
- 10kb: `w=5, p=2` ---> donut width `(2*w+1)*binsize` ~110kb; target width `(2*p+1)*binsize` ~50kb
- 5kb: `w=7, p=4` ---> donut width `(2*w+1)*binsize` ~75kb; target width `(2*p+1)*binsize` ~45kb

So what would be reasonable `w` and `p` for 2kb? We don't have a definitive answer, but perhaps something like `w` ~13+/-2 and `p` ~9-10 could be a good start: `(2w+1)*binsize` ~54kb and `(2p+1)*binsize` ~42kb.
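A quick sanity check of those numbers using the same arithmetic (illustration only, not a cooltools API):

```python
# donut width = (2*w + 1) * binsize, target width = (2*p + 1) * binsize
def kernel_widths(w, p, binsize):
    return (2 * w + 1) * binsize, (2 * p + 1) * binsize

for binsize, w, p in [(25000, 3, 1), (10000, 5, 2), (5000, 7, 4), (2000, 13, 10)]:
    donut, target = kernel_widths(w, p, binsize)
    print(f"{binsize//1000}kb: w={w}, p={p} -> donut ~{donut//1000}kb, target ~{target//1000}kb")
```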
please let us know if you are willing to give the step-by-step notebook a try - we can provide further assistance with that
This issue is going to briefly touch upon the state of the dot-calling in `cooltools`.

### Intro

We have been trying to re-implement the HiCCUPS dot-calling algorithm under the `cooltools` umbrella for some time now. It is still under active development, and right now the code is scattered across forks and branches.
### master

The initial progress that we made with dot-calling, by implementing the convolution-based calculation of the locally adjusted expected (`donut`, `lowleft`, `vertical`, `horizontal`), is reflected in this repository in the `master` branch. The post-processing steps in this `master` are closest to the so-called BH-FDR version of dot-calling in the original HiCCUPS paper (Rao et al 2014) - in the sense that we do not do the lambda-chunking to perform multiple hypothesis testing. Moreover, this implementation simply ends with a dump of the pre-calculated adjusted expected for the different kernels - `donut`, `lowleft`, `vertical`, `horizontal` - and its post-processing is in a bad shape. Thus this isn't very usable for now, not for the final dot-calling at least.
### new-dotfinder

The lambda-chunking procedure is implemented in the `new-dotfinder` branch of the `dekkerlab` fork of `cooltools`, which is `pip`-installable:
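(The original code block appears to have been lost here; presumably it was something along these lines - a plain install vs. an editable install of the fork named above:)

```sh
pip install git+https://github.com/dekkerlab/cooltools.git@new-dotfinder
# or, to get a modifiable copy of the source:
git clone -b new-dotfinder https://github.com/dekkerlab/cooltools.git
pip install -e ./cooltools
```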
The second command would allow you to modify the source code, whereas the 1st one would simply install it. One would want to modify the source code if, for instance, the enrichment thresholds need changing, as those are not implemented as CLI options just yet. A typical run would be:
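(The command block is also missing here; a plausible invocation, mirroring the `call_dots` rule from the Snakefile earlier in this thread - the file names are illustrative:)

```sh
cooltools call-dots -n 8 \
    -s dots/scores_sample.10000.hdf5 \
    -o dots/cloops_sample.10000.bedpe \
    -v --fdr 0.1 \
    --max-nans-tolerated 7 \
    --max-loci-separation 10000000 \
    --dots-clustering-radius 21000 \
    --tile-size 10000000 \
    sample.10000.cool sample.10000.cool.cis.expected
```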
This would produce a list of dots that passed multiple hypothesis testing (the lambda-chunking step itself) but that haven't been post-processed, i.e. clustered or filtered by enrichment. The post-processed list of dots would show up in the same folder as `signif_dots`, but with the prefix `final_` - we'll fix this ugliness later on, of course.

The `call_dots` CLI determines the resolution and picks the correct kernel parameters `w`, `p` accordingly (all of the defaults are kept the same as HiCCUPS). This dot-caller, albeit very close to the HiCCUPS implementation, deviates from it in some regards; the most important aspects:

1. no dynamic kernel resizing (neither shrinking near the diagonal, nor enlarging based on the value of the `lowleft` kernel)
2. clustering of significant pixels is done with `Birch`, whereas HiCCUPS implements a special greedy algorithm for that - the results are very close anyways.
3. treatment of bad bins (`NaNs` after balancing): HiCCUPS disregards pixels that are within ~5 pixels of bad rows/cols; instead, we simply check the number of `NaNs` in a kernel footprint against `--max-nans-tolerated` - given the resolution/size of the kernels, one can work out which pixels would be discarded.

... I might add more details here later on, and edit/elaborate more on this here ...
### shrink-donut-dotfinder

This branch elaborates further on top of `new-dotfinder` by dealing with discrepancy #1 - dynamic kernel resizing. As the name suggests, we implemented only the near-diagonal kernel shrinkage here - arguably the most important aspect of the dynamic kernels, the one that was preventing us from calling dots really close to the diagonal and driving the deviation between `cooltools` dot-calling and HiCCUPS. There are no additional parameters for the user to control in this case; everything is done the same way as in HiCCUPS - and this is a data-independent kernel shrinkage, as opposed to enlarging the kernels based on the value of the `lowleft` kernel for each pixel tested.

This branch also eliminates another difference between HiCCUPS and `cooltools`, which is related to the way lambda-chunking is implemented and is too technical to describe here. To give some numbers: for the Rao et al 2014 GM... primary dataset we are getting ~7700 dots vs ~8050 by HiCCUPS, with an overlap of ~7600.