open2c / cooltools

dot calling state of the art #39

Closed sergpolly closed 3 years ago

sergpolly commented 6 years ago

This issue briefly touches upon the state of dot-calling in cooltools.

Intro

We have been trying to re-implement the HiCCUPS dot-calling algorithm under the cooltools umbrella for some time now. It is still under active development, and right now the code is scattered across forks and branches.

master

The initial progress that we made with dot-calling, by implementing convolution-based calculation of the locally adjusted expected (donut, lowleft, vertical, horizontal), is reflected in the master branch of this repository. The post-processing steps in master are closest to the so-called BH-FDR version of dot-calling in the original HiCCUPS paper (Rao et al. 2014), in the sense that we do not do the lambda-chunking to perform multiple hypothesis testing. Moreover, this implementation simply ends with a dump of the pre-calculated adjusted expected for the different kernels: donut, lowleft, vertical, horizontal; the post-processing is in bad shape. Thus this isn't very usable for now, at least not for final dot-calling.
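
The core of that calculation can be sketched in a few lines of numpy/scipy - an illustration under assumed names, not the actual cooltools code:

    import numpy as np
    from scipy.ndimage import convolve

    def locally_adjusted_expected(obs, exp, kernel):
        # rescale the diagonal-decay expected by the local observed/expected
        # ratio inside the kernel footprint around each pixel; zeros/NaNs in
        # KE need care in practice (hence the RuntimeWarnings quoted further
        # down this thread)
        KO = convolve(obs, kernel, mode="constant", cval=0.0)
        KE = convolve(exp, kernel, mode="constant", cval=0.0)
        return exp * (KO / KE)

    # a "donut" kernel with half-width w=5 and a p=2 hole, HiCCUPS-style:
    w, p = 5, 2
    donut = np.ones((2 * w + 1, 2 * w + 1))
    donut[w - p : w + p + 1, w - p : w + p + 1] = 0  # cut out the center box
    donut[w, :] = 0  # cut out the central row and column
    donut[:, w] = 0

The other kernels (lowleft, vertical, horizontal) differ only in which part of the square footprint they keep.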

new-dotfinder

The lambda-chunking procedure is implemented in the new-dotfinder branch of the dekkerlab fork of cooltools, which is pip-installable:

pip install git+https://github.com/dekkerlab/cooltools@new-dotfinder
pip install -e git+https://github.com/dekkerlab/cooltools@new-dotfinder#egg=cooltools

The second command lets you modify the source code, whereas the first one simply installs it. You would want to modify the source code if, for instance, you need to change the enrichment thresholds - those are not exposed as CLI options just yet. A typical run would be:

cooltools call_dots -n {cores} 
    -o {signif_dots} -v --fdr 0.1 
    --max-nans-tolerated 7 
    --max-loci-separation 20000000 
    --dots-clustering-radius 21000 
    --tile-size 10000000 
    {input_cooler} {input_expected}

This produces a list of dots that passed multiple hypothesis testing (the lambda-chunking step itself) but have not been post-processed, i.e. clustered or filtered by enrichment. The post-processed list of dots shows up in the same folder as signif_dots but with the prefix final_ - we'll fix this ugliness later, of course. The call_dots CLI determines the resolution and picks the correct kernel parameters w, p accordingly (all defaults kept the same as HiCCUPS). This dot-caller, albeit very close to the HiCCUPS implementation, deviates from it in some regards; the most important aspects:

  1. fixed kernel size for every pixel (in HiCCUPS "donuts" are shrunk near the diagonal and enlarged as needed based on the value of lowleft)
  2. clustering is slightly different - we use off-the-shelf Birch, while HiCCUPS implements a special greedy algorithm for it; the results are very close anyway.
  3. a really minor detail in the way we treat pixels near bad rows/columns (ones filled with NaNs after balancing): HiCCUPS disregards pixels that are within ~5 pixels of bad rows/cols, whereas we simply check the number of NaNs in a kernel footprint against --max-nans-tolerated; given the resolution/size of the kernels one can work out which pixels would be discarded (see the sketch below).
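
For illustration, the NaN-footprint check from point 3 could look like this - a hedged sketch, where the convention for out-of-matrix pixels is my assumption:

    import numpy as np
    from scipy.ndimage import convolve

    def discard_mask(mat, kernel, max_nans_tolerated=7):
        # count NaNs under the kernel footprint at every pixel;
        # cval=1.0 treats anything outside the matrix as NaN (an assumption)
        footprint = (kernel != 0).astype(float)
        nan_count = convolve(np.isnan(mat).astype(float), footprint,
                             mode="constant", cval=1.0)
        return nan_count > max_nans_tolerated  # True = discard this pixel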

... I might add more details here later on, and edit/elaborate on this further ...

shrink-donut-dotfinder

This branch builds further on top of new-dotfinder by dealing with discrepancy #1 - dynamic kernel resizing. As the name suggests, here we implemented only the near-diagonal kernel shrinkage - arguably the most important aspect of the dynamic kernels, which was preventing us from calling dots really close to the diagonal and driving the deviation between cooltools dot-calling and HiCCUPS. There are no additional parameters for the user to control in this case; everything is done the same way as in HiCCUPS. This is data-independent kernel shrinkage (sketched below), as opposed to enlarging kernels based on the value of the lowleft kernel for each pixel tested.
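
Just to convey the idea (the actual rule lives in the branch and may differ in details), data-independent shrinkage can be pictured as capping the donut half-width for a pixel at diagonal offset d bins so that the footprint stays on one side of the diagonal:

    def shrink_w(w, p, d):
        # purely hypothetical sketch: never below the inner box plus one ring,
        # never above the default half-width w
        return max(p + 1, min(w, (d - 1) // 2))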

This branch also eliminates another difference between HiCCUPS and cooltools, which is related to the way lambda-chunking is implemented and is too technical to describe here. To give some numbers, for the Rao et al. 2014 GM... primary dataset we are getting ~7700 dots vs ~8050 by HiCCUPS, with an overlap of ~7600.

sergpolly commented 6 years ago

An entire Snakefile that I'd use to call dots at 5 and 10 kb and combine them for a bunch of samples/coolers. Broken down and commented:

you'd typically need 10 kb and 5 kb coolers to perform GM-level dot-calling. Also, describe your samples here - i.e. ./path(1,2)/your-sample(1,2).cool must exist:

resolutions = [5000,10000]
names = ["your-sample1","your-sample2"]
paths = ["path1", "path2"]

the snake-thing about wildcards ...

wildcard_constraints:
    res="\d+"

get the combined bedpe lists with the dots

rule all:
    input:
        expand("{path}/combineddots/final_cloops_{hic_name}.combined.bedpe",zip,path=paths,hic_name=names)

force rebalance - in case you'd want cis only, modify tolerance, diags, whatever ...

rule balance:
    input:
        "{path}/{hic_name}.{res}.cool"
    output:
        touch("{path}/touchtmp/{hic_name}.{res}.touch")
    threads: 9
    shell:
        "cooler balance -p {threads} --ignore-diags 1 --force {input}"

expected ...

rule compute_expected:
    input:
        "{path}/{hic_name}.{res}.cool",
        "{path}/touchtmp/{hic_name}.{res}.touch"
    output:
        "{path}/expected/{hic_name}.{res}.cool.cis.expected.ALL"
    threads: 9
    shell:
        "cooltools compute_expected -p {threads} --drop-diags 1 {input[0]} > {output}"

chrM and sometimes chrY screw up dot calling - remove them from the expected so that the dot-caller won't use them

rule clean_expected:
    input:
        "{path}/expected/{hic_name}.{res}.cool.cis.expected.ALL"
    output:
        "{path}/expected/{hic_name}.{res}.cool.cis.expected"
    shell:
        "grep -v -e \"^chrM\" {input} | grep -v -e \"^chrY\" > {output}"

dot-calling for each sample/resolution ...

rule call_dots:
    input:
        cooler = "{path}/{hic_name}.{res}.cool",
        expected = "{path}/expected/{hic_name}.{res}.cool.cis.expected"
    params:
        fdr = 0.1,
        diag_width = 10000000,
        tile_size = 10000000
    output:
        signif_dots="{path}/dots/cloops_{hic_name}.{res}.bedpe",
        filtered_dots="{path}/dots/final_cloops_{hic_name}.{res}.bedpe"
    threads: 17
    shell:
        "cooltools call_dots -n {threads} "
        "    -o {output.signif_dots} -v --fdr {params.fdr} "
        "    --max-nans-tolerated 7 "
        "    --max-loci-separation {params.diag_width} "
        "    --dots-clustering-radius 21000 "
        "    --tile-size {params.tile_size} "
        "    {input.cooler} {input.expected}"

merge dots using a little unfinished script that should probably end up in cooltools at some point: https://github.com/sergpolly/random_tools (a sketch of the idea follows the rule below)

rule merge_dots_res:
    input:
        expand("{{path}}/dots/final_cloops_{{hic_name}}.{res}.bedpe",res=resolutions)
    params:
        radius = 10000
    output:
        "{path}/combineddots/final_cloops_{hic_name}.combined.bedpe"
    shell:
        "python ./random_tools/merge_dot_lists.py "
        "    --radius {params.radius} -v "
        "    --output {output} {input}"

in case anyone needs this - to prepare multires bedpe for higlass ingestion ...

# rule prepare_multires:
#     input:
#         "final_cloops_{hic_name}.combined.bedpe"
#     output:
#         "final_cloops_{hic_name}.combined.bedpe.multires"
#     params:
#         genome = "hg19"
#     shell:
#         "clodius aggregate bedpe "
#         "       -a {params.genome} --has-header "
#         "       --chr1-col 1 --from1-col 2 --to1-col 3 "
#         "       --chr2-col 4 --from2-col 5 --to2-col 6 "
#         "       --output-file {output} {input} "
Phlya commented 6 years ago

FYI the actual command that works for me is cooltools call-dots, i.e. with a dash, not an underscore... Not sure why this happens. It seems like with the new installation all underscores were converted to dashes, as in compute-expected etc. as well.

Feature request: specify which chromosome(s) or regions to use, it would be convenient to quickly test on just a small subset.

Also, -s, --output-scores is a required argument, but it is not used in the examples here (maybe it just doesn't need to be required and this is a bug?).

Phlya commented 6 years ago

Also, would be curious to see rough timings for different resolutions/datasets/numbers of cores, and how much memory is required.

Phlya commented 6 years ago

And I can't get it to work... Here is the error:

I call it like this:

#!/bin/sh
# Grid Engine options (lines prefixed with #$)
#$ -cwd
#$ -N call_dots
#$ -l h_rt=12:00:00
#$ -l h_vmem=32G
#$ -pe sharedmem 4
#$ -j yes
#$ -V

cooltools call-dots -n 4 \
    -s ../output/dots/$(basename $1)\_10kb_scores.hdf5 \
    -o ../output/dots/$(basename $1)\_10kb_dots.bedpe \
    -v --fdr 0.1 \
    --max-nans-tolerated 7 \
    --max-loci-separation 10000000 \
    --dots-clustering-radius 21000 \
    --tile-size 5000000 \
    $1 $1\_expected.tsv

Output:

Tokenization took: 128.85 ms
Type conversion took: 48.86 ms
Parser memory cleanup took: 0.01 ms
Tokenization took: 105.19 ms
Type conversion took: 33.80 ms
Parser memory cleanup took: 0.01 ms
Tokenization took: 0.72 ms
Type conversion took: 0.78 ms
Parser memory cleanup took: 0.00 ms
../coolers/2i.10000.cool and ../coolers/2i.10000.cool_expected.tsv passed cross-compatibility checks.
Kernels parameters are set as w,p=5,2 for the cooler with 10000 bp resolution.
Preparing to convolve 1566 tiles:
creating a Pool of 4 workers to tackle 1566 tiles
/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/cooltools/dotfinder.py:379: RuntimeWarning: invalid value encountered in true_divide
  Ek_raw = np.multiply(E_raw, np.divide(KO, KE))
/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/cooltools/dotfinder.py:379: RuntimeWarning: invalid value encountered in true_divide
  Ek_raw = np.multiply(E_raw, np.divide(KO, KE))
/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/cooltools/dotfinder.py:379: RuntimeWarning: invalid value encountered in true_divide
  Ek_raw = np.multiply(E_raw, np.divide(KO, KE))
Traceback (most recent call last):
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/bin/cooltools", line 11, in <module>
    load_entry_point('cooltools==0.1.0', 'console_scripts', 'cooltools')()
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/cooltools/cli/call_dots.py", line 537, in call_dots
    nproc, verbose)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/cooltools/cli/call_dots.py", line 138, in scoring_step
    for chunk in chunks:
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/multiprocess/pool.py", line 320, in <genexpr>
    return (item for chunk in result for item in chunk)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/multiprocess/pool.py", line 735, in next
    raise value
multiprocess.pool.MaybeEncodingError: Error sending result:

And this is followed by a long list of what looks like dataframes with information about, perhaps, chunks? The full output is attached. call_dots.o4801378.txt

Similar errors have been reported when the objects passed in multiprocessing are too big, so that's why I tried reducing the chunk size and max separation, but no luck. I can try to make them even smaller, but presumably you tried it with similar values and it worked, so maybe this is not the issue anyway?

sergpolly commented 6 years ago

Could you try:

#!/bin/sh
# Grid Engine options (lines prefixed with #$)
#$ -cwd
#$ -N call_dots
#$ -l h_rt=12:00:00
#$ -l h_vmem=64G
#$ -pe sharedmem 8
#$ -j yes
#$ -V

export MKL_NUM_THREADS=1

cooltools call-dots -n 8 \
    -s ../output/dots/$(basename $1)\_10kb_scores.hdf5 \
    -o ../output/dots/$(basename $1)\_10kb_dots.bedpe \
    -v --fdr 0.1 \
    --max-nans-tolerated 7 \
    --max-loci-separation 2000000 \
    --dots-clustering-radius 21000 \
    --tile-size 2000000 \
    $1 $1\_expected.tsv
Phlya commented 6 years ago

OK, it worked now, thank you! But it didn't actually save a bedpe file, only an hdf5 dump... Maybe when I was struggling with the environment I accidentally reinstalled from the main branch or something? I'll try to reinstall and retry.

Also, do you really think it requires so much memory? This is 64Gb per core, so 64*8=512Gb...

Phlya commented 6 years ago

Ah actually, I checked, and it only used 33Gb total memory.

Phlya commented 6 years ago

Weird - I reinstalled it from shrink-donut-dotfinder and re-ran with exactly the same command, except with reduced resources so it starts running faster, and now I get an error:

Tokenization took: 69.72 ms
Type conversion took: 26.42 ms
Parser memory cleanup took: 0.00 ms
Tokenization took: 102.02 ms
Type conversion took: 27.02 ms
Parser memory cleanup took: 0.01 ms
Tokenization took: 0.56 ms
Type conversion took: 0.61 ms
Parser memory cleanup took: 0.00 ms
../coolers/WT.10000.cool and ../coolers/WT.10000.cool_expected.tsv passed cross-compatibility checks.
Kernels parameters are set as w,p=5,2 for the cooler with 10000 bp resolution.
preparing big tiles ...
preparing small diagonal tiles ...
scoring and histogramming (DEBUG WAY) ...
Preparing to convolve and build histogramms ...
Creating a Pool of 8 workers
Pool of 8 workers is going to tackle 1327 tiles
Pool of 8 workers is going to tackle 2651 tiles
reduce hchunks and diag_hchunks ...
Traceback (most recent call last):
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/bin/cooltools", line 11, in <module>
    sys.exit(cli())
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/cooltools/cli/call_dots.py", line 1354, in call_dots
    nproc, verbose)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/cooltools/cli/call_dots.py", line 651, in scoring_and_histogramming_step
    .format(k,last_la_exp_bin)
AssertionError: There are la_exp.donut.value in (7486.107, inf], please check the histogram

Is it normal the pool of 8 workers is going to tackle some number of tiles twice? And does it have anything to do with the error?

sergpolly commented 6 years ago
Phlya commented 6 years ago

I see, thanks!

Yeah, it ran now with 8*8 Gb (when it gave that error).

OK, maybe it's just because I have fewer contacts than in their data? Anyway, is there some quick fix for this I could implement?

sergpolly commented 6 years ago

so @Phlya, as a temporary solution, let's make this HiCCUPS_W1_MAX_INDX a command-line parameter for now - so that you can adjust it as needed

as a justification, let me share some slides with you: https://docs.google.com/presentation/d/1-hwAS4G5G4LbmmelonE4uubmfd6-V1XzpDI3z72RdZA/edit?usp=sharing

slides 24-30 there are all about the improvements that shrinking_donuts brings - look at the short-range "dots/loops" in the histograms on slides 24-25; slides 26-27 are about what the shrinking donuts actually look like

Phlya commented 6 years ago

Does it automatically exclude the first two diagonals, btw? If not, this could cause very high read counts, I guess...

sergpolly commented 6 years ago

Again, all of that stems from the fact that I was trying to make call_dots as close to HiCCUPS as possible, so I was literally working on one example, i.e. their GM data, using their binned contacts (dumped .hic into .cool) and even their balancing. I meant to start comparing entire pipelines but never got to that point - i.e., take their raw fastqs, map them with distiller, bin the data, balance (an important question here is cis vs global balancing; they do cis-only, because they can't do global ...), call dots and compare again - this stuff is still pending

sergpolly commented 6 years ago

ok, let me do the fix, so you could run the thing at least

Phlya commented 6 years ago

OK, very interesting presentation, thank you! (A LOT of reads you have there, but 51% trans??? I had to ask, sorry)

Yeah, my expected has two diagonals NaN'd, so it won't be that - I saw you had that problem in the presentation, so I checked already.

Thank you, let me know when/how it works!

sergpolly commented 6 years ago

Ok, i pushed this tiny thing : https://github.com/dekkerlab/cooltools/commit/29c70f250844ddf6071020a410f4221a18fd564f

it should help you if you set this parameter to, say, 43 ...

there are more gotchas from HiCCUPS in the shrinking_donuts branch - all of them stem from trying to get closer to HiCCUPS, which implies including their "bugs/weirdness", like this: https://github.com/dekkerlab/cooltools/blob/29c70f250844ddf6071020a410f4221a18fd564f/cooltools/cli/call_dots.py#L1286 and this: https://github.com/dekkerlab/cooltools/blob/29c70f250844ddf6071020a410f4221a18fd564f/cooltools/cli/call_dots.py#L1295

it's the same operation - creating these log-bins for the locally adjusted expected (i.e., your donuts, shmonuts etc. - slide 22 from that keynote) - and I do it twice, once with the factor 2^(1/3) and the other time with the factor 2^0.33 - that's what happens in HiCCUPS: they do it differently in 2 places! (I think I'll report it as an issue to them) - it's not a big deal, but it causes a >1% discrepancy in the final dot-calls ...
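
the drift is easy to see numerically - bin edges grown by a factor of 2^(1/3) vs a factor of 2^0.33 move apart with every bin, which is enough to reassign pixels near bin boundaries:

    import numpy as np

    n_bins = 40
    edges_exact = 2.0 ** (np.arange(n_bins) / 3.0)     # factor 2**(1/3) per bin
    edges_approx = (2.0 ** 0.33) ** np.arange(n_bins)  # factor 2**0.33 per bin
    drift = edges_approx / edges_exact - 1
    print(f"relative drift of the last edge: {drift[-1]:.1%}")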

so, just beware: the shrinking_donuts branch is me trying to get closer to HiCCUPS, to reproduce it - not necessarily to improve upon it - all of that will be rewritten in a sane manner ... at some point

sergpolly commented 6 years ago

oh, 51% trans - that's for H1-ESC cells - I don't know much biology, but that's just how heatmaps look for stem cells, you know ... they give fuzzier heatmaps; I don't really know what happens with the chromosomes

Phlya commented 6 years ago

Thank you! I started it with 43, let's see.

Yeah, I realize you are just trying to reproduce their algorithm as closely as possible for now!

Yeah, must be very different from mouse ES cells then, there it's more like 10-15% trans...

Phlya commented 6 years ago

I feel really stupid here, but I get this:

Traceback (most recent call last):
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/bin/cooltools", line 11, in <module>
    sys.exit(cli())
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
TypeError: call_dots() got an unexpected keyword argument 'hist_max_bins_expected'

I added cooltools call-dots -h before the command, and the new argument is there, so I did install the new version...

sergpolly commented 6 years ago

my bad ... wait

sergpolly commented 6 years ago

https://github.com/dekkerlab/cooltools/commit/377106efadacdfde4c208829634f34329dec8769 i'm stupid - not you

Phlya commented 6 years ago

No worries, thank you! Just need to find the value that works now, 43 still failed...

Phlya commented 6 years ago

Set that argument from above to 50, and got a new problem:

... A few more clustering infos above...

Clustering is completed:
21 clusters detected
1.62+/-0.84 mean size

Traceback (most recent call last):
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/bin/cooltools", line 11, in <module>
    sys.exit(cli())
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/cooltools/cli/call_dots.py", line 1511, in call_dots
    dots_clustering_radius, verbose)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/cooltools/cli/call_dots.py", line 839, in clustering_step_local
    verbose=verbose)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/cooltools/dotfinder.py", line 120, in clust_2D_pixels
    brc.fit(pixels)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/sklearn/cluster/birch.py", line 449, in fit
    return self._fit(X)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/sklearn/cluster/birch.py", line 452, in _fit
    X = check_array(X, accept_sparse='csr', copy=self.copy)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/sklearn/utils/validation.py", line 462, in check_array
    context))
ValueError: Found array with 0 sample(s) (shape=(0, 2)) while a minimum of 1 is required.
sergpolly commented 6 years ago

any chance you have some weird (almost empty, no Hi-C signal) chromosomes in the dataset? Ones that wouldn't have any "dots"? I had this in human data with the chrY chromosome (chrM as well), even in male cells ... In case this is the reason, the way to exclude the weirdos is to remove them from the expected - as described above: https://github.com/mirnylab/cooltools/issues/39#issuecomment-419562784, this paragraph:

chrM and sometimes chrY screw up dot calling - remove them from the expected so that the dot-caller won't use them

Phlya commented 6 years ago

Awesome, it's working!

For future reference, here is the command I used to remove chrY and chrM: grep -v chrM coolers/WT.10000.cool_expected.tsv | grep -v chrY > coolers/WT.10000.cool_expected_nochrYM.tsv

Phlya commented 6 years ago

Why do I get this error with some coolers?..

Ready to concatenate chunks if filtered pixels ...
Traceback (most recent call last):
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/bin/cooltools", line 11, in <module>
    sys.exit(cli())
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/cooltools/cli/call_dots.py", line 1468, in call_dots
    nproc, verbose)
  File "/exports/igmm/eddie/wendy-lab/ilia/condaenvs/dotfinder/lib/python3.6/site-packages/cooltools/cli/call_dots.py", line 775, in scoring_and_extraction_step
    assert ~significant_pixels.duplicated().any()
AssertionError
sergpolly commented 6 years ago

Another side effect of rapid feature-driven development ... We convolve the heatmap with the kernels (donut, lowleft, etc.) twice - once for the whole region up to max-loci-separation, and once for the narrow band around the diagonal for the shrinking donuts - and then merge the results together. I tried to make it so that there was no overlap between the two - there must be only ONE entry for each pixel in the final dataframe of locally adjusted expecteds, i.e. the loc.-adj. expected calculated using shrunken kernels must overwrite the ones calculated using full-sized ones.

The key phrase here is "I tried" ... The thing is, I had this issue as well and thought that it got fixed, but apparently not so much ... I left these assertions throughout the code in order to catch any issues ...

What I can do for now is introduce a flag --production that would ignore these assertions. That overlap is not that big of a deal actually - it's more of a sign to me that there are imperfections throughout the code ... @Phlya what do you think - do you want to go ahead with a little bug inside for now, or wait for a proper fix?

Phlya commented 6 years ago

Well, it sounds like the "bug" doesn't actually affect any of the results, right? I'd be fine with a quick fix then. Except I don't quite understand why it happens with some files and not others?

Phlya commented 6 years ago

Minor issue: in the non-final files there is a weight1 column between the coordinates of the two ends, which complicates plotting based on just the first 6 columns... Plotting is easiest to implement for general bedpe files when all extra columns come after the minimal 6.
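
A possible workaround until that's fixed - reorder the columns so the standard six come first (the file and column names here are hypothetical):

    import pandas as pd

    df = pd.read_table("cloops_sample.10000.bedpe")  # hypothetical file name
    bedpe6 = ["chrom1", "start1", "end1", "chrom2", "start2", "end2"]
    extra = [c for c in df.columns if c not in bedpe6]
    df[bedpe6 + extra].to_csv("fixed.bedpe", sep="\t", index=False)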

gfudenberg commented 5 years ago

it would be great to have an option to pass an adjustment factor to thresholding_step() that allows modifying the threshold required for the enrichment factors, e.g. enrichment_factor_1 = (enrichment_factor_1 - 1) * adjustment_factor + 1
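
as a sketch, with 1.75 as an example threshold (not necessarily the one thresholding_step() uses):

    def adjust_enrichment(enrichment_factor, adjustment_factor):
        # rescale the threshold's distance from 1.0
        return (enrichment_factor - 1) * adjustment_factor + 1

    assert adjust_enrichment(1.75, 1.0) == 1.75   # unchanged
    assert adjust_enrichment(1.75, 0.5) == 1.375  # softened toward 1.0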

gfudenberg commented 5 years ago

it would also be great to report (cluster_start1, cluster_stop1) and (cluster_start2, cluster_stop2) in post-processed calls for each cluster (i.e. the bounding rectangle)
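
assuming a dataframe of cluster-labeled pixels with illustrative column names, the bounding rectangle per cluster is a short pandas aggregation:

    import pandas as pd

    def cluster_bounding_boxes(pixels: pd.DataFrame) -> pd.DataFrame:
        # (cluster_start*, cluster_stop*) = bounding rectangle of each cluster
        return pixels.groupby("cluster_id").agg(
            cluster_start1=("start1", "min"),
            cluster_stop1=("end1", "max"),
            cluster_start2=("start2", "min"),
            cluster_stop2=("end2", "max"),
        ).reset_index()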

xinyangbing commented 4 years ago

Is there a way to adjust the resolution/bin size? I got this working but would like to try a smaller bin size, as I am using Micro-C and I see that call-dots misses lots of loops at finer resolutions.

sergpolly commented 4 years ago

Hey @xinyangbing - you are absolutely right in the sense that the default call_dots CLI interface does not expose a lot of parameters (yet) - currently its parameters are an exact copy of those from HiCCUPS that were applied to the very deep GM12... dataset in Rao et al 2014, which are extremely strict and optimized for 5/10kb ...

At the moment, the best way to adjust any internal parameter (and maybe to skip some filtering steps) is to use the jupyter notebook: https://cooltools.readthedocs.io/en/latest/notebooks/08_dot-calling-internals.html https://github.com/mirnylab/cooltools/blob/master/docs/notebooks/08_dot-calling-internals.ipynb

As far as higher resolutions go, the most important parameters to adjust would be w and p -> the half-width of the "donut" (the convolution kernel, as we call it here) and the half-width of the internal part of the "donut" (target, dot-footprint, etc.) - https://cooltools.readthedocs.io/en/latest/notebooks/08_dot-calling-internals.html#Kernels-used-for-calculating-local-enrichment-of-a-pixel

Here are the sizes that they use in Rao et al 2014 (HiCCUPS):

    25kb: w=3, p=1 ---> donut-width: (2*w+1)*binsize ~170kb; target-width: (2*p+1)*binsize ~70kb
    10kb: w=5, p=2 ---> donut-width: (2*w+1)*binsize ~110kb; target-width: (2*p+1)*binsize ~50kb
    5kb:  w=7, p=4 ---> donut-width: (2*w+1)*binsize ~75kb;  target-width: (2*p+1)*binsize ~45kb

so what would be reasonable w and p for 2kb? We don't have a definitive answer, but perhaps something like w ~13+/-2 and p ~9-10 might be a good start: (2*w+1)*binsize ~54kb and (2*p+1)*binsize ~42kb
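
a tiny helper to reproduce the widths quoted above and probe candidate (w, p) for other bin sizes:

    def kernel_widths(w, p, binsize):
        # full widths of the donut and of its central "target" in bp
        return (2 * w + 1) * binsize, (2 * p + 1) * binsize

    for binsize, w, p in [(25000, 3, 1), (10000, 5, 2), (5000, 7, 4), (2000, 13, 10)]:
        donut, target = kernel_widths(w, p, binsize)
        print(f"{binsize // 1000}kb: w={w}, p={p} -> donut ~{donut // 1000}kb, target ~{target // 1000}kb")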

please let us know if you are willing to give the step-by-step notebook a try - we can provide further assistance with that