open2c / cooltools

The tools for your .cool's
MIT License

numerical instability: csv/text intermediates, cooler re-balancing #35

Closed sergpolly closed 3 years ago

sergpolly commented 6 years ago

Not a major issue, but something to be aware of, and also an argument for including expected in the cooler itself: running cooltools compute_expected several times on the same input - or, to be more precise, re-balancing and then re-computing expected - leads to some minor discrepancy in the expected output:

[screenshot from 2018-09-06: diff between two expected text files]

The problem is that downstream analyses - call_dots, for instance - are rather sensitive to that sort of change, which might lead to a ~1% discrepancy in the number of called dots.

So,

  1. consider binary storage options for expected
  2. consider balancing deep data with a more stringent precision requirement
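To make the first point concrete, here is a small sketch (synthetic data, not the actual cooler/cooltools storage code) of how a fixed-precision text/CSV intermediate silently loses float64 digits, while a binary round-trip is lossless:

```python
# Hypothetical illustration: a text/CSV intermediate with a typical
# fixed-precision format truncates float64 values; a binary round-trip
# (np.save/np.load) preserves them bit-for-bit.
import io

import numpy as np

rng = np.random.default_rng(0)
w = rng.random(1000)  # stand-in for a column of balancing weights

# Text round-trip at 6 significant digits (CSV-style formatting):
w_txt = np.array([float("%.6g" % x) for x in w])

# Binary round-trip:
buf = io.BytesIO()
np.save(buf, w)
buf.seek(0)
w_bin = np.load(buf)

print((w == w_bin).all())               # True: binary is lossless
print(np.abs(w - w_txt).max() > 1e-13)  # True: text dropped digits
```

The discrepancies introduced by the text format here are far larger than the ~1e-13 differences discussed below, which is why binary storage removes one whole class of instability.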
Phlya commented 6 years ago

Since it's mentioned here, I just want to reiterate that it would be great to store expected directly in the cooler; then there is no need to keep separate files for the same data (and it removes the potential for mis-associating files). And I think there should then be a method to directly get expected for a region as a matrix. (The same goes for other things, like eigenvectors, potentially.)
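A sketch of what such a "expected for a region as a matrix" method might do for cis data, assuming expected is stored as one value per diagonal (the function name and layout here are hypothetical, not an existing cooltools API):

```python
# Hypothetical sketch: tile a 1D expected-by-distance vector into a dense
# (n_bins, n_bins) expected matrix for a cis region.
import numpy as np

def expected_matrix(exp_by_diag, n_bins):
    """Place exp_by_diag[d] on every element at genomic distance d bins."""
    idx = np.abs(np.arange(n_bins)[:, None] - np.arange(n_bins)[None, :])
    return exp_by_diag[idx]

exp = np.array([1.0, 0.5, 0.25, 0.125])  # toy expected for diagonals 0..3
m = expected_matrix(exp, 4)
print(m[0, 2])  # 0.25 -> the expected value at distance 2
```

With expected stored inside the cooler, such a method could pull the vector straight from the file instead of a loose text table.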

sergpolly commented 6 years ago

@nvictus you've mentioned before that for deeply sequenced data we'd need a more stringent balancing convergence tolerance, right? Or should it even be made "dynamic", i.e. based on how deep the data is ...

Could you comment on it here (we discussed it in hangouts, but I cannot find it by scrolling the chat)? What are the rules of thumb, approximately - should we go tol~1e-10 instead of 1e-5 for deeper data? Or is it something like tol=1000/depth?

Do you think the numerical instability I see in the example above is most likely due to a forced re-balancing with tol=1e-5? I realize now that text/csv is a separate issue (also important) and would not explain the discrepancy in the diff ...

golobor commented 6 years ago

@Phlya , (1) storing expected in cooler files is definitely on the short-ish term todo list; i created a separate ticket. (2) expected for a region of a matrix is also an urgent todo item, as soon as we agree on this.

golobor commented 6 years ago

@sergpolly , i converted your question on tolerance into a github issue: https://github.com/mirnylab/cooler/issues/131

sergpolly commented 6 years ago

I did some more testing and can elaborate on the symptoms. Balance a cooler several times, like so:

cooler balance -p 8 --ignore-diags 1 --name W1 --force  COOLER

changing the name each time - W1, W2, W3, ... At this point the balancing weights are very close - within numerical precision ~1e-15: np.isclose(W1,W2,atol=1e-15,rtol=0,equal_nan=True).all() yields True for some weight-vector combinations and False for others, while np.isclose(W1,W2,atol=1e-14,rtol=0,equal_nan=True).all() gives True for all combinations. That sounds negligible, at the limit of machine epsilon for float64. However tiny it seems, if we try to compute expected using those different weights:

cooltools compute_expected -p 8 --drop-diags 1 --weight-name W1 COOLER > X1

We get different results for the expecteds X1, X2, X3, ..., where the difference sometimes reaches ~1e-13 (I didn't check the max here). Again, not a lot, but it is something that gets reflected in the output text file - see the screenshot in the first comment.

So, these are the objective symptoms as far as cooler and expected are concerned. TODO: double-check how such a tiny discrepancy leads to a ~1% discrepancy in dot calling ...
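For reference, the tolerance pattern described above can be reproduced on synthetic vectors (random stand-ins, not real balancing weights): two vectors that agree to within a few 1e-15 fail the atol=1e-15 check but pass at atol=1e-14.

```python
# Synthetic reproduction of the np.isclose comparison above:
# jitter one vector by up to ~5e-15 and compare at two tolerances.
import numpy as np

rng = np.random.default_rng(42)
w1 = rng.random(10_000)
w2 = w1 + rng.uniform(-5e-15, 5e-15, size=w1.size)

print(np.isclose(w1, w2, atol=1e-15, rtol=0, equal_nan=True).all())  # False
print(np.isclose(w1, w2, atol=1e-14, rtol=0, equal_nan=True).all())  # True
```

Note that with rtol=0 this is a pure absolute comparison, so values near 1.0 are being compared at roughly 5-50 ulps of float64, exactly the regime where two balancing runs can legitimately disagree.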

sergpolly commented 6 years ago

To be even more precise, here are the snakefiles:

nums = [1,2,3,4]
wildcard_constraints:
    index=r"\d+"

rule all:
    input:
        "debug2/diff.stats"

rule compute_expected:
    input:
        "your_input.cool"
    output:
        "debug2/x{index}.expected"
    threads: 8
    shell:
        "cooltools compute_expected -p {threads} --weight-name w1 --drop-diags 1 {input} > {output}"

rule get_expected_diff:
    input:
        expand("debug2/x{index}.expected",index=nums)
    output:
        "debug2/diff.stats"
    shell:
        "for xi in {input}; do"
        "    for xj in {input}; do"
        "        echo \"$xi $xj\" >> {output};"
        "        diff $xi $xj | wc -l  >> {output};"
        "    done;"
        "done;"

This one runs compute_expected a bunch of times without re-balancing the cooler, using a single weights column - w1 in this case. Here diff.stats is all zeroes, i.e. the expected is exactly the same (as printed into a text file).
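As a possible follow-up, the outputs could also be compared numerically rather than by line-count diff, which would quantify the discrepancy instead of just counting differing lines. A sketch with pandas on stand-in data (the table layout and column name here are placeholders, not necessarily the real expected header):

```python
# Hypothetical numeric comparison of two expected text files that differ
# only in their last digits, in place of `diff | wc -l`.
import io

import numpy as np
import pandas as pd

# Stand-ins for two expected files produced with different weight columns:
x1 = io.StringIO("diag\tbalanced.avg\n1\t0.123456789012345\n2\t0.05\n")
x2 = io.StringIO("diag\tbalanced.avg\n1\t0.123456789012445\n2\t0.05\n")

a = pd.read_table(x1)
b = pd.read_table(x2)
max_abs = (a["balanced.avg"] - b["balanced.avg"]).abs().max()
print(max_abs)  # ~1e-13, the scale of discrepancy reported above
```

This would make it easy to assert a tolerance (e.g. max_abs < 1e-12) instead of requiring byte-identical text.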

nums=[1,2,3,4]
wildcard_constraints:
    index=r"\d+"

rule all:
    input:
        "debug/diff.stats"

# make sure you run this rule one job at a time, as HDF5 does not permit several
# simultaneous balancing executions ...
rule balance:
    input:
        ancient("your_input.cool")
    output:
        touch("debug/w{index}.touch")
    threads: 18
    shell:
        "cooler balance -p {threads} --ignore-diags 1 --force --name w{wildcards.index} {input}"

rule compute_expected:
    input:
        cool = ancient("your_input.cool"),
        just_link = "debug/w{index}.touch"
    output:
        "debug/x{index}.expected"
    threads: 8
    shell:
        "cooltools compute_expected -p {threads} --weight-name w{wildcards.index} --drop-diags 1 {input.cool} > {output}"

rule get_expected_diff:
    input:
        expand("debug/x{index}.expected",index=nums)
    output:
        "debug/diff.stats"
    shell:
        "for xi in {input}; do"
        "    for xj in {input}; do"
        "        echo \"$xi $xj\" >> {output};"
        "        diff $xi $xj | wc -l  >> {output};"
        "    done; "
        "done;"

This one does both the re-balancing and the re-computing of expected a bunch of times, and in this case diff.stats isn't trivial, i.e. there are some discrepancies (like in the very first comment). We know these are numerically almost negligible, but they apparently affect dot-calling (to be tested more rigorously).
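As a toy illustration of why a tiny perturbation can still move a dot count (purely illustrative, not the actual call_dots logic): any hard threshold flips exactly those calls whose test statistic lands within the perturbation of the cutoff, so the flip count scales with the perturbation size and with how many scores sit near the threshold.

```python
# Toy sketch: perturbing scores by +/- eps flips roughly N*eps/2 calls
# for uniformly spread scores (eps is exaggerated here for visibility;
# the real effect at ~1e-13 depends on how call_dots statistics cluster
# near its thresholds).
import numpy as np

rng = np.random.default_rng(1)
scores = rng.random(20_000)  # stand-in for per-pixel test statistics
cutoff = 0.5
eps = 1e-3                   # exaggerated perturbation

calls_a = scores > cutoff
calls_b = scores + rng.uniform(-eps, eps, scores.size) > cutoff
flipped = (calls_a != calls_b).sum()
print(flipped)  # on the order of N*eps/2, i.e. ~10 here
```

In call_dots the statistics are not uniform, and multiple-testing corrections can amplify borderline flips, which could plausibly turn a ~1e-13 input change into the ~1% (+/- ~200 of ~20000 dots) discrepancy seen here.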

TODO:

PS: all of that may seem useless for the public, but +/- ~200 (out of ~20000) called dots seems relatively important to me. It might also set reasonable reproducibility limits for the HiCCUPS reimplementation and help explain some discrepancies I've seen so far - and it has educational value.

golobor commented 3 years ago

@sergpolly - what's your take on this - is it still an issue?

sergpolly commented 3 years ago

I guess that was an argument in favor of storing expected as binary