**Closed** — sergpolly closed this issue 3 years ago
Since it's mentioned here, I just want to remind that it would be great to store expected directly in the cooler: then there is no need to keep separate files for the same data (and it removes the potential for mis-associating files). And I think there should then be a method to directly get expected for a region as a matrix.
(The same goes for other things, like eigenvectors, potentially.)
@nvictus you've mentioned before that for deeply sequenced data we'd need a more stringent balancing convergence policy, right (tolerance, rather)? Or should it even be made "dynamic", i.e. based on how deep the data is ...
Could you comment on it here (we discussed it in hangouts, but I cannot find it by scrolling the chat)? What are the rules of thumb, approximately: should we go tol ~ 1e-10 instead of 1e-5 for deeper data, or is it more like tol = 1000/depth?
Do you think that the numerical instability I see in the example above is most likely due to a forced re-balancing with tol=1e-5? I realized myself that text/csv is kind of a separate issue (also important), and would not explain the discrepancy in the diff.
...
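To make the tolerance question concrete, here is a toy ICE-style balancing sketch (NOT cooler's actual implementation; the stopping rule and the toy matrix are assumptions) showing that the resulting weights keep drifting as the convergence tolerance is tightened:

```python
import numpy as np

def toy_ice_balance(mat, tol, max_iter=10000):
    """Toy iterative-correction balancing (a sketch, not cooler's code).

    Iterates until the variance of the normalized marginals drops below `tol`.
    """
    bias = np.ones(mat.shape[0])
    for _ in range(max_iter):
        w = mat / np.outer(bias, bias)
        marg = w.sum(axis=1)
        marg /= marg.mean()
        bias *= marg
        if np.var(marg) < tol:
            break
    return 1.0 / bias  # multiplicative weights, cooler-style

rng = np.random.default_rng(0)
m = rng.random((50, 50))
m = m + m.T  # symmetric toy "contact matrix"

w_loose = toy_ice_balance(m, tol=1e-5)
w_tight = toy_ice_balance(m, tol=1e-12)
# the looser tolerance stops earlier, so the two weight vectors differ slightly
print(np.abs(w_loose - w_tight).max())
```

The practical takeaway is that "balanced" weights are only defined up to the chosen tolerance, so any downstream quantity computed from them inherits that ambiguity.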
@Phlya, (1) storing expected in cooler files is definitely on the short-ish-term todo list; I created a separate ticket. (2) Expected for a region of a matrix is also an urgent todo item, as soon as we agree on this.
@sergpolly, I converted your question on tolerance into a GitHub issue: https://github.com/mirnylab/cooler/issues/131
I did some more testing, and can elaborate on the symptoms. Balance a cooler several times, like so:

```
cooler balance -p 8 --ignore-diags 1 --name W1 --force COOLER
```

changing the name: `W1`, `W2`, `W3`, ...

At this point the balancing weights are very close, within numerical precision ~1e-15:

```
np.isclose(W1, W2, atol=1e-15, rtol=0, equal_nan=True).all()
```

would yield `True` for some weight-vector combinations and `False` for others, while

```
np.isclose(W1, W2, atol=1e-14, rtol=0, equal_nan=True).all()
```

would give `True` for all combinations. Sounds negligible, and is at the limit of machine epsilon for `float64`.
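To illustrate that `atol` sensitivity (with simulated weight vectors, since the actual cooler columns aren't attached here):

```python
import numpy as np

rng = np.random.default_rng(42)
w1 = rng.random(100)                             # stand-in for one weight column
w2 = w1 + rng.uniform(-5e-15, 5e-15, w1.size)    # a second "run", jittered at ~1e-15

# identical to within 1e-14, but not to within 1e-15
print(np.isclose(w1, w2, atol=1e-15, rtol=0).all())  # False
print(np.isclose(w1, w2, atol=1e-14, rtol=0).all())  # True
```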
However tiny that seems, if we try to compute expected using those different `weight`s:

```
cooltools compute_expected -p 8 --drop-diags 1 --weight-name W1 COOLER > X1
```

we get different results for the expecteds `X1`, `X2`, `X3`, ..., where the difference sometimes reaches ~1e-13 (didn't check the max here). Again, not a lot, but it is something that gets reflected in the output text file - see the very first picture.

So, these are the objective symptoms, as far as `cooler` and `expected` are concerned.
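A back-of-the-envelope sketch of how ~1e-15 weight jitter survives into the per-diagonal expected (toy data; the averaging below only mimics the general shape of the computation, not cooltools' actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
counts = rng.poisson(20, size=(n, n)).astype(float)
counts = counts + counts.T                       # symmetric toy contact matrix

w1 = rng.random(n)
w2 = w1 * (1 + rng.uniform(-1e-15, 1e-15, n))    # weights from a "re-balanced" run

def diag_expected(counts, w):
    # mean of balanced counts along each diagonal (schematic)
    balanced = w[:, None] * w[None, :] * counts
    return np.array([np.mean(np.diagonal(balanced, k)) for k in range(1, n)])

diff = np.abs(diag_expected(counts, w1) - diag_expected(counts, w2)).max()
print(diff)  # tiny but nonzero: the weight jitter is not averaged away
```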
TODO: double check how such a tiny discrepancy leads to a ~1% discrepancy in the dot calling ...
To be even more precise, here are the snakefiles:
```
nums = [1, 2, 3, 4]

wildcard_constraints:
    index="\d+"

rule all:
    input:
        "debug2/diff.stats"

rule compute_expected:
    input:
        "your_input.cool"
    output:
        "debug2/x{index}.expected"
    threads: 8
    shell:
        "cooltools compute_expected -p {threads} --weight-name w1 --drop-diags 1 {input} > {output}"

rule get_expected_diff:
    input:
        expand("debug2/x{index}.expected", index=nums)
    output:
        "debug2/diff.stats"
    shell:
        "for xi in {input}; do"
        "  for xj in {input}; do"
        "    echo \"$xi $xj\" >> {output};"
        "    diff $xi $xj | wc -l >> {output};"
        "  done;"
        "done;"
```
This one runs `compute_expected` a bunch of times without re-balancing the cooler, using a single `weights` column - `w1` in this case. Here `diff.stats` is all zeroes, i.e. the expected is exactly the same (as printed into a text file).
```
nums = [1, 2, 3, 4]

wildcard_constraints:
    index="\d+"

rule all:
    input:
        "debug/diff.stats"

# make sure you run this rule one at a time, as HDF5 does not permit several
# simultaneous balancing executions ...
rule balance:
    input:
        ancient("your_input.cool")
    output:
        touch("debug/w{index}.touch")
    threads: 18
    shell:
        "cooler balance -p {threads} --ignore-diags 1 --force --name w{wildcards.index} {input}"

rule compute_expected:
    input:
        cool = ancient("your_input.cool"),
        just_link = "debug/w{index}.touch"
    output:
        "debug/x{index}.expected"
    threads: 8
    shell:
        "cooltools compute_expected -p {threads} --weight-name w{wildcards.index} --drop-diags 1 {input.cool} > {output}"

rule get_expected_diff:
    input:
        expand("debug/x{index}.expected", index=nums)
    output:
        "debug/diff.stats"
    shell:
        "for xi in {input}; do"
        "  for xj in {input}; do"
        "    echo \"$xi $xj\" >> {output};"
        "    diff $xi $xj | wc -l >> {output};"
        "  done;"
        "done;"
```
This one does both the re-balancing and the re-computing of expected a bunch of times, and in this case `diff.stats` isn't trivial, i.e. there are some discrepancies (sort of like in the very first comment), which we know are numerically almost negligible, but which apparently affect dot-calling (to be tested more rigorously).
TODO: does `--tol` affect any of that?! Maybe quantitatively?! Test it rigorously.

`1e+9 + 1e-9 - 1e+9 != 1e+9 - 1e+9 + 1e-9`
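That inequality is floating-point non-associativity in action: in float64, 1e-9 is far below the ulp of 1e+9 (~1.2e-7), so it vanishes when added to the big number first:

```python
# float64 addition is not associative: the order of operations changes the result
a = (1e+9 + 1e-9) - 1e+9   # 1e-9 is below ulp(1e+9), so it is rounded away
b = (1e+9 - 1e+9) + 1e-9   # the big terms cancel first, so 1e-9 survives
print(a, b)    # 0.0 1e-09
print(a == b)  # False
```

This is why summing the same balanced pixels in a slightly different order (e.g. across worker chunks) can shift the result by a few ulps.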
PS: all of this may seem useless for the public, but +/- ~200 (out of ~20000) called dots seems relatively important to me. It might also set reasonable reproducibility limits for the HiCCUPS reimplementation, help explain some discrepancies I've seen so far, and has educational value.
@sergpolly - what's your take on this - is it still an issue?
I guess that was an argument in favor of storing expected as binary.
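A quick illustration of that point (toy numbers; real expected tables would be compared column-wise): a textual `diff` on the `.expected` files flags differences that a numerical comparison with a sensible tolerance would ignore:

```python
import numpy as np

x1 = np.array([0.123456789012345, 2.5, 10.0])
x2 = x1 + 1e-13                      # the kind of discrepancy seen between runs

# as text (what `diff` on the .expected files sees): the lines differ
lines1 = [repr(v) for v in x1]
lines2 = [repr(v) for v in x2]
print(lines1 == lines2)              # False

# as binary floats with an explicit tolerance: effectively identical
print(np.allclose(x1, x2, rtol=0, atol=1e-9))  # True
```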
Not a major issue, but something to be aware of, and also an argument for including `expected` into `cooler`: running `cooltools compute_expected` several times on the same input - or, to be more precise, re-balancing and re-computing expected - leads to some minor discrepancy in the output of the expected. The problem is that downstream analyses - `call_dots`, for instance - are rather sensitive to that sort of change, which might lead to a ~1% discrepancy in the number of called dots. So, `expected`