open2c / cooltools

divide by 2 issue in cooltools.coverage #255

gfudenberg closed this issue 3 years ago

gfudenberg commented 3 years ago

if pixels are upper-triangular, then dividing by 2 should not be required, i.e. we should remove:

covs = covs / 2


sergpolly commented 3 years ago

that bothered me some time ago, but I was convinced this was correct: https://github.com/open2c/cooltools/issues/99#issuecomment-543974052

I'll try to dig deeper into it again

sergpolly commented 3 years ago

so yeah, I think dividing by 2 is the correct way, because those bincounts are indeed double-counting every pixel: [screenshot: matrix diagram illustrating the double counting]

E.g., that pixel [2,3] in the figure will "participate" in summing up the counts for row 2 AND row 3, so you'll count it twice ... So, to keep sum(coverage) the same as the total number of pairs/ligations that went into creating this matrix, every pixel should contribute 0.5*count to each row instead of count.

PS: sum(coverage) wouldn't match the total number of "pairs" exactly, because we ignore diagonals, etc.
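
For concreteness, here is a minimal numpy sketch of that double counting on a toy upper-triangular pixel table (made-up data, not the actual cooltools code):

```python
import numpy as np

# Toy upper-triangular pixel table: pixels (0,1), (1,2), (2,3), one count each.
bin1_id = np.array([0, 1, 2])
bin2_id = np.array([1, 2, 3])
counts = np.array([1, 1, 1])
n_bins = 4

# Each pixel contributes once via its row and once via its column,
# so summing the two bincounts counts every pixel twice.
cov = np.bincount(bin1_id, weights=counts, minlength=n_bins)
cov += np.bincount(bin2_id, weights=counts, minlength=n_bins)

print(cov.sum(), counts.sum())        # 6.0 vs 3: every pixel counted twice
print((cov / 2).sum(), counts.sum())  # 3.0 vs 3: matches after dividing by 2
```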

sergpolly commented 3 years ago

That bothered me so much that I made some slides about it: https://github.com/open2c/cooltools/issues/99#issuecomment-569380032

Check slides 20, 21, ... for a demonstration of why every pixel ends up contributing 2x to the coverage.

agalitsyna commented 3 years ago

Hi, Sergey @sergpolly, I think I understand your point. You mean that if we sum up all the coverages without dividing by 2, we will count the upper triangular matrix twice, and it won't equal the total number of contacts in the matrix. This is true, and slides 20-21 are actually an excellent illustration of the point! I never thought of such a beautiful Hi-C binning representation before.

Then my question is: when do you actually need to sum up all the coverages in Hi-C analysis? If I want the total number of contacts, I take them directly from the sparse matrix. If I want the bin coverage, I usually mean the total number of pairs in which each region participates. In the absence of the main diagonal, the answer is straightforward:

covs[1] += np.bincount(pixels["bin1_id"], weights=pixel_weights, minlength=n_bins)
covs[1] += np.bincount(pixels["bin2_id"], weights=pixel_weights, minlength=n_bins)

I believe this is the coverage definition that @gfudenberg assumed here as well. If you introduce division by 2, then the coverage definition is something else. I struggle to formulate it; maybe you can do it better to clarify your point? @sergpolly

Maybe we can have two coverage definitions in cooltools then. I definitely encounter the "total number of pairs in which each region participates" definition more frequently.
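
To make the two candidate definitions concrete, here is a hypothetical sketch (the function and its per_end flag are invented for illustration, not the cooltools API):

```python
import numpy as np

def bin_coverage(bin1_id, bin2_id, counts, n_bins, per_end=True):
    # Per-bin coverage from an upper-triangular pixel list.
    # per_end=True:  number of pair ends per bin; sums to 2x the total counts.
    # per_end=False: divided by 2, so that sum(coverage) == total counts.
    cov = np.bincount(bin1_id, weights=counts, minlength=n_bins)
    cov += np.bincount(bin2_id, weights=counts, minlength=n_bins)
    return cov if per_end else cov / 2
```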

sergpolly commented 3 years ago

Oh, the definitions ... yeah, it gets confusing quickly this way. I guess the coverage (or depth of coverage per locus) that tries to keep sum(coverage) == total number of detected interactions is more in line with the same definition for 1D sequencing experiments (ChIP-seq, WGS, etc.), where you're calculating how many times you've sequenced over a given nucleotide in your reference with your reads. So the definition should have been "the total number of nucleotides that were aligned to a genomic bin" ... But we don't calculate nucleotides (read length isn't a parameter), and moreover with split alignments it would get even more complicated ... and it's inaccessible from plain pairs, unless you keep nucleotides_aligned available (and we're talking about coolers here, so it's inaccessible).

With cov/2 right now it's more like "the total number of unique alignments per genomic bin", in the sense that each pair typically gives you 2 unique alignments ... but that reads even more complicated. Sorry, that's actually more like cov itself.

Yeah ... with nucleotides it is easy to justify cov/2: there, a 100 bp paired-end read contributes 50 bp to one bin and 50 bp to another, so half-read or half-pair it is. Sticking to just pairs, alignments, ligations, etc., it is hard to define/justify cov/2 without saying half-read, half-pair, etc. Maybe: the number of "contributions" of Hi-C interactions per genomic bin, which sums up to the total number of interactions genome-wide.

Maybe indeed we could switch to cov, but then specify explicitly that such coverage sums up to 2x the total interactions, i.e. that it takes each interaction into account twice.

gfudenberg commented 3 years ago

coming in a bit late to this discussion, so apologies if I miss any of the points above!

I think switching to cov makes sense, b/c it's a bin-level summary statistic.

This toy cooler has two reads, one intra-chromosomal and one inter-chromosomal: (1,4) and (4,6). On the viewframe_update branch, bin1 has coverage=1, bin4 has coverage=2, and bin6 has coverage=1. This makes sense!

The current code looks like it still needs to be modified for symmetric_upper=False, however. This is because in the pixel table a diagonal entry will only appear once, even though it has two ends. See this gist: https://gist.github.com/gfudenberg/3fce48b0bde5d76858129bd742f6b796. Bin2 should have coverage=2 in this example, but with the symmetric_upper=False matrix the result is different.
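
One possible way to handle the square case, assuming that with symmetric_upper=False every off-diagonal interaction is stored as both (i, j) and (j, i) while diagonal pixels appear once (an illustrative sketch, not the actual cooltools fix):

```python
import numpy as np

def coverage_square(bin1_id, bin2_id, counts, n_bins):
    # Off-diagonal interactions appear twice in a square pixel table,
    # so a single bincount over bin1_id already counts each end once.
    cov = np.bincount(bin1_id, weights=counts, minlength=n_bins)
    # Diagonal pixels appear only once yet have both ends in the same bin,
    # so they have to be counted a second time.
    diag = bin1_id == bin2_id
    cov += np.bincount(bin1_id[diag], weights=counts[diag], minlength=n_bins)
    return cov
```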

A couple of other points/questions:

-- I think pixel_weight_key is a slightly confusing name. Can we switch to pixel_count_key, since this function is dealing with counts? Also, can we pass this to get_coverage, in case the user has their counts saved with a funny column name?

-- I agree the docstring could be clarified by mentioning bins somewhere. How about: "Calculate 1D profiles of the sum per bin for cis and genome-wide contacts (aka coverage aka marginals) from a sparse Hi-C contact map in Cooler HDF5 format."

-- Do you think it makes sense to /2 or //2? Typically we'd be interested in count coverage, but perhaps /2 is more flexible even if it converts to float.

sergpolly commented 3 years ago

@gfudenberg the original argument for cov/2, I believe, is the following:

> This toy cooler has two reads,

so the sum(coverage) has to be == 2 (similar to other 1D sequencing techniques like ChIP-seq, WGS, etc.), but if you use cov, your sum(coverage) == 4:

> bin1 has coverage=1, bin4 has coverage=2, and bin6 has coverage=1.

This argument makes the most sense when, instead of the number of pairs/reads, one switches to nucleotides/basepairs: i.e., if you were able to align ~500M bp to your reference, then sum(coverage) should be 500M. With Hi-C pairs, an ideal 100 bp paired-end read contributes 50 bp to one place in the genome and the other 50 bp to some other place, hence your 0.5 pair here and 0.5 pair there (of course it sometimes gets more complicated than 50/50).

> -- I agree the docstring could be clarified by mentioning bins somewhere. How about: "Calculate 1D profiles of the sum per bin for cis and genome-wide contacts (aka coverage aka marginals) from a sparse Hi-C contact map in Cooler HDF5 format."

Yeah, the docstring refers to sums ... I guess that should be "coverage" if we stick to cov/2, or we indeed switch to cov and just explain the double-counting situation.

> -- Do you think it makes sense to /2 or //2? Typically we'd be interested in count coverage, but perhaps /2 is more flexible even if it converts to float.

In what context? In general or for some special case? In general, /2 for sure; otherwise we'll start "losing" coverage, which would defeat the purpose of sum(coverage) == total number of interactions.
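
A quick toy illustration of the "losing coverage" point (made-up numbers):

```python
import numpy as np

cov = np.array([1, 3, 2])  # per-end coverage; sums to 6, i.e. 3 pairs
print((cov / 2).sum())     # 3.0: still accounts for all 3 pairs
print((cov // 2).sum())    # 2: floor division silently drops half-counts
```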

sergpolly commented 3 years ago

docstring updated, no more cov/2, no more square special casing