vaquerizaslab / fanc

FAN-C: Framework for the ANalysis of C-like data
GNU General Public License v3.0

geometric_mean enabled outputs all NA/inf #135

Open caragraduate opened 1 year ago

caragraduate commented 1 year ago

Hi there,

Thank you for developing this great tool! I recently ran into a problem after following your suggestion: "geometric_mean – Use a geometric mean instead of arithmetic. If using log-transformed, and if you intend to subtract scores from different samples for comparison, this is recommended". After I set geometric_mean = True, almost all of my insulation scores across the genome were returned as NA/inf (only one or two chromosomes have values, and only in a few bins).

Do you have any idea how to fix this issue? When I previously ran with the defaults, without geometric_mean, everything was normal. So I guess this is not about the data itself?

Thank you for any help in advance!

Cara

kaukrise commented 1 year ago

Hi,

this probably happens when you have 0s in your insulation scores, which can arise from unmasked, poorly mappable regions. Can you please try the following version? It omits 0s from the geometric mean calculation.

fanc-0.9.26b2.tar.gz

pip install fanc-0.9.26b2.tar.gz
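To illustrate why a single 0 can turn every score into NA/inf, here is a minimal NumPy sketch. It is not FAN-C's actual code: `geometric_mean` and its `omit_zeros` flag are hypothetical names used only for this example, and the scores are made up.

```python
import numpy as np

def geometric_mean(values, omit_zeros=False):
    """Geometric mean via the mean of logs (hypothetical helper, not FAN-C's API)."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]
    if omit_zeros:
        values = values[values != 0]
    # log(0) is -inf, which silently drags the whole mean to -inf
    with np.errstate(divide="ignore"):
        return float(np.exp(np.mean(np.log(values))))

# Made-up insulation scores for one chromosome; one bin has no contacts.
scores = np.array([0.5, 2.0, 0.0, 4.0])

gm_naive = geometric_mean(scores)                   # 0.0, because of the single 0
gm_fixed = geometric_mean(scores, omit_zeros=True)  # cube root of 0.5 * 2 * 4

# Dividing by a geometric mean of 0 gives inf (x / 0) or NaN (0 / 0) everywhere.
with np.errstate(divide="ignore", invalid="ignore"):
    normalised = np.log2(scores / gm_naive)
```

With the zero left in, no entry of `normalised` is finite, which matches the symptom described in this issue. Omitting zeros restores a usable geometric mean.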

Thanks! Kai

caragraduate commented 1 year ago

Great! I just tried it on one sample and most of the scores look normal now, but I have two follow-up questions. I also enabled impute_missing=True, but some regions still return NA values, which is not what I expected: I thought these NAs would all be imputed. How does the imputation method work in your tool?

My other question: I enabled geometric_mean because I would like to compare scores among samples. However, when I use the insulation score algorithm to detect TAD boundaries in those samples, would you suggest still using the geometric mean, or the arithmetic mean, which I assume is the default?

Thank you! Cara

kaukrise commented 1 year ago

Regarding the NA values - these are probably the values that have previously been 0 in your data, which caused the geometric mean to fail. Imputation works on the matrix values (it simply replaces them with the expected value), not the insulation score. Therefore I think it is still appropriate to have NaN in those places. In general, when working with in situ Hi-C or similar methods, you should never have all-zero bins. When using FAN-C to create these matrices, all-zero bins are masked, but that might not be the case for other tools.

Regarding the geometric mean normalisation: the original definition uses the arithmetic mean to normalise when detecting TAD boundaries, which is why it is the default in FAN-C. However, I think the geometric mean is also suitable as a normalisation in general.
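One way to see the practical difference between the two normalisations, when the scores are log-transformed afterwards, is the small NumPy sketch below. The scores are made up and this is not FAN-C code. It only shows a general property: after dividing by the geometric mean, the log2 scores average exactly 0, whereas dividing by the arithmetic mean leaves an offset that depends on the spread of the scores.

```python
import numpy as np

# Made-up, strictly positive insulation scores for one chromosome.
scores = np.array([0.5, 1.0, 2.0, 8.0])

arithmetic = scores.mean()
geometric = np.exp(np.log(scores).mean())

log_arith = np.log2(scores / arithmetic)
log_geom = np.log2(scores / geometric)

# The geometric mean centres the log2 scores on 0;
# the arithmetic mean leaves a negative, sample-dependent offset.
print(log_geom.mean(), log_arith.mean())
```

If two samples are both centred on 0 in this way, subtracting their log-transformed scores compares them against the same baseline.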

caragraduate commented 1 year ago

> Regarding the NA values - these are probably the values that have previously been 0 in your data, which caused the geometric mean to fail. Imputation works on the matrix values (it simply replaces them with the expected value), not the insulation score. Therefore I think it is still appropriate to have NaN in those places. In general, when working with in situ Hi-C or similar methods, you should never have all-zero bins. When using FAN-C to create these matrices, all-zero bins are masked, but that might not be the case for other tools.

Thank you for your detailed explanation. May I confirm two points? 1) I used your latest version (fanc-0.9.26b2.tar.gz), in which you said 0s are omitted from the geometric mean calculation. Why are there still NA values, in other words, why does the geometric mean still fail? 2) I understand that imputation works on the matrix values, not on the insulation score itself. Which values in the .hic matrix are imputed? You mentioned that imputation replaces missing values in the matrix with the expected value; how is the expected value calculated? And if some bins still have NaN, does that mean the insulation scores for those bins are 0 even after imputation? (This may come down to how the geometric mean normalisation is done in your tool.) Thank you in advance for any clarification!

> Regarding the geometric mean normalisation: the original definition uses the arithmetic mean to normalise when detecting TAD boundaries, which is why it is the default in FAN-C. However, I think the geometric mean is also suitable as a normalisation in general.

This made things much clearer; I appreciate it!

kaukrise commented 1 year ago

First of all, to clarify: FAN-C imputes missing values, not 0s.

The order of operations when using imputation and geometric mean is:

1) Calculation of expected values (= mean of interactions at the same distance on the same chromosome)
2) Replacement of missing values (NaN) in the Hi-C matrix by their respective expected value
3) Calculation of insulation score for each bin - this might result in 0s if there are no interactions in the "insulation square"
4) In case of geometric mean: replacement of 0s in insulation score with NaN, as otherwise the geometric mean will equate to 0
5) Calculation of the chromosomal geometric mean of insulation scores
6) Division of each insulation score by the geometric mean
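The six steps above can be sketched on a toy matrix with NumPy. This is an illustration under stated assumptions, not FAN-C's implementation: the function names are hypothetical, the matrix is made up, the "insulation square" is simplified to the mean of a `window` x `window` block between the upstream and downstream bins, and the whole matrix is treated as a single chromosome.

```python
import numpy as np

def impute_with_expected(matrix):
    """Steps 1-2: replace NaN by the mean of its diagonal (same distance)."""
    m = np.array(matrix, dtype=float)
    n = m.shape[0]
    for d in range(n):
        diag = np.diagonal(m, offset=d)
        if np.all(np.isnan(diag)):
            continue  # nothing observed at this distance, leave as NaN
        expected = np.nanmean(diag)
        for i in range(n - d):
            if np.isnan(m[i, i + d]):
                m[i, i + d] = m[i + d, i] = expected  # keep the matrix symmetric
    return m

def insulation_scores(matrix, window):
    """Step 3: mean contact strength in the square spanning each bin boundary."""
    n = matrix.shape[0]
    scores = np.full(n, np.nan)  # bins too close to the edge stay NaN
    for i in range(window, n - window + 1):
        scores[i] = matrix[i - window:i, i:i + window].mean()
    return scores

def normalise_geometric(scores):
    """Steps 4-6: mask 0s, take the geometric mean, divide."""
    s = scores.copy()
    s[s == 0] = np.nan                        # step 4
    gm = np.exp(np.nanmean(np.log(s)))        # step 5
    return s / gm                             # step 6

# Toy symmetric matrix: one missing value (NaN) and one true 0 next to the diagonal.
matrix = np.array([
    [8.0, np.nan, 1.0, 0.0, 0.0],
    [np.nan, 8.0, 4.0, 1.0, 0.0],
    [1.0, 4.0, 8.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 8.0, 2.0],
    [0.0, 0.0, 1.0, 2.0, 8.0],
])

imputed = impute_with_expected(matrix)   # NaN at (0, 1) becomes 2.0, the mean of 4, 0 and 2
scores = insulation_scores(imputed, window=1)
normalised = normalise_geometric(scores)
print(normalised)
```

In this toy example the missing value is imputed, but the true 0 at position (2, 3) is a measured value, so it is left alone. It produces an insulation score of 0 for bin 3, which step 4 turns into NaN. This is why NaN can remain in the output even with imputation enabled.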

For what it's worth, I'd recommend against imputation - at best it is a guess about Hi-C interaction values in a specific region, at worst it can be actively misleading.