Some issues with calculateFraction and calculatePvalue

lorenzoamir commented 2 months ago

Hi, I was trying to compare some WGCNA objects and I believe I noticed a few issues in the comparison.

Issues in calculateFraction:

The fractions are not really fractions, since they get converted to percentages. With the method named calculateFraction, this is a bit confusing.
The diagonal of the fraction matrix is set to 1 instead of 100, so it is inconsistent with the rest of the matrix, which holds percentages rather than fractions.
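
To illustrate why a 0-1 scale makes the diagonal consistent, here is a minimal sketch. It assumes that the entry for a pair of modules is the share of the first module's genes that also appear in the second; the thread does not spell out PyWGCNA's exact definition, and `overlap_fraction` and the module names are hypothetical.

```python
def overlap_fraction(modules):
    """Return {a: {b: share of a's genes also found in b}} on a 0-1 scale.

    Hypothetical helper, not PyWGCNA's implementation.
    `modules` maps a module name to a non-empty set of gene names.
    """
    table = {}
    for a, genes_a in modules.items():
        table[a] = {}
        for b, genes_b in modules.items():
            shared = genes_a & genes_b
            # No multiplication by 100: the value stays a true fraction.
            table[a][b] = len(shared) / len(genes_a)
    return table


modules = {
    "net1_blue": {"g1", "g2", "g3", "g4"},
    "net2_red": {"g3", "g4", "g5"},
}
fractions = overlap_fraction(modules)
print(fractions["net1_blue"]["net2_red"])   # 0.5
print(fractions["net1_blue"]["net1_blue"])  # 1.0
```

On this scale a module compared with itself comes out as 1.0 without special handling, so a diagonal of 1 only looks wrong when the other entries are multiplied by 100.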

Issues in calculatePvalue:

The function fisher_exact is called with alternative='two-sided'. This means that low p-values are obtained not just for modules with significant overlap, but also for modules with strong mutual exclusivity. For example, the matrix in the tutorial only contains $p=0$.

Proposed fix:

Change percentages to fractions in calculateFraction
Change calculatePvalue accordingly
Set alternative='less' for Fisher's test in calculatePvalue

If you agree with the proposed fix, I can open a pull request. Just let me know.

nargesr commented 2 months ago

Hi @lorenzoamir

Thank you for catching them.

Regarding the problem with calculateFraction(): I would change the rest of the values from percentages to fractions. I'm trying to avoid changing the function name, if possible.

Regarding the problem with calculatePvalue(): I would prefer to keep it this way, since I'm more interested in the pairs of modules that have strong mutual overlap, but we can add an input parameter (alternative='two-sided') so that those who prefer another alternative hypothesis can change it.

If you agree with me and want to fix these as I proposed, please let me know and I will wait for you to open a pull request to fix this.

Best, Narges

lorenzoamir commented 2 months ago

Hi, I agree with using fractions and not changing any function name.

Regarding the p-values, I said we should use alternative='less', but I was wrong: the correct one is actually alternative='greater'. The problem with alternative='two-sided' (the current default) is that it picks up both modules with more overlapping genes than expected (what we want to detect) and modules with fewer overlapping genes than expected (which don't look particularly interesting to me). I have made a small code example to show this. I created two pairs of modules: the first pair has many overlapping genes, the second has only one. I then tested the different alternatives. The ideal outcome is that the first pair is significant and the second is not:

Case1: high overlap
    two-sided:
    p_val:  0.02913752913752914
    greater:
    p_val:  0.01456876456876457
    less:
    p_val:  0.9997086247086248

Case2: low overlap
    two-sided:
    p_val:  0.02913752913752914
    greater:
    p_val:  0.9997086247086248
    less:
    p_val:  0.01456876456876457

As you can see, alternative='greater' is the one that only counts modules with high overlap as significant, while alternative='two-sided' considers both. I think we should use greater, since it's probably what the user will expect when calling the function.
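
The code behind these numbers is not shown in the thread. As a rough sketch of where they come from, the block below computes the three alternatives directly from hypergeometric tail sums using only the standard library. The two 2x2 tables are an assumption: they were picked because, with 14 genes split 7/7, they give the same p-values as those quoted above. `fisher_p` is a hypothetical helper, not the SciPy function.

```python
from math import comb


def fisher_p(table, alternative="two-sided"):
    """Exact p-value for a 2x2 table [[a, b], [c, d]] with fixed margins.

    The top-left cell `a` is the number of genes shared by both modules.
    """
    (a, b), (c, d) = table
    row1, col1, n = a + b, a + c, a + b + c + d
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)

    def weight(x):
        # Number of ways to get x shared genes given the margins.
        return comb(col1, x) * comb(n - col1, row1 - x)

    total = comb(n, row1)
    if alternative == "greater":    # overlap at least as large as observed
        hits = sum(weight(x) for x in range(a, hi + 1))
    elif alternative == "less":     # overlap at most as large as observed
        hits = sum(weight(x) for x in range(lo, a + 1))
    elif alternative == "two-sided":  # every table no more likely than observed
        hits = sum(weight(x) for x in range(lo, hi + 1) if weight(x) <= weight(a))
    else:
        raise ValueError(f"unknown alternative: {alternative!r}")
    return hits / total


high_overlap = [[6, 1], [1, 6]]  # assumed table: 6 of 7 genes shared
low_overlap = [[1, 6], [6, 1]]   # assumed table: 1 of 7 genes shared

for name, table in [("high overlap", high_overlap), ("low overlap", low_overlap)]:
    for alt in ("two-sided", "greater", "less"):
        print(name, alt, fisher_p(table, alt))
```

With these tables the three values are 100/3432, 50/3432 and 3431/3432. The two-sided test sums both tails, which is why it returns the same p-value for the high-overlap and the low-overlap pair.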

nargesr commented 1 month ago

Okay, sounds good! But I would still prefer to add this as an input parameter so people can change it.
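
One way such a parameter could look is sketched below. This is not PyWGCNA's actual code: `module_overlap_pvalue`, `n_background`, and the way the contingency table is built are all assumptions for illustration, and the 'greater' default simply follows the suggestion made earlier in the thread. Only `scipy.stats.fisher_exact` and its `alternative` argument are real API.

```python
from scipy.stats import fisher_exact


def module_overlap_pvalue(module_a, module_b, n_background, alternative="greater"):
    """Fisher's exact test for the overlap of two gene modules.

    Hypothetical helper. `n_background` is the assumed total number of
    genes that could have been assigned to either module.
    """
    a, b = set(module_a), set(module_b)
    both = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    neither = n_background - both - only_a - only_b
    if neither < 0:
        raise ValueError("n_background is smaller than the union of the modules")

    table = [[both, only_a], [only_b, neither]]
    # The caller's choice is passed straight through to SciPy.
    _, p_value = fisher_exact(table, alternative=alternative)
    return p_value


genes_a = [f"g{i}" for i in range(7)]           # g0 .. g6
genes_b = [f"g{i}" for i in range(6)] + ["g7"]  # shares g0 .. g5 with genes_a

p_default = module_overlap_pvalue(genes_a, genes_b, n_background=14)
p_two_sided = module_overlap_pvalue(genes_a, genes_b, n_background=14,
                                    alternative="two-sided")
print(p_default, p_two_sided)
```

Keeping the argument name identical to SciPy's means users who already know `fisher_exact` can switch to 'two-sided' or 'less' without learning a new option.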

mortazavilab / PyWGCNA

Some issues with calculateFraction and calculatePvalue #119