mortazavilab / PyWGCNA

PyWGCNA is a Python package designed to do Weighted Gene Correlation Network analysis (WGCNA)
https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad415/7218311
MIT License

Some issues with calculateFraction and calculatePvalue #119

Closed lorenzoamir closed 1 month ago

lorenzoamir commented 2 months ago

Hi, I was trying to compare some WGCNA objects and I believe I noticed a few issues in the comparison.

Issues in calculateFraction:

Issues in calculatePvalue:

Proposed fix:

If you agree with the proposed fix, I can open a pull request. Just let me know.

nargesr commented 2 months ago

Hi @lorenzoamir

Thank you for catching them.

Regarding the problem with calculateFraction(): I would change the remaining values from percentages to fractions. I'm trying to avoid changing the function name, if possible.

Regarding the problem with calculatePvalue(): I would prefer to keep it this way, since I'm more interested in pairs of modules that have strong mutual overlap. However, we can add an input parameter (alternative='two-sided') so that those who prefer a different alternative hypothesis can change it.

If you agree with me and want to fix these as I proposed, please let me know and I will wait for you to open a pull request to fix this.

Best, Narges

lorenzoamir commented 2 months ago

Hi, I agree with using fractions and not changing any function name.

Regarding the p-values, I said we should use alternative='less', but I was wrong: the correct one is actually alternative='greater'. The problem with alternative='two-sided' (the current default) is that it picks up both modules with more overlapping genes than expected (what we want to detect) and modules with fewer overlapping genes than expected (which don't look particularly interesting to me). I have made a small code example to show this. I created two pairs of modules, the first with many overlapping genes and the second with only one, and tested the different alternatives. The ideal outcome is that the first pair is significant and the second is not:

Case1: high overlap
    two-sided:
    p_val:  0.02913752913752914
    greater:
    p_val:  0.01456876456876457
    less:
    p_val:  0.9997086247086248

Case2: low overlap
    two-sided:
    p_val:  0.02913752913752914
    greater:
    p_val:  0.9997086247086248
    less:
    p_val:  0.01456876456876457

As you can see, alternative='greater' is the only one that counts just the high-overlap modules as significant, while alternative='two-sided' counts both. I think we should use 'greater', since it's probably what the user will expect when calling the function.
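The code behind this comparison is not included in the thread. A minimal sketch that gives the same p-values is shown here, assuming the test is Fisher's exact test via scipy.stats.fisher_exact. The two 2x2 tables are hypothetical: they were chosen for illustration (14 genes, two modules of 7 genes each) and are not taken from the original example.

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 contingency tables (assumed, not from the thread).
# Rows: gene in module A / not in module A.
# Columns: gene in module B / not in module B.
tables = {
    "high overlap": [[6, 1], [1, 6]],  # 6 of 7 genes shared
    "low overlap": [[1, 6], [6, 1]],   # 1 of 7 genes shared
}

results = {}
for name, table in tables.items():
    print(f"{name}:")
    for alt in ("two-sided", "greater", "less"):
        # fisher_exact returns (odds ratio, p-value)
        _, p_val = fisher_exact(table, alternative=alt)
        results[(name, alt)] = p_val
        print(f"    {alt}: p_val = {p_val}")
```

With these tables, 'two-sided' gives the same p-value for both pairs, while 'greater' is small only for the high-overlap pair and 'less' is small only for the low-overlap pair.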

nargesr commented 1 month ago

Okay, sounds good! But I would still prefer to add this as an input parameter so that people can change it.
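One way the agreed design could look is sketched here: a one-sided test by default, with the alternative exposed as an input parameter. The function name overlap_pvalue, its arguments, and the gene names are all hypothetical and are not PyWGCNA's actual API. The sketch also assumes the p-value comes from scipy.stats.fisher_exact.

```python
from scipy.stats import fisher_exact


def overlap_pvalue(module_a, module_b, universe, alternative="greater"):
    """Hypothetical helper: Fisher's exact test for the overlap of two modules.

    alternative is passed straight to scipy.stats.fisher_exact and may be
    'greater' (default here), 'less', or 'two-sided'.
    """
    universe = set(universe)
    a = set(module_a) & universe
    b = set(module_b) & universe

    # Build the 2x2 contingency table over the shared gene universe.
    both = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    neither = len(universe - a - b)

    _, p_val = fisher_exact([[both, only_a], [only_b, neither]],
                            alternative=alternative)
    return p_val


# Usage with made-up gene names: 14 genes, two modules sharing 6 genes.
genes = [f"g{i}" for i in range(14)]
mod_a = genes[:7]                  # g0..g6
mod_b = genes[:6] + [genes[7]]     # g0..g5 and g7

p_default = overlap_pvalue(mod_a, mod_b, genes)
p_two_sided = overlap_pvalue(mod_a, mod_b, genes, alternative="two-sided")
print(p_default, p_two_sided)
```

Defaulting to 'greater' flags only enrichment, while callers who want the previous behaviour can still pass alternative='two-sided'.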