pyranges / pyranges_1.x

PyRanges as a DataFrame subclass.
MIT License
9 stars 2 forks source link

action check is broken #35

Open marco-mariotti opened 3 weeks ago

marco-mariotti commented 3 weeks ago

See https://github.com/pyranges/pyranges_1.x/actions/runs/9593389507 This is due to fisher, see https://github.com/brentp/fishers_exact_test/issues/48

Must re-run / adapt eventually, depending on whether the issue is resolved rapidly in fisher

endrebak commented 3 weeks ago

I suggest removing the fisher dependency and turning off the doctests for now. Or lock an earlier version of fisher

marco-mariotti commented 3 weeks ago

weirdly enough, earlier versions of fisher have the same problem. It must be because some of its dependencies latests versions are causing the problem...? Because fisher is a fringe add-on, used only in one function, I wonder if it can be removed permanently. The performance trade-off, even when used, is likely minimal. It should be easy to replace with a python core/numpy function, something already present as pyranges dep. But I can't look at this for several days

endrebak commented 3 weeks ago

It is OK with me. It is already an optional add-on. I could not find an alternative that was not silly slow

endrebak commented 3 weeks ago

When you have a lot of regions to test the other alternatives are very slow. I guess I could write something myself in C/Cython, but it might not be an optimal use of my time.

If you find something that is fast please tell me

endrebak commented 3 weeks ago

SciPy: 266 µs ± 7.13 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) Fisher: 280 ns ± 0.721 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Another problem is that SciPy only works on one table at a time but Fisher is vectorized. So the SciPy version is orders of magnitudes slower. I use Fisher exact a lot in my thesis so the difference matters.

I suggest we implement it with scipy as the default and use the fisher library only if it is installed.

I do not agree that fisher is niche, it is a common statistical test and needed for functional enrichment which is ubiquitous in bioinformatics.

endrebak commented 3 weeks ago

The performance difference was seconds compared to tens of minutes IIRC due to SciPy needing Python looping and reading data from random memory addresses.

marco-mariotti commented 3 weeks ago

OK, then fisher package stays. It's already optional, but optional add-ons are required for testing and thus to proceed with any workflows. I think it's healthy that it stays that way. It seems an easy fix to the fisher package. I submitted a PR, should be solved soon.

endrebak commented 3 weeks ago

I am not going to insist. I could make statistics an add-on package that needs to be imported separately. That is not a bad idea actually.