maclandrol / FisherExact

Fisher exact test for mxn contingency table in python
MIT License

OverflowError: Python int too large to convert to C long #3

Open dterg opened 7 years ago

dterg commented 7 years ago

Running Monte Carlo simulations for more than 2x3:

  pval = fisher_exact(freqs[:, 0:3], simulate_pval=True)

gives an overflow error:

  OverflowError: Python int too large to convert to C long

Running Fisher Exact without simulations for more than 2x3 gives the error:

  Call-back cb_f2pystop_in_prterr__user__routines failed.
  capi_return is NULL
maclandrol commented 7 years ago

Can you post your contingency table ?

Without simulation, when your entries are too large, there is an out-of-memory exception that kills Python directly.

dterg commented 7 years ago

Sure thing:

  np.array([[23, 56, 39, 34, 1906, 203, 20, 23, 19, 1793, 204, 32, 88, 408, 1530, 415],
            [205, 86, 32, 25, 1710, 276, 120, 168, 122, 1372, 475, 6, 29, 324, 1699, 329]])

I came across your post in a blog comment about Python being killed by Fortran, so I tried simulating with Monte Carlo. It works absolutely fine when the contingency matrix is small, but as soon as it exceeds 3 columns I get the overflow error.

At first I thought my numbers were maybe too large, but sys.maxsize gives me "9223372036854775807L" and the biggest number in my contingency table is 1699, so I don't think the issue is NumPy's handling of the matrix (int dtype) and such.

maclandrol commented 7 years ago

If you are using a Windows computer, a C long is 32-bit instead of 64 (check sys.maxint). But the main problem is your table: computing the p-value requires factorial(np.sum(array)), which is far too large in your case. That is why the exact version failed: it ran out of memory. I'm thinking about writing a better version because, right now, the exception raised can't be caught and ends in a segmentation fault.
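To get a feel for the scale involved, here is a quick sketch (not part of the library) that uses the table posted above and estimates the size of the factorial in log space with math.lgamma, so the huge integer is never materialized:

```python
import math
import numpy as np

# The contingency table posted earlier in this thread.
table = np.array([[23, 56, 39, 34, 1906, 203, 20, 23, 19, 1793, 204, 32, 88, 408, 1530, 415],
                  [205, 86, 32, 25, 1710, 276, 120, 168, 122, 1372, 475, 6, 29, 324, 1699, 329]])
n = int(table.sum())

# The number of decimal digits in n! is roughly lgamma(n + 1) / ln(10).
# Working in log space is how one usually sidesteps this kind of overflow.
digits = math.lgamma(n + 1) / math.log(10)
print(n, round(digits))
```

The grand total is 13771, and 13771! has tens of thousands of decimal digits, which is why computing it directly exhausts memory.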

Even though the Monte Carlo simulation uses an efficient way to generate alternative tables with the same marginal totals, your values are still too high (the limit is 5000 and your totals are [6793, 6978]). This is fixable (although the impact on memory use should be assessed first); I will do it as soon as I'm a little free this week. Test success will then depend on the amount of workspace set.

I wasn't able to reproduce the OverflowError, but I will look into it. Can you please post the full error message (the line where the exception occurs)? Thanks.

In the meantime, may I recommend using a chi2 or G-test instead?
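For reference, SciPy's chi2_contingency covers both suggestions (the G-test via its lambda_ parameter); a minimal sketch using the table from this thread:

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[23, 56, 39, 34, 1906, 203, 20, 23, 19, 1793, 204, 32, 88, 408, 1530, 415],
                  [205, 86, 32, 25, 1710, 276, 120, 168, 122, 1372, 475, 6, 29, 324, 1699, 329]])

# Pearson chi-squared test of independence.
chi2, p, dof, expected = chi2_contingency(table)

# G-test (log-likelihood ratio) via the lambda_ parameter.
g, p_g, _, _ = chi2_contingency(table, lambda_="log-likelihood")
print(dof, p, p_g)
```

Both tests only need the expected counts, so table size is not a problem here.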

dterg commented 7 years ago

Thanks for the explanation. I am currently using chi2, but the error is thrown at the following lines:

  File "C:\Users\~\AppData\Local\Continuum\Anaconda2\lib\site-packages\FisherExact\Fisher.py", line 130, in fisher_exact
  tmp_res = _fisher_sim(c, replicate, seed)

  File "C:\Users\~\AppData\Local\Continuum\Anaconda2\lib\site-packages\FisherExact\Fisher.py", line 210, in _fisher_sim
  seed = np.array([seed], dtype='int32')
  OverflowError: Python int too large to convert to C long
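The int32 cast in that traceback is the likely culprit: a seed larger than 2**31 - 1 cannot be represented in an int32 (or in a 32-bit C long on Windows). A hypothetical helper, sketching the kind of clamping the library would need before the cast:

```python
import numpy as np

def clamp_seed(seed):
    # Hypothetical helper: fold an arbitrary Python int into the
    # int32 range so a cast like the one in _fisher_sim cannot overflow.
    return int(seed) % np.iinfo(np.int32).max

# An oversized seed (e.g. from a 64-bit time-based default) would
# overflow the cast to int32; the clamped value always fits.
big_seed = 2**40 + 12345
safe = np.array([clamp_seed(big_seed)], dtype='int32')
print(safe.dtype, int(safe[0]))
```

This also matches the workaround below: passing an explicit small seed avoids the oversized default entirely.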

With MC simulation and an explicitly stated seed, I (think I) got around this. It must be something in the default type casting of the seed argument? Although now I'm sometimes getting an error, which I'm looking into:

  ValueError: Fortran subroutine rcont2 return an error !
maclandrol commented 7 years ago

The overflow error and table limit for simulation should be fixed with commit b1ec15876ce1f3216488c7a008b45a0c26e3a68b. However, I'm unable to test it on windows.

If you have some time, let me know if it worked and I will update the pip version

On another note, the MC simulation returns the p-value as pval = (r+1)/(n+1) (see this article), where r is the number of replicates with a statistic at least as extreme as your observed table's and n is the total number of simulated alternative tables.

In your case, r=0, even with 150000 simulations. This is difficult to interpret and will give you a very different p-value depending on the number of replicates requested. Either the chance of generating a table more extreme than your current one really is that low, or the heuristic used to generate alternative tables is inefficient for large table sums. I will investigate the latter.
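To make the dependence on the replicate count concrete, here is the formula above as a one-liner; with r = 0 the result is just a floor set by n, so it keeps shrinking as more replicates are requested:

```python
def mc_pvalue(r, n):
    # Monte Carlo p-value with the +1 correction: r replicates at
    # least as extreme as the observed table, out of n replicates.
    return (r + 1) / (n + 1)

# With r = 0 the "p-value" is purely an artifact of the sample size:
print(mc_pvalue(0, 1000))    # 1/1001
print(mc_pvalue(0, 150000))  # 1/150001
```

So a reported p-value of 1/(n+1) only means "no simulated table was as extreme", not that the true p-value has been pinned down.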

dterg commented 7 years ago

Thanks for the explanation, article reference and your time. I will test it this weekend and get back to you.

zhiyzuo commented 7 years ago

I also got this error: ValueError: Fortran subroutine rcont2 return an error !. Any suggestions?

Thanks!

maclandrol commented 7 years ago

@zhiyzuo, it would be helpful to have your input data. Also, I plan to rewrite everything in Python soon.