raphaelvallat / pingouin

Statistical package in Python based on Pandas
https://pingouin-stats.org/
GNU General Public License v3.0
1.64k stars, 139 forks

ValueError with large dataset #184

Closed jackransomlovell closed 3 years ago

jackransomlovell commented 3 years ago

I have a large dataset of 500k rows and 74 columns. Whenever I try to use a pairwise partial correlation I get the following error:

----> 1 data2.pairwise_corr(covar = 'SEX')

~/.local/lib/python3.7/site-packages/pandas_flavor/register.py in __call__(self, *args, **kwargs)
     27             @wraps(method)
     28             def __call__(self, *args, **kwargs):
---> 29                 return method(self._obj, *args, **kwargs)
     30 
     31         register_dataframe_accessor(method.__name__)(AccessorMethod)

~/.local/lib/python3.7/site-packages/pingouin/pairwise.py in pairwise_corr(data, columns, covar, tail, method, padjust, nan_policy)
   1229         else:
   1230             cor_st = partial_corr(data=data, x=col1, y=col2, covar=covar,
-> 1231                                   tail=tail, method=method)
   1232         cor_st_keys = cor_st.columns.tolist()
   1233 

~/.local/lib/python3.7/site-packages/pingouin/correlation.py in partial_corr(data, x, y, covar, x_covar, y_covar, tail, method, **kwargs)
    783         # PARTIAL CORRELATION
    784         cvar = np.atleast_2d(C[covar].to_numpy())
--> 785         beta_x = np.linalg.lstsq(cvar, C[x].to_numpy(), rcond=None)[0]
    786         beta_y = np.linalg.lstsq(cvar, C[y].to_numpy(), rcond=None)[0]
    787         res_x = C[x].to_numpy() - cvar @ beta_x

<__array_function__ internals> in lstsq(*args, **kwargs)

/curc/sw/anaconda3/2019.07/envs/jupyterlab2/lib/python3.7/site-packages/numpy/linalg/linalg.py in lstsq(a, b, rcond)
   2257         # lapack can't handle n_rhs = 0 - so allocate the array one larger in that axis
   2258         b = zeros(b.shape[:-2] + (m, n_rhs + 1), dtype=b.dtype)
-> 2259     x, resids, rank, s = gufunc(a, b, rcond, signature=signature, extobj=extobj)
   2260     if m == 0:
   2261         x[...] = 0

ValueError: On entry to DLASCL parameter number 4 had an illegal value

This happens even with 12 cores running. Is there any way to resolve this, or is the pairwise correlation with a covariate just not compatible with a large dataset? pcorr() and pairwise_corr() work as methods on the dataframe.
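For readers following the traceback: the failing lines regress x and y on the covariate with a least-squares fit and then work with the residuals. Below is a minimal sketch of that idea in plain NumPy. It is not Pingouin's implementation; the function name `partial_corr_residuals` is hypothetical, and the explicit intercept column is an assumption made so the sketch works on unstandardized data.

```python
import numpy as np

def partial_corr_residuals(x, y, covar):
    """Correlate x and y after regressing the covariate(s) out of both.

    Hypothetical helper for illustration only, not part of Pingouin.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix: an intercept column plus one column per covariate.
    covar = np.asarray(covar, dtype=float).reshape(len(x), -1)
    design = np.column_stack([np.ones(len(x)), covar])
    # Least-squares fits, as in the traceback above.
    beta_x = np.linalg.lstsq(design, x, rcond=None)[0]
    beta_y = np.linalg.lstsq(design, y, rcond=None)[0]
    # Residuals: the part of x and y not explained by the covariate(s).
    res_x = x - design @ beta_x
    res_y = y - design @ beta_y
    return np.corrcoef(res_x, res_y)[0, 1]

# Synthetic data: x and y are related only through the shared covariate z.
rng = np.random.default_rng(0)
z = rng.normal(size=1000)
x = z + rng.normal(size=1000)
y = z + rng.normal(size=1000)

r_raw = np.corrcoef(x, y)[0, 1]
r_partial = partial_corr_residuals(x, y, z)
print(f"raw r = {r_raw:.3f}, partial r = {r_partial:.3f}")
```

Because the fit goes through LAPACK, non-finite or degenerate inputs can surface as low-level errors like the DLASCL message above rather than as a readable Python exception.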

raphaelvallat commented 3 years ago

Hi @jackransomlovell,

Interesting. I've never run into this issue. Do you have NaN values in one or more of your columns? I'm not sure this is a performance issue, as I think NumPy should be able to handle a least-squares regression with 500k data points on most modern computers. Can you try running the partial correlation with only a subset of columns instead of the whole dataframe? Maybe the error is driven by one specific column, or a pair of columns (e.g. with exactly the same values).

Thanks, Raphael

jackransomlovell commented 3 years ago

Yes, there are quite a few NaNs, in the form of np.nan. When I use nan_policy = 'listwise' I get an error saying something related to needing more than 3 datapoints. I think I have tried it with just a subset and I was getting the same error. I will send the exact behavior later.

raphaelvallat commented 3 years ago

@jackransomlovell

I get an error saying something related to needing more than 3 datapoints.

So my guess is that you have one or more invalid columns, either with too many missing values or with identical values, and that these lead to the error. I would therefore run the correlation on subsets of features until you can identify the problematic ones.
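One way to narrow this down without running every subset by hand is to scan the columns first. The sketch below is only an illustration: `flag_suspect_columns` and the `min_valid=3` threshold are hypothetical and not part of Pingouin. It flags numeric columns that have very few non-missing values or only a single distinct value.

```python
import numpy as np
import pandas as pd

def flag_suspect_columns(df, min_valid=3):
    """Return numeric columns that are nearly empty or constant.

    Hypothetical helper; min_valid is an assumed threshold.
    """
    numeric = df.select_dtypes(include="number")
    report = pd.DataFrame({
        "n_valid": numeric.notna().sum(),          # non-missing values per column
        "n_unique": numeric.nunique(dropna=True),  # distinct non-missing values
    })
    report["too_few_values"] = report["n_valid"] < min_valid
    report["constant"] = report["n_unique"] <= 1
    return report[report["too_few_values"] | report["constant"]]

# Toy data standing in for the real 74-column dataframe.
df = pd.DataFrame({
    "ok": [1.0, 2.0, 3.0, 4.0],
    "mostly_nan": [np.nan, np.nan, np.nan, 1.0],
    "constant": [0.0, 0.0, 0.0, 0.0],
})
suspects = flag_suspect_columns(df)
print(suspects)
```

Note that with listwise deletion the number of complete rows can be far smaller than the per-column counts suggest, so a clean report here does not rule out the "too few datapoints" error.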

Thanks, Raphael

jackransomlovell commented 3 years ago

Thanks Raphael,

A lot of the columns are binary, i.e. 1/0 float64. Would this raise such an error? If so how would I make them valid?

jackransomlovell commented 3 years ago

I see the flaw in my thinking. I have been trying to compute pairwise correlations between a column of continuous values and a bunch of other binary variables. Is this not possible with numpy's linalg module?

raphaelvallat commented 3 years ago

@jackransomlovell I don't think that the binary variables are the issue -- although using partial correlation with binary dependent variables is not ideal (as discussed in https://github.com/raphaelvallat/pingouin/issues/147). Do you have any np.inf in your data? This seems to be the most likely explanation (see https://github.com/statsmodels/statsmodels/issues/5396).
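A quick check for infinite values could look like the following sketch. The dataframe is toy data, and replacing infinities with NaN is just one possible way to handle them, not a recommendation from the thread.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataframe.
df = pd.DataFrame({
    "a": [1.0, np.inf, 3.0],
    "b": [0.0, 1.0, np.nan],
    "c": [-np.inf, np.inf, 2.0],
})

numeric = df.select_dtypes(include="number")
# Count +inf and -inf per column (NaN is not counted by np.isinf).
inf_counts = np.isinf(numeric).sum()
columns_with_inf = inf_counts[inf_counts > 0].index.tolist()
print(columns_with_inf)

# One option: treat infinities as missing so the NaN handling picks them up.
cleaned = numeric.replace([np.inf, -np.inf], np.nan)
```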

As an action item for Pingouin, we are currently using NumPy to run the least-squares regression:

https://github.com/raphaelvallat/pingouin/blob/ab5662b92ab730d50c430cf44703c4c2b62402fc/pingouin/correlation.py#L782-L788

and I think it might be better to use scipy.linalg.lstsq with check_finite=True:

[image]
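The screenshot is not preserved here. As a rough illustration of what the check does (a sketch, not the code Pingouin ended up with), `scipy.linalg.lstsq` with `check_finite=True` raises a Python-level `ValueError` on non-finite input instead of passing it down to LAPACK:

```python
import numpy as np
from scipy import linalg

# Toy covariate matrix (n_samples x n_covariates) and a response with an inf.
covar = np.array([[1.0], [2.0], [3.0], [4.0]])
x = np.array([1.0, 2.0, np.inf, 4.0])

try:
    # check_finite=True (the SciPy default) validates the inputs up front.
    linalg.lstsq(covar, x, check_finite=True)
    message = None
except ValueError as err:
    message = str(err)

print(message)
```

The benefit is mainly diagnostic: the user sees an error that names the actual problem rather than an opaque DLASCL parameter message.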

YuliaGaz commented 3 years ago

Hi! I have the same error from rm_corr(data, x, y, subject) and my dataset certainly has no NaN or inf. x and y are integers from 0.0 to 7.0

raphaelvallat commented 3 years ago

Hi @YuliaGaz,

Could you please provide a screenshot of the exact error as well as the code / dataset required to reproduce the error? (If this is sensitive data, feel free to DM me the data on Gitter).

Thanks, Raphael

YuliaGaz commented 3 years ago

Hi @raphaelvallat, my data is sensitive, so I shared it with you through Gitter. I also attach a screenshot with the exact error here: [image: for_github]

As I said before my data has no NaN or inf values. My colleague offered a way to work around this error by adding tiny noise to one or both of measured values. Like:

data.value_1 = data.value_1 + np.random.randn(len(data)) * 1e-12

This solves the problem of the initial error but still gives a RankWarning:

RankWarning: Polyfit may be poorly conditioned
  aov = ancova(dv=y, covar=x, between=subject, data=data)

raphaelvallat commented 3 years ago

Hi @YuliaGaz,

I have just looked at the data and it's because one of your subjects (119) only has one time point, thus leading to an invalid linear regression with a single pair of (x, y) in numpy.polyfit. Accordingly, removing this subject solves the issue:

pg.rm_corr(df[df['subject'] != 119], "value_1", "value_2", "subject")
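To find such subjects without inspecting the data by eye, one could count rows per subject. This is a small sketch on toy data, reusing the column names from the call above:

```python
import pandas as pd

# Toy data: subject 119 has a single time point.
df = pd.DataFrame({
    "subject": [1, 1, 1, 2, 2, 119],
    "value_1": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "value_2": [1.0, 1.5, 2.5, 2.0, 3.0, 7.0],
})

# Number of observations per subject.
counts = df.groupby("subject").size()
single_obs = counts[counts < 2].index.tolist()
print(single_obs)

# Keep only subjects with at least two observations before calling rm_corr.
filtered = df[~df["subject"].isin(single_obs)]
```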

Action items:

Thanks, Raphael

YuliaGaz commented 3 years ago

Hi @raphaelvallat, thank you for your very fast feedback! You are right. However, this dataset was just a piece of my main dataset: I chose a small subset which still gives the same error, and unfortunately this subset has one subject with only 1 row. But my main dataset gives the same error, and all subjects there have more than one unique row. [image] I have sent the full dataset to you on Gitter.

If it is helpful: rmcorr from R calculates the correlation for this big dataset without any problems and its result is identical to what I got from pg.rm_corr with the small trick with adding noise (from my previous message).

Best, Yulia

raphaelvallat commented 3 years ago

Hi @YuliaGaz,

Thanks for sharing. You're right, what's happening is that for at least one of your subjects, all the values in value_1 or value_2 are 0, which leads to a LinAlg error in numpy.polyfit. There's no such error in the R package because it uses a different, more efficient approach to calculate the ANCOVA/linear regression that can handle such cases. I implemented the ANCOVA function a long time ago and I think it could be improved.
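A sketch for locating subjects whose values never vary in one of the two measures (toy data; whether to drop such subjects is an analysis decision this thread does not settle):

```python
import pandas as pd

# Toy data: subject "a" has value_1 equal to 0 at every time point.
df = pd.DataFrame({
    "subject": ["a", "a", "a", "b", "b", "b"],
    "value_1": [0.0, 0.0, 0.0, 1.0, 2.0, 3.0],
    "value_2": [1.0, 2.0, 3.0, 2.0, 4.0, 5.0],
})

# Distinct values per subject for each measure.
n_unique = df.groupby("subject")[["value_1", "value_2"]].nunique()
# A subject is flagged if either measure has at most one distinct value.
constant_subjects = n_unique[(n_unique <= 1).any(axis=1)].index.tolist()
print(constant_subjects)
```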

Action item for Pingouin:

Thanks again, Raphael

YuliaGaz commented 3 years ago

Thank you Raphael! I like your package very much and I hope that it will be even cooler in future

Best wishes, Yulia

jackransomlovell commented 3 years ago

Hi all,

I think the problem I am facing also has to do with participants only having 0 for some columns. I'll try to investigate a solution. Thanks for the help.

Jack

raphaelvallat commented 3 years ago

Hi @jackransomlovell,

The next release of Pingouin will include a major refactoring of the partial correlation function (see commit https://github.com/raphaelvallat/pingouin/commit/81d1aafa0826c34e3ce8ed499a87ae5ad86843d1) which should, I believe, work even when the data contains only zeros. If you're familiar with Git, could you clone Pingouin, switch to the develop branch and retry your code?

Thanks, Raphael

jackransomlovell commented 3 years ago

Sure, I am out right now but will check when I get back. Thank you for pushing your developments!

jackransomlovell commented 3 years ago

Tested it on the develop branch and got the following error:

ValueError: array must not contain infs or NaNs

jackransomlovell commented 3 years ago

There is a large number of NaNs within the dataset, but shouldn't that be taken care of?

raphaelvallat commented 3 years ago

Could you take a screenshot of the error?

That's surprising because NaN should be automatically removed:

https://github.com/raphaelvallat/pingouin/blob/e56df016966f34c9d2f7cca882e84382dcff4d2d/pingouin/correlation.py#L788-L792
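For context, listwise removal of missing values generally means keeping only the rows that are complete across the columns involved in the computation. The sketch below shows that idea with pandas on toy data; it is an assumption about the general technique, not a copy of the linked Pingouin lines.

```python
import numpy as np
import pandas as pd

# Toy data: NaNs scattered across the columns of interest and an unrelated one.
df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0],
    "y": [2.0, np.nan, 3.0, 5.0, 4.0],
    "SEX": [0.0, 1.0, 1.0, 0.0, 1.0],
    "unrelated": [np.nan, np.nan, 1.0, np.nan, 2.0],
})

# Only the columns used in the partial correlation decide which rows are dropped.
used = ["x", "y", "SEX"]
complete = df[used].dropna()
print(len(df), "->", len(complete), "rows")
```

Restricting the drop to the columns in use matters on wide datasets: dropping rows with a NaN in any of 74 columns can leave almost nothing.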

jackransomlovell commented 3 years ago

I agree, maybe I did not have the right repo checked out when I called pip install. But here is the error. I suspect I did install the right version, as scipy is being used instead of numpy...

[image: image.png]

jackransomlovell commented 3 years ago

Rebuilt and now it is working! Thank you so much Raphael! Just want to confirm: correlating a continuous measure with a binary measure (1, 0) is statistically sound if the data type is just an integer, right?

[image: image.png]

raphaelvallat commented 3 years ago

Hi @jackransomlovell,

Great to hear! Yeah that will work, though it might be more meaningful to perform a T-test to compare the two groups instead of a correlation.
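As a side note on why the two approaches are so closely related: a Pearson correlation between a 0/1 variable and a continuous variable (the point-biserial correlation) tests the same hypothesis as a pooled-variance independent-samples t-test, and the p-values coincide. Here is a small sketch with SciPy on synthetic data; the variable names are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Binary group membership (0/1) and a continuous score shifted by group.
group = rng.integers(0, 2, size=200)
score = rng.normal(loc=group * 0.5, scale=1.0)

# Pearson correlation with a binary variable (point-biserial correlation).
r, p_corr = stats.pearsonr(group, score)

# Independent-samples t-test with pooled variance (SciPy's default).
t, p_ttest = stats.ttest_ind(score[group == 1], score[group == 0])

print(f"r = {r:.3f}, p = {p_corr:.3g}; t = {t:.3f}, p = {p_ttest:.3g}")
```

The t-test has the advantage of reporting the group difference directly, which is usually easier to interpret than a correlation coefficient with a 0/1 variable.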

raphaelvallat commented 3 years ago

This has been fixed in the new stable version of Pingouin (v0.4.0). Please make sure to upgrade with pip install --upgrade pingouin.

Thanks, Closing the issue.