Closed jackransomlovell closed 3 years ago
Hi @jackransomlovell,
Interesting. I've never run into this issue. Do you have NaN values in one or more of your columns? I'm not sure this is a performance issue, as I think NumPy should be able to handle a least-squares regression with 500k data points on most modern computers. Can you try running the partial correlation on only a subset of columns instead of the whole dataframe? Maybe the error is driven by one specific column, or a pair of columns (e.g. with exactly the same values).
Thanks, Raphael
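One way to narrow down the columns suggested above is to screen each one for the properties mentioned in this thread (NaN, inf, constant values) before running any correlation. This is only a minimal sketch with pandas and NumPy; the helper name `screen_columns` and the toy data are hypothetical and are not part of Pingouin.

```python
import numpy as np
import pandas as pd


def screen_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise per-column properties that commonly break a least-squares fit."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "n_nan": numeric.isna().sum(),              # missing values
        "n_inf": np.isinf(numeric).sum(),           # +/- infinity
        "n_unique": numeric.nunique(dropna=True),   # 1 means a constant column
    })


# Hypothetical toy data: one column with a NaN, one constant, one with an inf.
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [0.0, 0.0, 0.0, 0.0],
    "c": [1.0, np.inf, 3.0, 4.0],
})
report = screen_columns(df)
print(report)
```

Columns with `n_inf > 0` or `n_unique <= 1` are the first candidates to drop or inspect.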
Yes, there are quite a few NaNs in the form np.nan. When I use nan_policy = 'listwise' I get an error saying something related to needing more than 3 datapoints. I think I have tried it with just a subset and I was getting the same error. I will send the exact behavior later.
@jackransomlovell
I get an error saying something related to needing more than 3 datapoints.
So my guess is that you have one or more invalid columns, either with too many missing values or with identical values that lead to the error. I would therefore run the correlation on subsets of features until you can identify the problematic features.
Thanks, Raphael
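A possible explanation for the "more than 3 datapoints" message is that listwise deletion across many sparse columns leaves almost no complete rows. The sketch below counts complete cases both across all columns and per pair of columns plus a covariate. It uses plain pandas; the helper `complete_case_counts`, the column names, and the data are hypothetical.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def complete_case_counts(df: pd.DataFrame, columns, covar: str) -> pd.DataFrame:
    """Count rows with no missing value for each (x, y, covar) triplet."""
    rows = []
    for x, y in combinations(columns, 2):
        n = len(df[[x, y, covar]].dropna())
        rows.append({"x": x, "y": y, "n_complete": n})
    return pd.DataFrame(rows)


# Hypothetical toy data with missing values spread over different rows.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [np.nan, np.nan, 1.0, 0.0, 1.0, 0.0],
    "c": [1.0, 0.0, np.nan, np.nan, np.nan, 1.0],
    "cov": [0.5, 1.5, 2.5, 3.5, 4.5, 5.5],
})

# Listwise: a row is kept only if every column is present.
n_listwise = len(df.dropna())

# Pairwise: rows are dropped separately for each pair of columns.
pair_counts = complete_case_counts(df, ["a", "b", "c"], covar="cov")
print(n_listwise)
print(pair_counts)
```

Pairs whose `n_complete` is very small point to the columns worth excluding first.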
Thanks Raphael,
A lot of the columns are binary, i.e. 1/0 float64. Would this raise such an error? If so how would I make them valid?
I see the flaw in my thinking. I have been trying to compute pairwise correlations between a column of continuous values and a bunch of other binary variables. Is this not possible with numpy's linalg module?
@jackransomlovell I don't think that the binary variables are the issue -- although using partial correlation with binary dependent variables is not ideal (as discussed in https://github.com/raphaelvallat/pingouin/issues/147). Do you have any np.inf in your data? This seems to be the most likely explanation (see https://github.com/statsmodels/statsmodels/issues/5396).
As an action item for Pingouin, we are currently using NumPy to run the least-squares regression, and I think it might be better to use scipy.linalg.lstsq with check_finite=True.
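As an illustration of the suggestion above, here is a minimal sketch of a least-squares fit with `scipy.linalg.lstsq` and `check_finite=True`. It is not Pingouin's actual code; the design matrix and response are made-up toy data. With `check_finite=True`, SciPy raises a `ValueError` when the input contains inf or NaN.

```python
import numpy as np
from scipy import linalg

# Hypothetical toy regression: y = 2 + 3 * x + noise.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = 2.0 + 3.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

# Regular fit: returns coefficients, residues, rank and singular values.
coef, residues, rank, sv = linalg.lstsq(X, y, check_finite=True)
print(coef)

# Same fit with a single inf in the response: the finite check rejects it.
y_bad = y.copy()
y_bad[0] = np.inf
try:
    linalg.lstsq(X, y_bad, check_finite=True)
    raised = False
except ValueError as err:
    raised = True
    print(f"ValueError: {err}")
```

The explicit error makes invalid input easier to diagnose than a downstream linear algebra failure.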
Hi! I have the same error from rm_corr(data, x, y, subject) and my dataset certainly has no NaN or inf. x and y are integers from 0.0 to 7.0
Hi @YuliaGaz,
Could you please provide a screenshot of the exact error as well as the code / dataset required to reproduce the error? (If this is sensitive data, feel free to DM me the data on Gitter).
Thanks, Raphael
Hi @raphaelvallat, My data is sensitive so I shared it with you through Gitter. Here I also attach a screenshot with the exact error.
As I said before my data has no NaN or inf values. My colleague offered a way to work around this error by adding tiny noise to one or both of measured values. Like:
data.value_1 = data.value_1 + np.random.randn(len(data)) * 1e-12
This solves the problem of the initial error but still gives a RankWarning: RankWarning: Polyfit may be poorly conditioned aov = ancova(dv=y, covar=x, between=subject, data=data)
Hi @YuliaGaz,
I have just looked at the data and it's because one of your subjects (119) only has one time point, thus leading to an invalid linear regression with a single pair of (x, y) in numpy.polyfit. Accordingly, removing this subject solves the issue:
pg.rm_corr(df[df['subject'] != 119], "value_1", "value_2", "subject")
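More generally, subjects with a single observation can be found and filtered out before calling the repeated-measures correlation. This is a minimal pandas sketch; the column names follow the example above, but the data are invented.

```python
import pandas as pd

# Hypothetical toy data: subject 119 has a single row.
df = pd.DataFrame({
    "subject": [1, 1, 1, 2, 2, 119],
    "value_1": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "value_2": [1.0, 2.0, 2.0, 5.0, 7.0, 3.0],
})

# Number of rows per subject.
counts = df.groupby("subject").size()

# Subjects that cannot support a within-subject regression line.
too_few = counts[counts < 2].index.tolist()

# Keep only subjects with at least two observations.
df_valid = df[~df["subject"].isin(too_few)]
print(too_few)
```

`df_valid` can then be passed to the analysis in place of `df`.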
Action items:
Thanks, Raphael
Hi @raphaelvallat, Thank you for your very fast feedback! You are right. However, this dataset was just a piece of my main dataset: I chose a small subset which still gives the same error, and unfortunately this subset has one subject with only 1 row. But my main dataset gives the same error, and all subjects there have more than 1 unique row. I have sent the full dataset to you in Gitter.
If it is helpful: rmcorr from R calculates the correlation for this big dataset without any problems and its result is identical to what I got from pg.rm_corr with the small trick with adding noise (from my previous message).
Best, Yulia
Hi @YuliaGaz,
Thanks for sharing. You're right, what's happening is that for at least one of your subjects, all the values in value_1 or value_2 are 0, which leads to a LinAlg error in numpy.polyfit. There's no such error in the R package because it uses a different, more efficient approach to calculate the ANCOVA/linear regression that can handle such cases. I implemented the ANCOVA function a long time ago and I think it could be improved.
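Subjects whose values are constant in either variable can be flagged with a group-wise count of distinct values. The following is only a sketch: the helper `constant_subjects` and the data are hypothetical, and whether such subjects should be dropped is a separate analytical decision.

```python
import pandas as pd


def constant_subjects(df: pd.DataFrame, subject: str, cols) -> list:
    """Return subjects for which at least one of `cols` takes a single value."""
    n_unique = df.groupby(subject)[cols].nunique()
    mask = (n_unique <= 1).any(axis=1)
    return mask[mask].index.tolist()


# Hypothetical toy data: subject "A" has only zeros in value_1.
df = pd.DataFrame({
    "subject": ["A", "A", "A", "B", "B", "B"],
    "value_1": [0.0, 0.0, 0.0, 1.0, 2.0, 4.0],
    "value_2": [3.0, 5.0, 6.0, 2.0, 2.0, 7.0],
})
flagged = constant_subjects(df, "subject", ["value_1", "value_2"])
print(flagged)
```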
Action item for Pingouin:
Thanks again, Raphael
Thank you Raphael! I like your package very much and I hope that it will be even cooler in future
Best wishes, Yulia
Hi all,
I think the problem I am facing also has to do with participants having only 0 in some columns. I'll try to investigate a solution. Thanks for the help.
Jack
Hi @jackransomlovell,
The next release of Pingouin will include a major refactoring of the partial correlation function (see commit https://github.com/raphaelvallat/pingouin/commit/81d1aafa0826c34e3ce8ed499a87ae5ad86843d1) which should, I believe, work even when the data contains only zeros. If you're familiar with Git, could you clone Pingouin, switch to the develop branch and retry your code?
Thanks, Raphael
Sure, I am out right now but will check when I get back. Thank you for pushing your developments!
Tested it on the develop branch and got the following error:
ValueError: array must not contain infs or NaNs
There is a large number of NaNs within the dataset, but shouldn't that be taken care of?
Could you take a screenshot of the error?
That's surprising because NaN should be automatically removed:
I agree, maybe I had not checked out the right repo when I called pip install. But here is the error. I suspect I did install the right version, as scipy is being used instead of numpy...
[image: image.png]
Rebuilt and now it is working! Thank you so much Raphael! Just want to confirm, correlating a continuous with a binary measure (1,0) is statistically sound if the data type is just an integer, right?
[image: image.png]
Hi @jackransomlovell,
Great to hear! Yeah that will work, though it might be more meaningful to perform a T-test to compare the two groups instead of a correlation.
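To illustrate the remark above: the Pearson correlation between a binary and a continuous variable (the point-biserial correlation) yields the same p-value as an independent-samples T-test with equal variances, so the two analyses test the same hypothesis. The sketch below uses SciPy on simulated data; the variable names and effect size are arbitrary.

```python
import numpy as np
from scipy import stats

# Hypothetical data: a binary group label and a continuous score.
rng = np.random.default_rng(42)
group = rng.integers(0, 2, size=200)
score = rng.normal(loc=0.5 * group, scale=1.0)
n = len(score)

# Point-biserial correlation (Pearson r with a 0/1 variable).
r, p_r = stats.pointbiserialr(group, score)

# Student's T-test between the two groups (equal variances assumed).
t, p_t = stats.ttest_ind(score[group == 1], score[group == 0])

# The T statistic can be recovered from r with n - 2 degrees of freedom.
t_from_r = r * np.sqrt((n - 2) / (1 - r**2))
print(r, p_r, t, p_t, t_from_r)
```

The T-test has the advantage of reporting the result directly as a difference between group means.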
This has been fixed in the new stable version of Pingouin (v0.4.0). Please make sure to upgrade with pip install --upgrade pingouin.
Thanks, Closing the issue.
I have a large dataset of 500k rows and 74 columns. Whenever I try to use a pairwise partial correlation I get the following error:
This is even with 12 cores running. Is there any way to resolve this, or is the pairwise correlation with a covariate just not compatible with a large dataset? pcorr() and pairwise_corr() work as methods on the dataframe.