pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.73k stars 17.95k forks source link

QST: Consistently apply Welford Method and Kahan Summation in roll_xxx functions #59715

Open kaixiongg opened 2 months ago

kaixiongg commented 2 months ago

Research

Link to question on StackOverflow

https://stackoverflow.com/questions/78951576/pandas-consistently-apply-welford-method-and-kahan-summation-in-roll-xxx-functi

Question about pandas

For the following functions:

1.def nancorr 2.cdef void add_var 3.cdef void add_skew 4.cdef void add_mean

It appears that both the Welford method and Kahan summation are taken into account. However, for second-order functions like correlation and variance, only the Welford method is used without Kahan summation for the means (meanx or meany). For third-order functions like skewness, only Kahan summation for the naive one-pass algorithm is employed without using Welford.

My question is: How does the Pandas community decide which method to use for stable precision? If our goal is to achieve the highest possible stability, it seems that all these functions should utilize a combination of Welford and Kahan methods.

Could you please clarify the rationale behind these choices?

Liam3851 commented 2 months ago

Seems related to #57972 and #59572 (perhaps root-cause of the above issues?).

kaixiongg commented 2 months ago

Seems related to #57972 and #59572 (perhaps root-cause of the above issues?).

So, does that mean only applying Welford for those online functions is enough? We don't need Kahan summation for those complicated multi-order functions?

Liam3851 commented 2 months ago

@kaixiongg I was just linking those issues because multiple users have noted that the pandas kurtosis functions seem less stable than similar functions from e.g. scipy. To the extent that some stability techniques are not being used that may be the root cause of the kurtosis issues reported.

kaixiongg commented 2 months ago

@kaixiongg I was just linking those issues because multiple users have noted that the pandas kurtosis functions seem less stable than similar functions from e.g. scipy. To the extent that some stability techniques are not being used that may be the root cause of the kurtosis issues reported.

Thanks for clarifying. I have also noticed that roll_var/std face similar issues. (#52407 + #54518 + #53829) Could you please reopen those issues? I would appreciate it if they could be reassigned to someone who can address them. Additionally, if there are any better algorithms or solutions available, I would be glad to hear about them.

rhshadrach commented 1 month ago

I think PRs to improve the consistency of numerical operations used would be welcome, provided they do not significantly degrade performance.

Below are my thoughts on the issue, but I haven't been too involved in this aspect in the past. cc @pandas-dev/pandas-core for any thoughts.

How does the Pandas community decide which method to use for stable precision?

I believe thus far it has been done on an ad-hoc basis, mostly dependent on whether there is a contributor interested in improving the stability of an operation.

If our goal is to achieve the highest possible stability

I would say this is not our goal. Rather, it is to balance stability and performance. However I do not know of any objective criteria utilized in doing so, to my awareness it has been done on an ad-hoc basis.

My ideal solution would be to allow the user to choose between reasonable implementations of these methods, defaulting to a numerically stable operation but allowing less stable options for more performance. However, I worry about the maintenance burden this would impose.

kaixiongg commented 1 month ago

I would say this is not our goal. Rather, it is to balance stability and performance. However I do not know of any objective criteria utilized in doing so, to my awareness it has been done on an ad-hoc basis.

I agree that our goal is to balance stability and performance. However, the precision issue is more of an optimization problem rather than a right/wrong issue. Some issues show that 'incorrect' results are due to precision loss. Even with the most precise algorithm we can implement, there will always be some corner cases where the result appears 'incorrect.' From my understanding, Welford’s method is commonly used for rolling calculations. If we were to add Kahan summation to these algorithms, we’d need to apply it to each operation, which seems like it would come with a significant performance cost, and still wouldn't completely solve the 'bug' caused by precision loss.

My ideal solution would be to allow the user to choose between reasonable implementations of these methods, defaulting to a numerically stable operation but allowing less stable options for more performance. However, I worry about the maintenance burden this would impose.

Yes, that would be great, but it seems like most open-source packages don’t take that approach.

rhshadrach commented 1 month ago

I have also noticed that roll_var/std face similar issues. (#52407 + #54518 + #53829) Could you please reopen those issues?

The two that are issues are open (and have never been closed); the third is an old PR which I believe should stay closed.

which seems like it would come with a significant performance cost...

It seems to me this should be experimented with to determine what the performance cost is.

...and still wouldn't completely solve the 'bug' caused by precision loss.

This isn't possible to do, so I don't think this should enter into consideration when evaluating a method for improving numerical stability.