mdhaber / scipy

Scipy library main repository
http://scipy.org/scipylib/
BSD 3-Clause "New" or "Revised" License

A Solid Foundation for Statistics in Python with SciPy #26

Closed mdhaber closed 1 year ago

mdhaber commented 4 years ago

Overview of "A Solid Foundation for Statistics in Python with SciPy".

Expand tools for the analysis of variance

New Statistical Tests

Improve Existing Tests

Fitting Probability Distributions to Data

New Probability Distributions

Improve underlying code for PDF and CDF calculations

Decrease Open Statistics Issues

By the end of the project, we want the number of open stats issues to be below 282 (number of open stats issues on 3/18/2020), and preferably under 261 (number of open stats issues on 3/18/2020 created before project start date 2/1/2020). This is @mdhaber's list of issues to watch/fix; none need to be closed to finish the project, but it would be great to make a dent.

Outreach Event

Other

rlucas7 commented 4 years ago

@mdhaber you can check off the box for the 'check for NaN in spearmanrho' now.

mdhaber commented 4 years ago

Yup, thanks!

rlucas7 commented 4 years ago

@mdhaber you can check off the box for the multivariate t distribution now too.

WarrenWeckesser commented 4 years ago

https://github.com/scipy/scipy/pull/11119 was merged, so you can check off the new cramervonmises test.

WarrenWeckesser commented 4 years ago

PR for the relative risk: https://github.com/scipy/scipy/pull/13048

mdhaber commented 4 years ago

@WarrenWeckesser Two more weeks left in the quarter...

rlucas7 commented 4 years ago

@mdhaber you can check off the multivariate hypergeometric box:

multivariate hypergeometric distribution - scipy#12585, scipy#12839 (@mdhaber)

caos21 commented 3 years ago

I apologize if this is not the appropriate channel to open this discussion.

Following the issues covered in #11477, I would like to share my findings related to dcdflib, which is used to evaluate the cumulative distribution function (CDF).

What can be done?

In the last two cases, the modifications to the code would need to be followed up on.

Why?

I would be happy to help in any direction you decide.

mdhaber commented 3 years ago

Hi @caos21, thanks for mentioning this. @mckib2 is actually working on replacing parts of scipy.stats with the Boost versions in #48. Would you be interested in taking a look at that? We're not going to change everything at once; this first PR will only replace SciPy's beta, binom, and nbinom distributions. The idea is to get all the machinery in place so that it will be easy to take things from Boost as needed in future PRs. Would this make it easy to replace cdflib with Boost's tools?

caos21 commented 3 years ago

Hi and thank you @mdhaber, I think that is reasonable. But first, I would like to inspect how involved cdflib is throughout SciPy. In the meantime, I can update cdflib to V1.1 and apply all the patches and modifications made in the past. That way, I hope nothing breaks.

Should we move this discussion to #48?

mdhaber commented 3 years ago

In that case, it would probably be better to open an issue or PR on the main repo, or maybe email the mailing list to get wider attention. Only a few of us are working here now.

caos21 commented 3 years ago

Perfect, I will do that, and afterwards I will jump into Boost #48 to see how I can be of use.

mckib2 commented 3 years ago

@mdhaber Don't know if we're interested in still keeping this list up to date:

mdhaber commented 3 years ago

Thanks @mckib2. We've been working from Monday.com recently, but it is still good to check these off.

mdhaber commented 2 years ago

Functions/distributions we might want to borrow from Boost:

mdhaber commented 2 years ago

@tupui At a glance, these are the issues and PRs we have open for multivariate distributions. Multivariate distributions represent ~1/5 of the open issues and PRs with the scipy.stats label.

Multivariate Distributions - 30 of the 187 issues with scipy.stats label, 9 out of 52 PRs with scipy.stats label as of 3/14/2022.

Multivariate Normal - (Fewer than 150 lines of real code have all these issues and PRs.)

PRs

Issues

New Distribution

PRs

Issues

Other

PRs

Issues

mdhaber commented 2 years ago

IIRC, several bugs involving constant input (i.e. all elements of a slice equal) have been reported. I'll collect them here as I run across them.

gh-13254 tried to address this for some functions, but I suspect it is a widespread problem.
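
For illustration, here is a minimal sketch of how constant slices could be flagged before a statistic is computed. It assumes NumPy only; `constant_slices` is a hypothetical helper name, not something in SciPy and not necessarily what gh-13254 does.

```python
import numpy as np


def constant_slices(x, axis=0):
    """Flag slices along `axis` whose elements are all equal.

    Hypothetical helper for illustration only. Assumes every slice is
    non-empty; slices containing NaN are never flagged because NaN != NaN.
    """
    x = np.asarray(x)
    # Take the first element of each slice, keeping the axis so it broadcasts.
    first = np.take(x, [0], axis=axis)
    return np.all(x == first, axis=axis)


data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 4.0, 4.0]])
mask = constant_slices(data, axis=1)
print(mask)  # second row is constant, first is not
```

A function that receives such a mask could then decide per slice whether to return NaN, emit a warning, or use a special-case result, rather than dividing by a zero variance.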