Open hemanth0525 opened 2 months ago
In general the math package aims to provide the functions that are in the C++ standard library `<math>`.
Thanks for the feedback! I get that the math package is meant to mirror the functions in C++'s `<cmath>`, but I think adding some built-in stats functions could be a nice improvement. A lot of developers deal with stats regularly, so having these in the standard library could make things easier without stepping too far from the package’s core purpose. Happy to chat more about it if needed!
Can you do some detective work to see how people are dealing with this in open source Go now? Is there some go-stats package that has a million stars on Github? Are there ten libraries that are each imported five hundred times? Seeing that something has big demand already is important for bringing something that could be in a third party library into the standard library. Otherwise this will just get closed with "write a third party library." Which has certainly happened to me more than once!
I’ve done some digging into how statistical functions are currently being handled in the Go community. While libraries like Gonum and others provide statistical methods, there's no single source of truth or dominant package in this space, and many are designed for more complex or specialized tasks. However, the basic statistical functions we're proposing—like `Mean`, `Median`, `Mode`, `Variance`, and `StdDev`—are foundational for a wide range of applications, from simple data analysis to more advanced scientific and financial computations.
By integrating these into the standard library, we'd eliminate the need for external dependencies for basic tasks, which is in line with Go's philosophy of having a strong standard library for common use cases. While third-party packages are an option, including these functions in the `math` package would make Go more self-sufficient for everyday statistical needs, benefiting developers who want a simple, reliable way to compute these without resorting to third-party solutions.
for common use cases
This is the part where we need to see evidence. Especially considering the existence of libraries like Gonum: how often does the need arise for functions like those proposed, where you wouldn't need the extra functionality that other libraries provide?
For what it's worth, Python has a statistics package in its standard library: https://docs.python.org/3/library/statistics.html
It would be nice to have a simple package everyone agrees on for common use cases, but that doesn't necessarily need to be in std.
These functions sound pretty simple, but I think there's actually a lot of subtlety here. For instance, what does `Mean` do for rounding? Do we need to use Kahan's algorithm? What if the sum at some point rounds up to +Inf?
Can you do some detective work to see how people are dealing with this in open source Go now? Is there some go-stats package that has a million stars on Github? Are there ten libraries that are each imported five hundred times? Seeing that something has big demand already is important for bringing something that could be in a third party library into the standard library. Otherwise this will just get closed with "write a third party library." Which has certainly happened to me more than once
In my experience, every time some numeric problem comes up, the Gonum lib is suggested. They have a stats package: https://pkg.go.dev/gonum.org/v1/gonum@v0.15.1/stat
Yeah, so think about having its functionality in the Go std lib straight away!
The Gonum library is indeed often suggested for statistical and numerical work in Go, and it has a dedicated `stat` package. It’s a robust library that covers a wide range of statistical functions, and for more complex needs, it's definitely a go-to solution.
However, my proposal is focused on adding foundational statistical functions like `Mean`, `Median`, `Mode`, `Variance`, `StdDev`, ... directly into the standard library. These are basic but essential tools that many developers need in day-to-day tasks, and having them in the standard library could save developers from importing an entire external library like Gonum for simple calculations. I believe integrating these functions would make Go more self-sufficient, particularly for developers who need straightforward statistical calculations without additional dependencies.
IMHO these functions would be very useful in the standard library, even if (or indeed, because) the implementation requires some care. There are many "quick" uses of these basic stats operations in testing, benchmarking, and writing CL descriptions that shouldn't require a heavyweight dependency on a fully-featured third-party stats library. (I often end up moving data out of my Go program to the shell and running the github.com/nferraz/st command.)
Another function I would like is Percentile(n, series), which reports the nth percentile value of a given series.
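For illustration, a percentile over a sorted series might look like the sketch below. It uses linear interpolation between closest ranks (the definition R and NumPy use by default); as discussed later in the thread, there are several competing definitions, so the choice here, along with the function name and signature, is purely an assumption:

```go
package main

import (
	"fmt"
	"math"
)

// percentile returns the p-th percentile (0 <= p <= 100) of an
// already-sorted series, interpolating linearly between the two
// closest ranks. Hypothetical sketch: name, signature, and the
// choice among the many percentile definitions are all assumptions.
func percentile(sorted []float64, p float64) float64 {
	if len(sorted) == 0 {
		panic("percentile of empty series")
	}
	rank := p / 100 * float64(len(sorted)-1)
	lo := int(math.Floor(rank))
	hi := int(math.Ceil(rank))
	frac := rank - float64(lo) // 0 when rank lands exactly on an element
	return sorted[lo]*(1-frac) + sorted[hi]*frac
}

func main() {
	data := []float64{1, 2, 3, 4} // must be sorted
	fmt.Println(percentile(data, 50)) // 2.5
	fmt.Println(percentile(data, 100)) // 4
}
```

Requiring the input to be pre-sorted (rather than sorting a copy internally) is another open design choice this sketch does not settle.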
If it belongs in std, it should probably be in a `"math/stats"` or `"math/statistics"` instead of directly in `"math"`.
Here is a small experience report with existing stats packages: In some code I was using gonum’s stats package, and a collaborator started using github.com/montanaflynn/stats as well, whose API returns an error (which I felt was annoying.) Luckily, I caught the unnecessary dependency in code review.
These are the types of things that can easily cause unnecessary dependencies to get added in projects. Hence, I think adding common statistics functions would be a great addition to the std.
It seems like a lot of developers will benefit from this!
Can I know the update on this proposal?
The proposal review committee will likely look at it this week. It usually takes a few rounds to reach a final decision.
OK, cool!
Can I know the update on this proposal, please?
Sorry, we didn't get to it last week, but perhaps will this week.
Yes, please.
Some of the questions raised in the meeting were:

The `math` package aligns with the C++ math package, so it does not seem the appropriate home. Perhaps `math/stats`? But this might create a temptation to add a lot more statistical functions. Which leads to:

Thanks for the feedback! I totally get the concerns and here’s my take:
Package Location: I agree that a new `math/stats` package makes sense. It keeps things organized and prevents the core `math` package from becoming too broad. We can start with the basics—mean, median, mode, variance, etc.—covering foundational stats functions that are universally useful.
Scope: Let’s keep it simple for now. The goal should be to provide common, practical functions that people need for everyday testing, benchmarking, and basic analytics. We don’t need to cover advanced statistical methods yet—just the essentials. And yeah, potential add-ons would be [Percentile, Quartiles, Geometric Mean, Harmonic Mean, Mean Absolute Deviation (MAD), Coefficient of Variation (CV), Cumulative Sum (Cumsum), Root Mean Square (RMS), Skewness, Kurtosis, Covariance, Correlation Coefficient, Z-Score, ...]
Generics: I don’t think we need generics here. Users can convert integers to floats if needed, and keeping it focused on simplicity will make the package more accessible.
Mode Function: For cases like `[1, 2]`, we can return `nil` or an empty slice `[]` if no mode exists, or return all modes in a slice when there’s more than one. That way, it’s clear and flexible.
Overall, I think this keeps the package lightweight, practical, and easy to use, which should be the priority. Looking forward to hearing your thoughts!
And yeah potential addons would be Percentile, ...[long list]...
I think the goal of limiting the scope would be to ensure that these (other than Percentile) are not potential additions. ;-)
I agree that a slice result for Mode seems appropriate. Perhaps it should be called Modes.
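A minimal sketch of the `Modes` behavior discussed above might look like this. The nil-result rule for "no value repeats" and the sorted output are my assumptions, not decisions from the thread:

```go
package main

import (
	"fmt"
	"sort"
)

// modes returns every value that occurs most frequently, or nil when
// no value occurs more than once (e.g. [1, 2] has no mode). Sketch of
// the slice-returning design discussed above; details are assumptions.
func modes(xs []float64) []float64 {
	counts := make(map[float64]int)
	max := 0
	for _, x := range xs {
		counts[x]++
		if counts[x] > max {
			max = counts[x]
		}
	}
	if max <= 1 {
		return nil // every value unique: no mode
	}
	var ms []float64
	for x, c := range counts {
		if c == max {
			ms = append(ms, x)
		}
	}
	sort.Float64s(ms) // map iteration order is random; sort for determinism
	return ms
}

func main() {
	fmt.Println(modes([]float64{1, 2, 2, 3, 3})) // [2 3]
	fmt.Println(modes([]float64{1, 2}))          // []
}
```

Returning all modes makes this a total function, which sidesteps the error-vs-panic question that a single-valued `Mode` would raise.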
Yeah, gotcha!
Totally agree.
The Python lib has a good scope set in its description (essentially "what you'd find on a calculator").
I think the main goal of making these functions generic would be to support either `float32` or `float64`.
I don't see a reason for these not to be generic. Accepting a slice of integers does not bring complexity to the implementation, and it would be a pity if we still needed to convert manually when these are naturally functions over numeric values.
For example, many surveys would naturally have integer slice values; depending on the size of the input, the conversion could be a significant allocation factor. Speaking of allocation, would it be useful to also have a *Iter variant of the methods that takes an iterator instead of slices?
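To illustrate the claim that a generic version adds little implementation complexity, here is a sketch of a generic `Mean`. The constraint name, its exact type list, and the always-`float64` return type are all assumptions; the return type in particular is exactly the question debated later in the thread:

```go
package main

import "fmt"

// Real is a hypothetical constraint covering the numeric types
// mentioned in the discussion.
type Real interface {
	~int | ~int32 | ~int64 | ~float32 | ~float64
}

// Mean accepts a slice of any real-number type. Always returning
// float64 avoids integer truncation, but whether the result should
// instead mirror a floating-point element type is an open question.
func Mean[T Real](xs []T) float64 {
	var sum float64
	for _, x := range xs {
		sum += float64(x)
	}
	return sum / float64(len(xs))
}

func main() {
	fmt.Println(Mean([]int{2, 3}))        // 2.5, not the truncated int 2
	fmt.Println(Mean([]float64{1, 2, 3})) // 2
}
```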
This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings.
I'm working on this. Could you please assign it to me?
@hemanth0525 It doesn't make much sense to work on something that isn't yet accepted by the proposal process.
Initially, I created a pull request without realizing that Go follows a formal proposal process. Afterward, I submitted the required proposal.
@hemanth0525
Yes, see https://github.com/golang/proposal. This proposal is now in the 'Active' state, meaning it receives a few minutes of consideration by the review group every week (typically on Wednesdays if I recall correctly). It can take anywhere from several weeks to months before a final decision is made.
If you'd like to do something, you can update the initial proposal by adding the proposed function signatures (without implementation) along with their doc comments to reflect the current state of the discussion. For an example, see https://github.com/golang/go/issues/45955#issue-875940635
@gophun Yeah I understand thank you !
accepting a slice of Integers does not bring complexity in the implementation
What's the mean of `[]int{2, 3}` then? The integer `2` (when naively computing `(2 + 3)/2` as `int`) or `float64(2.5)`? Always returning `float64` doesn't sound right either, as you want to return `float32` if the input is `[]float32`.
That's just a more obvious form of the same problem at the bounds of float precision, right?
+1 to `Percentile` as that's less obvious to implement correctly, but actually useful. And you could implement `Median` as just calling `Percentile(x, 0.5)`.
`Percentile` is tempting, but not nearly as universally agreed upon as these other operations. There's a standard taxonomy of nine different definitions of percentile/quantile. Maybe there's value in being opinionated here, or maybe it's an attractive nuisance.
Can I know the status ??
This is the status:
It can take anywhere from several weeks to months before a final decision is made.
Can I receive any updates, at least weekly?
@hemanth0525 I appreciate this issue is important to you. Please understand that we have over 700 proposals waiting for attention, as can be seen at https://github.com/orgs/golang/projects/17. It's not feasible for our small team to provide weekly updates for each separate proposal. You can track the proposal review activities at #33502.
@ianlancetaylor Yes, I get it. Thanks!
The scope of this package should be fairly narrow. If you search for "basic descriptive statistics", basically all results include mean, median, mode, and standard deviation. Variance is also common. "Range" is pretty common, but that's easy to get with the `min` and `max` built-ins. Most include some form of quantile/percentile/quartile.
The Python statistics package is an interesting example here (thanks @jimmyfrasche), as it aims to be a small collection of common operations. However, I think it actually goes too far. I was particularly surprised to see kernel density estimation in there, as I consider that, and especially picking good KDE parameters, a fairly advanced statistical method.
`math/stats` could invite feature creep. On the other hand, it's scoped and purposeful. It's also easier to search for.
`math` currently follows the C library, but I'm not convinced that's very important (Go isn't C). However, everything in math operates on one or two float64s, so this would be a break from that. `math` already mixes together a few different fields (e.g., there's no `math/trig`), but that's probably just because it follows the C math library. It already has a few other sub-packages for different data types (`math/cmplx`) and specific fields (`math/bits`).
Overall I'm leaning toward `math/stats`.
Quantile: I personally find myself wanting quantiles quite often, so this is certainly tempting. We should get a statistics expert to weigh in on which definition to use. I do think this should be "quantile" and not "percentile".
Variance and standard deviation: Are these for populations or do they apply sample correction? Do we provide both a population form and a sample-corrected form (this is what Python does)? If we're going to provide sample forms, which of the various corrections do we use?
Mode: I'm not completely convinced that we should include mode. If we do, I'd suggest only including "multimode", which returns a possibly-nil slice, as this is a total function, unlike mode.
Quantile: I personally find myself wanting quantiles quite often, so this is certainly tempting. We should get a statistics expert to weigh in on which definition to use. I do think this should be "quantile" and not "percentile".
Meaning the parameter should be in [0,1] not [0,100]? Or that one should provide lower and upper bounds for the portion of the CDF of interest?
Variance and standard deviation: Are these for populations or do they apply sample correction? Do we provide both a population form and a sample-corrected form (this is what Python does)? If we're going to provide sample forms, which of the various corrections do we use?
I would think that population is more in line with the typical use of such a package, but it may be safer to provide both with distinct names, preventing casual use of the wrong one. The doc comments should provide clear examples of which one is appropriate.
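Providing both forms with distinct names, as suggested above, might look like the sketch below. The names are hypothetical; the sample form uses Bessel's correction (dividing by n-1), which is the most common of the corrections the previous comment asks about:

```go
package main

import "fmt"

func mean(xs []float64) float64 {
	var sum float64
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// PopVariance divides the sum of squared deviations by n: the
// variance of the data treated as a whole population.
func PopVariance(xs []float64) float64 {
	m := mean(xs)
	var ss float64
	for _, x := range xs {
		d := x - m
		ss += d * d
	}
	return ss / float64(len(xs))
}

// SampleVariance divides by n-1 (Bessel's correction), giving an
// unbiased estimate of population variance from a sample.
func SampleVariance(xs []float64) float64 {
	m := mean(xs)
	var ss float64
	for _, x := range xs {
		d := x - m
		ss += d * d
	}
	return ss / float64(len(xs)-1)
}

func main() {
	data := []float64{1, 3}
	fmt.Println(PopVariance(data))    // 1
	fmt.Println(SampleVariance(data)) // 2
}
```

Distinct names make the choice explicit at every call site, which matches the "preventing casual use of the wrong one" goal; the two-pass formulation here also avoids the catastrophic cancellation of the naive sum-of-squares shortcut.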
Mode: I'm not completely convinced that we should include mode. If we do, I'd suggest only including "multimode", which returns a possibly-nil slice, as this is a total function, unlike mode.
I agree; I proposed `Modes([]float) []float` to acknowledge its multiplicity up front.
About the different ways to compute quantiles: R, which is very mainstream in statistics, implements 9 different quantile algorithms and lets the user choose. Documentation is at https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/quantile . (I didn't check whether this is the same list of methods as in the Wikipedia article quoted above.)
I'm not sure about the proposed API. Specifically, it seems to me that these should arguably take `iter.Seq[float64]` instead of `[]float64`, from a pure API perspective. But if you need more than one of these outputs (which I would assume you commonly do), `iter.Seq` makes it clear that it's less efficient to iterate multiple times, instead of having a single loop that does a bunch of accumulations. The same concern ultimately exists with slices; it's just less obvious.
So to me, this API only really makes sense for small data series, where the cost of looping multiple times is negligible and/or you are fine with pre-allocating them. An API to remedy that is arguably too complex for the stdlib.
Are we okay with that limitation? If so, should we still make the arguments `iter.Seq`?
Another design would be a single value with methods for all the various stats, whose factories take the slice or sequence (and any weights). That way it could do any sorting or failing on NaN up front and cache any intermediate values required by multiple stats.
Something like
```go
stats, err := statistics.For(floats)
// handle err
fmt.Println(stats.Mean(), stats.Max(), stats.Median())
```
Are we okay with that limitation [that the cost of looping multiple times is negligible and/or you are fine with pre-allocating them]? If so, should we still make the arguments `iter.Seq`?
Though an `iter.Seq[float64]` is the logical parameter type, I suspect it is not only less efficient (because of repeated passes) but also less convenient (because typically one has a slice already). Although an iterator would allow the caller to avoid materializing an array of float64 when they have some other data structure (such as an array of integers or structs), I suspect the work to define a one-off iterator over that data structure is probably still more than to create a `[]float64` slice from it. So, `[]float64` is probably more convenient. And as you point out, if multiple statistics are required, it may be more efficient too, but that's a secondary concern.
Another design would a single value with methods for all the various stats
There's a tantalizing idea here that perhaps one could just call `fmt.Println(stats.For(series))` and obtain a nice string showing the mean, median, percentiles and so on, not unlike the convenience of `fmt.Println(time.Since(t0))`. But the Percentile operator requires an argument (0.9, 0.99, etc.). I think the API originally proposed is simpler.
@adonovan It could skip percentiles, or just print quartiles, and you'd have to ask if you need something more specific.
My main thought with the API is that it makes it clear that it's taking ownership. I'm guessing in most cases you want more than one stat at a time so if it can cache some intermediary value that gets used for more than one stat or speed things up by storing the numbers in a special order or data structure that's a nice bonus. I don't know what the specific numerical methods are used for stats but I imagine there could be some savings by caching the sum or just knowing if there's a +Inf in there somewhere.
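A sketch of that caching design, under the assumption that sorting once up front is the main shared intermediate work (error handling for NaN, shown in the earlier snippet, is omitted here for brevity; all names are hypothetical):

```go
package main

import (
	"fmt"
	"slices"
)

// Stats owns a sorted copy of the data plus the cached sum, so that
// Mean, Min, Max, and Median all reuse work done once at construction.
type Stats struct {
	sorted []float64
	sum    float64
}

// For takes ownership of a copy of xs, sorting it up front.
func For(xs []float64) *Stats {
	s := &Stats{sorted: slices.Clone(xs)}
	slices.Sort(s.sorted)
	for _, x := range xs {
		s.sum += x
	}
	return s
}

func (s *Stats) Mean() float64 { return s.sum / float64(len(s.sorted)) }
func (s *Stats) Min() float64  { return s.sorted[0] }
func (s *Stats) Max() float64  { return s.sorted[len(s.sorted)-1] }

// Median is O(1) here because the data is already sorted.
func (s *Stats) Median() float64 {
	n := len(s.sorted)
	if n%2 == 1 {
		return s.sorted[n/2]
	}
	return (s.sorted[n/2-1] + s.sorted[n/2]) / 2
}

func main() {
	stats := For([]float64{3, 1, 2})
	fmt.Println(stats.Mean(), stats.Min(), stats.Max(), stats.Median())
}
```

Cloning in `For` makes the ownership transfer explicit, at the cost of one allocation, which is the trade-off the comment above is weighing.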
Description:
This proposal aims to enhance the Go standard library’s `math` (`math/stats.go`) package by introducing several essential statistical functions. The proposed functions are:

Motivation:
The inclusion of these statistical functions directly in the `math` package will offer Go developers robust tools for data analysis and statistical computation, enhancing the language's utility in scientific and financial applications. Currently, developers often rely on external libraries for these calculations, which adds dependencies and potential inconsistencies. Integrating these functions into the standard library will:

Design:
The functions will be added to the existing `math` package, ensuring they are easy to use and integrate seamlessly with other mathematical operations. Detailed documentation and examples will be provided to illustrate their usage and edge case handling.

Examples: