golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

proposal: math: add Mean, Median, Mode, Variance, and StdDev #69264

Open hemanth0525 opened 2 months ago

hemanth0525 commented 2 months ago

Description:

This proposal aims to enhance the Go standard library’s math package (math/stats.go) by introducing several essential statistical functions. The proposed functions are Mean, Median, Mode, Variance, and StdDev.

Motivation:

The inclusion of these statistical functions directly in the math package will offer Go developers robust tools for data analysis and statistical computation, enhancing the language's utility in scientific and financial applications. Currently, developers often rely on external libraries for these calculations, which adds dependencies and potential inconsistencies. Integrating these functions into the standard library will:

Design:

The functions will be added to the existing math package, ensuring they are easy to use and integrate seamlessly with other mathematical operations. Detailed documentation and examples will be provided to illustrate their usage and edge case handling.

Examples:

gabyhelp commented 2 months ago

Related Issues and Documentation

(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)

ianlancetaylor commented 2 months ago

In general the math package aims to provide the functions that are in the C++ standard library <cmath>.

hemanth0525 commented 2 months ago

Thanks for the feedback! I get that the math package is meant to mirror the functions in C++'s <cmath>, but I think adding some built-in stats functions could be a nice improvement. A lot of developers deal with stats regularly, so having these in the standard library could make things easier without stepping too far from the package’s core purpose. Happy to chat more about it if needed!

earthboundkid commented 2 months ago

Can you do some detective work to see how people are dealing with this in open source Go now? Is there some go-stats package that has a million stars on Github? Are there ten libraries that are each imported five hundred times? Seeing that something has big demand already is important for bringing something that could be in a third party library into the standard library. Otherwise this will just get closed with "write a third party library." Which has certainly happened to me more than once!

hemanth0525 commented 2 months ago

I’ve done some digging into how statistical functions are currently being handled in the Go community. While libraries like Gonum and others provide statistical methods, there's no single source of truth or dominant package in this space, and many are designed for more complex or specialized tasks. However, the basic statistical functions we're proposing—like Mean, Median, Mode, Variance, and StdDev—are foundational for a wide range of applications, from simple data analysis to more advanced scientific and financial computations.

By integrating these into the standard library, we'd eliminate the need for external dependencies for basic tasks, which is in line with Go's philosophy of having a strong standard library for common use cases. While third-party packages are an option, including these functions in the math package would make Go more self-sufficient for everyday statistical needs, benefiting developers who want a simple, reliable way to compute these without resorting to third-party solutions.

seankhliao commented 2 months ago

for common use cases

This is the part where we need to see evidence. Especially considering the existence of libraries like gonum, how often does the need arise for functions like those proposed, where you wouldn't need the extra functionality that other libraries provide?

jimmyfrasche commented 2 months ago

For what it's worth, python has a statistics package in its standard library: https://docs.python.org/3/library/statistics.html

It would be nice to have a simple package everyone agrees on for common use cases, but that doesn't necessarily need to be in std.

randall77 commented 2 months ago

These functions sound pretty simple, but I think there's actually a lot of subtlety here. For instance, what does Mean do for rounding? Do we need to use Kahan's algorithm? What if the sum at some point rounds up to +Inf?

doggedOwl commented 2 months ago

Can you do some detective work to see how people are dealing with this in open source Go now? Is there some go-stats package that has a million stars on Github? Are there ten libraries that are each imported five hundred times? Seeing that something has big demand already is important for bringing something that could be in a third party library into the standard library. Otherwise this will just get closed with "write a third party library." Which has certainly happened to me more than once

In my experience, every time some numeric problem comes up, the gonum lib is suggested. They have a stats package: https://pkg.go.dev/gonum.org/v1/gonum@v0.15.1/stat

hemanth0525 commented 2 months ago

Can you do some detective work to see how people are dealing with this in open source Go now? Is there some go-stats package that has a million stars on Github? Are there ten libraries that are each imported five hundred times? Seeing that something has big demand already is important for bringing something that could be in a third party library into the standard library. Otherwise this will just get closed with "write a third party library." Which has certainly happened to me more than once

In my experience, every time some numeric problem comes up, the gonum lib is suggested. They have a stats package: https://pkg.go.dev/gonum.org/v1/gonum@v0.15.1/stat

Yeah, so think about having its functionality in the Go std lib straight away!

hemanth0525 commented 2 months ago

The Gonum library is indeed often suggested for statistical and numerical work in Go, and it has a dedicated stat package. It’s a robust library that covers a wide range of statistical functions, and for more complex needs, it's definitely a go-to solution.

However, my proposal is focused on adding foundational statistical functions like Mean, Median, Mode, Variance, and StdDev directly into the standard library. These are basic but essential tools that many developers need in day-to-day tasks, and having them in the standard library could save developers from importing an entire external library like Gonum for simple calculations. I believe integrating these functions would make Go more self-sufficient, particularly for developers who need straightforward statistical calculations without additional dependencies.

adonovan commented 2 months ago

IMHO these functions would be very useful in the standard library, even if (or indeed, because) the implementation requires some care. There are many "quick" uses of these basic stats operations in testing, benchmarking, and writing CL descriptions that shouldn't require a heavyweight dependency on a fully-featured third-party stats library. (I often end up moving data out of my Go program to the shell and running the github.com/nferraz/st command.)

Another function I would like is Percentile(n, series), which reports the nth percentile value of a given series.

jimmyfrasche commented 2 months ago

If it belongs in std, it should probably be in a "math/stats" or "math/statistics" instead of directly in "math".

meling commented 2 months ago

Here is a small experience report with existing stats packages: In some code I was using gonum’s stats package, and a collaborator started using github.com/montanaflynn/stats as well, whose API returns an error (which I felt was annoying.) Luckily, I caught the unnecessary dependency in code review.

These are the types of things that can easily cause unnecessary dependencies to get added in projects. Hence, I think adding common statistics functions would be a great addition to the std.

hemanth0525 commented 2 months ago

It seems like a lot of developers will benefit from this!

hemanth0525 commented 2 months ago

Can I get an update on this proposal?

adonovan commented 2 months ago

The proposal review committee will likely look at it this week. It usually takes a few rounds to reach a final decision.

hemanth0525 commented 2 months ago

The proposal review committee will likely look at it this week. It usually takes a few rounds to reach a final decision.

OK, cool!

hemanth0525 commented 1 month ago

Can I get an update on this proposal, please?

adonovan commented 1 month ago

Sorry, we didn't get to it last week, but perhaps will this week.

hemanth0525 commented 1 month ago

Yes, please.

adonovan commented 1 month ago

Some of the questions raised in the meeting were:

hemanth0525 commented 1 month ago

Thanks for the feedback! I totally get the concerns and here’s my take:

  1. Package Location: I agree that a new math/stats package makes sense. It keeps things organized and prevents the core math package from becoming too broad. We can start with the basics—mean, median, mode, variance, etc.—covering foundational stats functions that are universally useful.

  2. Scope: Let’s keep it simple for now. The goal should be to provide common, practical functions that people need for everyday testing, benchmarking, and basic analytics. We don’t need to cover advanced statistical methods yet, just the essentials. And yeah, potential addons would be: Percentile, Quartiles, Geometric Mean, Harmonic Mean, Mean Absolute Deviation (MAD), Coefficient of Variation (CV), Cumulative Sum (Cumsum), Root Mean Square (RMS), Skewness, Kurtosis, Covariance, Correlation Coefficient, Z-Score, and so on.

  3. Generics: I don’t think we need generics here. Users can convert integers to floats if needed, and keeping it focused on simplicity will make the package more accessible.

  4. Mode Function: For cases like [1, 2], we can return nil or an empty slice [] if no mode exists, or return all modes in a slice when there’s more than one. That way, it’s clear and flexible.

Overall, I think this keeps the package lightweight, practical, and easy to use, which should be the priority. Looking forward to hearing your thoughts!

adonovan commented 1 month ago

And yeah potential addons would be Percentile, ...[long list]...

I think the goal of limiting the scope would be to ensure that these (other than Percentile) are not potential additions. ;-)

I agree that a slice result for Mode seems appropriate. Perhaps it should be called Modes.

hemanth0525 commented 1 month ago

Yeah, gotcha!

Totally agree.

jimmyfrasche commented 1 month ago

The python lib has a good scope set in its description (essentially "what you'd find on a calculator")

ianlancetaylor commented 1 month ago

I think the main goal of making these functions generic would be to support either float32 or float64.

doggedOwl commented 1 month ago

I don't see a reason for these to not be generic. Accepting a slice of integers does not bring complexity in the implementation, and it would be a pity if even now we needed to convert manually, when these are naturally functions over numeric values.

For example, many surveys would naturally have integer slice values. Depending on the size of the input, the conversion could be a significant allocation factor. Speaking of allocation, would it be useful to also have an *Iter variant of the functions that takes an iterator instead of a slice?

aclements commented 1 month ago

This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings.

hemanth0525 commented 1 month ago

I'm working on this. Could you please assign it to me?

gophun commented 1 month ago

@hemanth0525 It doesn't make much sense to work on something that isn't yet accepted by the proposal process.

hemanth0525 commented 1 month ago

@hemanth0525 It doesn't make much sense to work on something that isn't yet accepted by the proposal process.

Initially, I created a pull request without realizing that Go follows a formal proposal process. Afterward, I submitted the required proposal.

gophun commented 1 month ago

@hemanth0525

Yes, see https://github.com/golang/proposal. This proposal is now in the 'Active' state, meaning it receives a few minutes of consideration by the review group every week (typically on Wednesdays if I recall correctly). It can take anywhere from several weeks to months before a final decision is made.

If you'd like to do something, you can update the initial proposal by adding the proposed function signatures (without implementation) along with their doc comments to reflect the current state of the discussion. For an example, see https://github.com/golang/go/issues/45955#issue-875940635

hemanth0525 commented 1 month ago

@gophun Yeah, I understand. Thank you!

jdemeyer commented 1 month ago

accepting a slice of Integers does not bring complexity in the implementation

What's the mean of []int{2, 3} then? The integer 2 (when naively computing (2 + 3)/2 as int) or float64(2.5)? Always returning float64 doesn't sound right either, as you want to return float32 if the input is []float32.
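The question can be illustrated with a short sketch. It assumes a hypothetical Float constraint limited to float32 and float64 (the use case mentioned earlier in the thread for generics) so that the result type follows the element type, and contrasts it with a naive integer version. None of these names come from the proposal.

```go
package main

import "fmt"

// Float is a hypothetical constraint covering the two floating-point
// kinds, so that []float32 input yields a float32 result.
type Float interface {
	~float32 | ~float64
}

// Mean returns the arithmetic mean in the element type of xs.
// For an empty slice it returns NaN (0/0).
func Mean[F Float](xs []F) F {
	var sum F
	for _, x := range xs {
		sum += x
	}
	return sum / F(len(xs))
}

// IntMeanNaive shows the pitfall: integer division truncates.
// It panics on an empty slice (integer division by zero).
func IntMeanNaive(xs []int) int {
	sum := 0
	for _, x := range xs {
		sum += x
	}
	return sum / len(xs)
}

func main() {
	fmt.Println(IntMeanNaive([]int{2, 3})) // 2
	fmt.Println(Mean([]float64{2, 3}))     // 2.5
	fmt.Println(Mean([]float32{2, 3}))     // 2.5
}
```

Widening the constraint to integers would force a decision about the result type, which is exactly the open question here.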

tianon commented 1 month ago

That's just a more obvious form of the same problem at the bounds of float precision, right?

jdemeyer commented 1 month ago

+1 to Percentile as that's less obvious to implement correctly, but actually useful. And you could implement Median as just calling Percentile(x, 0.5).

aclements commented 1 month ago

Percentile is tempting, but not nearly as universally agreed upon as these other operations. There's a standard taxonomy of nine different definitions of percentile/quantile. Maybe there's value in being opinionated here, or maybe it's an attractive nuisance.

hemanth0525 commented 1 month ago

Can I get a status update?

adonovan commented 1 month ago

Can I know the status ??

This is the status:

It can take anywhere from several weeks to months before a final decision is made.

hemanth0525 commented 1 month ago

Can I receive any updates, at least weekly?

ianlancetaylor commented 1 month ago

@hemanth0525 I appreciate this issue is important to you. Please understand that we have over 700 proposals waiting for attention, as can be seen at https://github.com/orgs/golang/projects/17. It's not feasible for our small team to provide weekly updates for each separate proposal. You can track the proposal review activities at #33502.

hemanth0525 commented 1 month ago

@ianlancetaylor Yes, I get it. Thanks!

aclements commented 3 weeks ago

What's the scope?

The scope of this package should be fairly narrow. If you search for "basic descriptive statistics", basically all results include mean, median, mode, and standard deviation. Variance is also common. "Range" is pretty common, but that's easy to get with the min and max built-ins. Most include some form of quantile/percentile/quartile.
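As a small sketch of the "range" point: the built-in min and max take individual arguments rather than a slice, so for slice data the closest fit is slices.Min and slices.Max. The Range helper below is hypothetical.

```go
package main

import (
	"fmt"
	"slices"
)

// Range returns the difference between the largest and smallest
// value in xs. It panics on an empty slice, because slices.Min and
// slices.Max do.
func Range(xs []float64) float64 {
	return slices.Max(xs) - slices.Min(xs)
}

func main() {
	fmt.Println(Range([]float64{3, 1, 4, 1, 5})) // 4

	// With a fixed number of values, the built-ins work directly.
	a, b, c := 3.0, 1.0, 2.0
	fmt.Println(max(a, b, c) - min(a, b, c)) // 2
}
```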

The Python statistics package is an interesting example here (thanks @jimmyfrasche), as it aims to be a small collection of common operations. However, I think it actually goes too far. I was particularly surprised to see kernel density estimation in there, as I consider that, and especially picking good KDE parameters, a fairly advanced statistical method.

Which package?

math/stats could invite feature creep. On the other hand, it's scoped and purposeful. It's also easier to search for.

math currently follows the C library, but I'm not convinced that's very important (Go isn't C). However, everything in math operates on one or two float64s, so this would be a break from that. math already mixes together a few different fields (e.g., there's no math/trig), but that's probably just because it follows the C math library. It already has a few other sub-packages for different data types (math/cmplx) and specific fields (math/bits).

Overall I'm leaning toward math/stats.

Operations

Quantile: I personally find myself wanting quantiles quite often, so this is certainly tempting. We should get a statistics expert to weigh in on which definition to use. I do think this should be "quantile" and not "percentile".

Variance and standard deviation: Are these for populations or do they apply sample correction? Do we provide both a population form and a sample-corrected form (this is what Python does)? If we're going to provide sample forms, which of the various corrections do we use?

Mode: I'm not completely convinced that we should include mode. If we do, I'd suggest only including "multimode", which returns a possibly-nil slice, as this is a total function, unlike mode.

adonovan commented 3 weeks ago

Quantile: I personally find myself wanting quantiles quite often, so this is certainly tempting. We should get a statistics expert to weigh in on which definition to use. I do think this should be "quantile" and not "percentile".

Meaning the parameter should be in [0,1] not [0,100]? Or that one should provide lower and upper bounds for the portion of the CDF of interest?

Variance and standard deviation: Are these for populations or do they apply sample correction? Do we provide both a population form and a sample-corrected form (this is what Python does)? If we're going to provide sample forms, which of the various corrections do we use?

I would think that population is more in line with the typical use of such a package, but it may be safer to provide both with distinct names, preventing casual use of the wrong one. The doc comments should provide clear examples of which one is appropriate.

Mode: I'm not completely convinced that we should include mode. If we do, I'd suggest only including "multimode", which returns a possibly-nil slice, as this is a total function, unlike mode.

I agree; I proposed Modes([]float) []float to acknowledge its multiplicity up front.

seehuhn commented 2 weeks ago

About the different ways to compute quantiles: R, which is very mainstream in statistics, implements 9 different quantile algorithms and lets the user choose. Documentation is at https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/quantile . (I didn't check whether this is the same list of methods as in the Wikipedia article quoted above.)

Merovius commented 1 week ago

I'm not sure about the proposed API. Specifically, it seems to me that these should arguably take iter.Seq[float64] instead of []float64, from a pure API perspective. But if you need more than one of these outputs (which I would assume you commonly do), iter.Seq makes it clear that it's less efficient to iterate multiple times instead of having a single loop that does a bunch of accumulations. The same concern ultimately exists with slices; it's just less obvious.

So to me, this API only really makes sense for small data series, where the cost of looping multiple times is negligible and/or you are fine with pre-allocating them. An API to remedy that is arguably too complex for the stdlib.

Are we okay with that limitation? If so, should we still make the arguments iter.Seq?

jimmyfrasche commented 1 week ago

Another design would be a single value with methods for all the various stats and whose factories take the slice or sequence (and any weights). That way it could do any sorting or failing on NaN upfront and cache any intermediate values required by multiple stats.

Something like

    stats, err := statistics.For(floats)
    // handle err
    fmt.Println(stats.Mean(), stats.Max(), stats.Median())

adonovan commented 1 week ago

Are we okay with that limitation [that the cost of looping multiple times is negligible and/or you are fine with pre-allocating them]? If so, should we still make the arguments iter.Seq?

Though an iter.Seq[float64] is the logical parameter type, I suspect it is not only less efficient (because of repeated passes) but also less convenient (because typically one has a slice already). Although an iterator would allow the caller to avoid materializing an array of float64 when they have some other data structure (such as an array of integers or structs), I suspect the work to define a one-off iterator over that data structure is probably still more than to create a []float64 slice from it. So, []float64 is probably more convenient. And as you point out, if multiple statistics are required, it may be more efficient too, but that's a secondary concern.

Another design would be a single value with methods for all the various stats

There's a tantalizing idea here that perhaps one could just call fmt.Println(stats.For(series)) and obtain a nice string showing the mean, median, percentiles and so on, not unlike the convenience of fmt.Println(time.Since(t0)). But the Percentile operator requires an argument (0.9, 0.99, etc.). I think the API originally proposed is simpler.

jimmyfrasche commented 1 week ago

@adonovan It could leave percentiles out of the printed output, or just print quartiles, and you would have to ask if you need something more specific.

My main thought with the API is that it makes it clear that it's taking ownership. I'm guessing in most cases you want more than one stat at a time, so if it can cache some intermediate value that gets used for more than one stat, or speed things up by storing the numbers in a special order or data structure, that's a nice bonus. I don't know what specific numerical methods are used for stats, but I imagine there could be some savings by caching the sum or just knowing if there's a +Inf in there somewhere.