haskell / statistics

A fast, high quality library for computing with statistics in Haskell.
http://hackage.haskell.org/package/statistics
BSD 2-Clause "Simplified" License

More variants of linear regression needed #67

Open Shimuuar opened 10 years ago

Shimuuar commented 10 years ago

We need more variants of linear regression:

  1. Regression for values with known normal errors. Here we'll obtain not only the fit result but also an estimate of the errors and a χ² goodness-of-fit test. I think we need a multivariate normal distribution and a special data type for describing values with normal errors (a sketch follows this list).
  2. Regression with a linear combination of several functions. (In R it's known as glm.) It's another way to look at the same regression problem, but I think it deserves an addition to the library as well.
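
To make point 1 a bit more concrete, here is a hypothetical sketch of the shapes such an API might take; `Measured`, `WeightedFit` and `weightedRegress` are placeholder names, not anything that exists in the package.

```haskell
import qualified Data.Vector.Unboxed as U

-- Hypothetical sketch only: a value carrying its known standard deviation,
-- and one possible shape for the fit result described in point 1 above.
data Measured = Measured
  { measValue :: !Double   -- observed value
  , measSigma :: !Double   -- known standard deviation of the observation
  }

data WeightedFit = WeightedFit
  { fitCoeffs :: U.Vector Double   -- best-fit coefficients
  , fitErrors :: U.Vector Double   -- estimated standard errors of the coefficients
  , fitChi2   :: !Double           -- chi-square goodness-of-fit statistic
  }

-- One vector per predictor, plus the measured responses with known errors.
weightedRegress :: [U.Vector Double] -> [Measured] -> WeightedFit
weightedRegress = undefined   -- implementation omitted in this sketch
```
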
raibread commented 7 years ago

I've forked the repository and started working a bit on part 1 of this. I think it makes more practical sense, however, to work on the case where the error is known to be normal but with unknown variance. So we'd end up with t-distributed estimates and an F test for goodness of fit. This is the standard approach taken by R when using lm. This might require a data type that is similar to the existing NormalErr data type but includes degrees of freedom... maybe one of the parameters of TErr would be the estimate of sigma and the other would be the reference distribution of type studentT.
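
A minimal sketch of what such a TErr might look like, building on the StudentT distribution already in the package (the field names here are placeholders, not a concrete proposal):

```haskell
import Statistics.Distribution.StudentT (StudentT)

-- Hypothetical shape for an estimate with t-distributed error: the point
-- estimate, the estimated standard error (sigma hat), and the reference
-- Student t distribution, which carries the degrees of freedom.
data TErr = TErr
  { tEstimate :: !Double    -- point estimate
  , tStdErr   :: !Double    -- estimated standard error
  , tRefDist  :: !StudentT  -- reference distribution (encodes the d.o.f.)
  }
```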

Shimuuar commented 7 years ago

I assume you're talking about the case where the error is unknown but normally distributed and the same for every point, right? So you can try to estimate both the best fit and the variance of the data points.

However, the case where the errors are known is very important in practice. Usually it comes up when you have an error estimate for each point and the estimates vary from point to point. This is very common in experimental physics.

raibread commented 7 years ago

That's correct. From the statistician's point of view, this is probably the most natural extension of olsRegress. I'm nearly done with this (it's called normalRegress in my fork) and am currently addressing the second task mentioned in this issue:

"Regression with linear combination of several functions" sounds more like a transformed-data model where you transform your predictors before running standard linear regression. This is quite different from what glm accomplishes. glm extends linear regression to models with non-normal response data (e.g. Poisson, Binomial, etc.) by modeling a transformed mean function as a linear combination of predictors by way of an appropriate link function given the response distribution.

Which were you intending?
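
To spell out that glm structure in the usual textbook notation (this is just standard notation, nothing from the package): the link function g relates the mean of the response to the linear predictor,

```latex
g\bigl(\mathbb{E}[y_i]\bigr) = x_i^{\top}\beta,
\qquad\text{e.g. Poisson responses with the log link: }
\mathbb{E}[y_i] = \exp\bigl(x_i^{\top}\beta\bigr).
```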


So in these experimental physics applications you have a priori error estimates which you take as truth? Can the errors have an arbitrary correlation structure? This should be a (mostly) straightforward implementation of the generalized least squares algorithm. If I'm understanding you correctly, in addition to giving the algorithm the response and covariates, you would also give it a covariance matrix for the errors, right?
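
If that reading is right, a minimal generalized least squares sketch could look like the following. It uses hmatrix for brevity rather than the package's own matrix types, assumes the error covariance matrix is known and invertible, and glsRegress is just an illustrative name:

```haskell
import Numeric.LinearAlgebra
import Prelude hiding ((<>))  -- avoid the clash with hmatrix's matrix product

-- Generalized least squares with known error covariance S:
--   beta = (X' S^-1 X)^-1 X' S^-1 y
--   cov  = (X' S^-1 X)^-1            -- covariance of the estimates
--   chi2 = r' S^-1 r                 -- goodness-of-fit statistic
-- For independent per-point errors, S is simply diag(sigma_i^2).
glsRegress :: Matrix Double   -- design matrix X, one column per predictor
           -> Matrix Double   -- error covariance matrix S
           -> Vector Double   -- responses y
           -> (Vector Double, Matrix Double, Double)
glsRegress x s y = (beta, covB, chi2)
  where
    w    = inv s                      -- precision matrix S^-1
    covB = inv (tr x <> w <> x)
    beta = covB #> (tr x #> (w #> y))
    r    = y - x #> beta              -- residuals
    chi2 = r <.> (w #> r)
```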

Shimuuar commented 7 years ago

Strangely enough, I've never encountered this variant of regression before. However, I have heard of attempts to estimate the errors from such a fit.

At the moment the most important deficiency is the complete lack of documentation. An undocumented statistical algorithm (especially an uncommon one) just does some weird calculation and produces a result that means nothing.


I meant regressions of the form α·f(x) + β·g(x) ..., so I probably mistook glm for something else. Thank you for the correction. But if you decide to implement glm, I'll gladly accept patches.
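
For that form, one can already get most of the way there by building the design matrix from the basis functions and calling the existing olsRegress; a minimal sketch (fitBasis is just an illustrative name):

```haskell
import qualified Data.Vector.Unboxed as U
import Statistics.Regression (olsRegress)

-- Fit y ≈ α·f(x) + β·g(x) + c by ordinary least squares on a design matrix
-- whose columns are f and g evaluated at the sample points.  olsRegress
-- appends the intercept as the last coefficient and also returns R².
fitBasis :: (Double -> Double)        -- f
         -> (Double -> Double)        -- g
         -> U.Vector Double           -- sample points x
         -> U.Vector Double           -- observed responses y
         -> (U.Vector Double, Double) -- ([α, β, intercept], R²)
fitBasis f g xs ys = olsRegress [U.map f xs, U.map g xs] ys
```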


Usually measurement errors are presumed to be independent. Indeed, it's difficult to imagine a situation where they're correlated. Actually, this issue was intended more as a request for a convenient API rather than new algorithms.